Design Distributed Message Queue

Ref: original video

Problem Statement

synchronous communication

  • Pros:

    • easier and faster to implement

  • Cons:

    • harder to deal with consumer service failures

asynchronous communication - add a queue
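A minimal in-process sketch of the decoupling a queue provides (the names here are illustrative, not part of any real system):

```python
from queue import Queue

# Hypothetical in-process stand-in for a distributed queue:
# the producer returns immediately after enqueueing, and the
# consumer processes messages at its own pace.
message_queue = Queue()

def produce(message):
    # The producer does not wait for the consumer; it only waits
    # for the enqueue (the "send message" call) to succeed.
    message_queue.put(message)

def consume():
    # The consumer pulls messages when it is ready; if a consumer
    # crashes, messages simply stay in the queue.
    return message_queue.get()

produce("order-created")
produce("order-paid")
assert consume() == "order-created"
```

A slow or failed consumer no longer blocks the producer, which is the main win over the synchronous call.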


Requirements


High-level Architecture


VIP and Load Balancer

What happens if LB goes down?

  • LB uses a primary and a secondary node.

  • The primary node accepts connections and serves requests, while the secondary node monitors the primary. If the primary node stops responding, the secondary node takes over.

What if traffic increases and reaches LB's limit?

  • To address scalability concerns, multiple VIPs can be used.

  • By spreading load balancers across several data centers, we improve both availability and performance.


FrontEnd Service


Metadata Service


Backend Service

Option 1: Leader-Follower relationship

Option 2: Small cluster of independent hosts


What else is important?

Queue creation and deletion

  • An API is a good option for controlling queue configuration parameters. Queue deletion is a bit controversial, as it may cause a lot of harm and must be executed with caution.
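A sketch of what such management APIs might look like; the function names and parameters below are hypothetical, not a real service's interface:

```python
# Stand-in for the queue configuration held by the Metadata service.
queues = {}

def create_queue(name, visibility_timeout=30, retention_days=4):
    # The API gives full control over configuration parameters.
    if name in queues:
        raise ValueError(f"queue {name!r} already exists")
    queues[name] = {
        "visibility_timeout": visibility_timeout,
        "retention_days": retention_days,
    }

def delete_queue(name, confirm=False):
    # Deletion is destructive, so require explicit confirmation.
    if not confirm:
        raise ValueError("queue deletion must be explicitly confirmed")
    del queues[name]

create_queue("orders", visibility_timeout=60)
```

Requiring an explicit `confirm` flag is one way to force the caller to acknowledge how destructive deletion is.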

Message deletion

  • One option is not to delete a message right after it is consumed. Messages can be deleted several days later by a background job. This approach is used by Kafka.

    • We need to maintain some kind of order for messages in the queue and keep track of the offset, which is the position of a message within the queue.

  • Another option is used by Amazon SQS. Messages are also not deleted immediately, but are marked as invisible, so that other consumers do not get an already-retrieved message.

    • The consumer that retrieved the message then needs to call a delete message API to remove the message from the backend host.

    • If the message is not explicitly deleted by the consumer, it becomes visible again and may be delivered and processed twice.
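The SQS-style visibility mechanism can be sketched as follows (this only illustrates the idea; the real SQS API is different):

```python
import time

class VisibilityQueue:
    """Toy queue that marks retrieved messages invisible
    instead of deleting them."""

    def __init__(self, visibility_timeout=30):
        self.visibility_timeout = visibility_timeout
        self.messages = {}         # message id -> body
        self.invisible_until = {}  # message id -> timestamp

    def send(self, msg_id, body):
        self.messages[msg_id] = body

    def receive(self):
        now = time.time()
        for msg_id, body in self.messages.items():
            if self.invisible_until.get(msg_id, 0) <= now:
                # Mark invisible instead of deleting, so the message
                # reappears if the consumer crashes before deleting it.
                self.invisible_until[msg_id] = now + self.visibility_timeout
                return msg_id, body
        return None

    def delete(self, msg_id):
        # The consumer acknowledges successful processing.
        self.messages.pop(msg_id, None)

q = VisibilityQueue(visibility_timeout=30)
q.send("m1", "hello")
msg_id, body = q.receive()
q.delete(msg_id)  # without this call, "m1" becomes visible again later
```

If `delete` is never called, the message becomes visible again after the timeout, which is exactly how a second delivery (and duplicate processing) can happen.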

Message replication

  • Synchronous replication: when a backend host receives a new message, it waits until the data is replicated to the other hosts. Only when replication has fully completed is a successful response returned to the producer.

    • Higher durability, but at the cost of higher latency for the send message operation.

  • Asynchronous replication: a response is returned to the producer as soon as the message is stored on a single backend host. The message is later replicated to the other hosts.

    • More performant, but with no guarantee that the message will survive a backend host failure.
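The trade-off can be sketched like this; `replicate` stands in for a network call to a replica host, and all names are illustrative:

```python
# Three backend hosts, each holding a copy of the message log.
replicas = [[], [], []]

def replicate(host_index, message):
    replicas[host_index].append(message)

def send_sync(message):
    # Synchronous: acknowledge the producer only after every replica
    # has the message -> durable, but latency includes all replicas.
    for i in range(len(replicas)):
        replicate(i, message)
    return "OK"

def send_async(message):
    # Asynchronous: acknowledge after one host stores the message;
    # the others catch up later, so a failure of host 0 before
    # replication completes can lose the message.
    replicate(0, message)
    # ...replication to the remaining hosts happens in the background
    return "OK"
```

Real systems often make this configurable per queue or per request (e.g. how many replica acknowledgements to wait for).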

Message delivery semantics

  • At most once: when messages may be lost but are never redelivered.

  • At least once: when messages are never lost but may be redelivered.

  • Exactly once: when each message is delivered once and only once.
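At-least-once is the most common choice in practice, and it pairs with idempotent consumers: a redelivered message must be safe to process twice. A hypothetical sketch of such a handler:

```python
# Track ids of already-processed messages so that a redelivery
# (possible under at-least-once semantics) has no extra effect.
processed_ids = set()

def handle(message_id, body):
    if message_id in processed_ids:
        return  # duplicate delivery: skip, keeping the handler idempotent
    processed_ids.add(message_id)
    # ...real processing (e.g. charging a payment) would happen here...

handle("m1", "charge $10")
handle("m1", "charge $10")  # redelivery: ignored, processed only once
```

Deduplicating by message id on the consumer side effectively turns at-least-once delivery into exactly-once processing for that consumer.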

Push vs. pull

  • Pull model: the consumer constantly sends retrieve message requests, and when a new message is available in the queue, it is sent back to the consumer.

  • Push model: the consumer does not constantly bombard the FrontEnd service with receive calls. Instead, the consumer is notified as soon as a new message arrives in the queue.

  • From the queue's side, pull is easier to implement than push. But from a consumer perspective, we need to do more work if we pull.
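The extra work on the pulling consumer is essentially a polling loop. A minimal sketch, where `receive_message()` is a hypothetical stand-in for a call to the FrontEnd service:

```python
import time

def receive_message():
    # Stand-in for the FrontEnd "receive" call; here the queue
    # is pretended to be empty.
    return None

def poll(max_attempts=3, wait_seconds=0.01):
    # The consumer repeatedly asks for messages ("pull"), sleeping
    # between empty responses to avoid hammering the FrontEnd.
    for _ in range(max_attempts):
        message = receive_message()
        if message is not None:
            return message
        time.sleep(wait_seconds)
    return None
```

Long polling (the server holding the request open until a message arrives or a timeout passes) is a common middle ground that keeps the pull model but cuts down on empty responses.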

FIFO

  • It's hard to maintain strict order. Some systems either do not guarantee strict order or have limitations on throughput when they do.

Security

  • Encryption in transit using SSL/TLS (i.e. HTTPS) helps to protect messages on the wire.

  • We may also encrypt messages while storing them on backend hosts.

Monitoring

  • We need to monitor the health of our distributed queue system and give customers the ability to track the state of their queues.

  • Each service we build has to emit metrics and write log data.
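A sketch of per-service metric emission; in a real system these counters would be pushed to a monitoring service rather than kept in a local dict, and the metric names are made up:

```python
import collections

# Local stand-in for a metrics backend.
metrics = collections.Counter()

def emit(metric_name, value=1):
    # Each service calls this for the operations it performs;
    # a monitoring system can then alert on rates and anomalies.
    metrics[metric_name] += value

emit("frontend.send_message.requests")
emit("backend.messages.stored")
emit("frontend.send_message.requests")
```

Counters like these also feed the customer-facing view, e.g. messages sent, received, and currently visible per queue.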


Final look

Is it scalable?

  • Yes. Every component is scalable. When load increases, we just add more load balancers, more FrontEnd hosts, more Metadata service cache shards, more backend clusters and hosts.

Is it highly available?

  • Yes. No single point of failure. Each component is deployed across several data centers.

Is it highly performant?

  • Yes, provided each individual microservice stays fast.

Is it durable?

  • Yes. We replicate data while storing and ensure messages are not lost during the transfer from a producer to a consumer.
