Designing Rate Limiter

Ref: https://www.youtube.com/watch?v=FU4WlwfS3G0

Problem Statement

  • A customer suddenly has a traffic spike, and we need to throttle them;

  • this is called the "noisy neighbor problem"

Can we scale up to handle this issue? Can we even use auto-scaling to handle the spike?

Not really. Even auto-scaling takes time. By the time the scaling process completes, it may already be too late.

How about using a max connections limit in the load balancer and a max thread count on the service endpoint? Do we still need throttling?

  • the load balancer can indeed reject additional requests that go beyond the threshold

  • however, it cannot differentiate between kinds of requests. Some requests finish quickly, while others are slow and take much longer. They cannot be treated equally and rejected based on thread count alone.

Is this an algorithm problem? It doesn't seem to be a distributed system design problem.

In the real world, each host processes a different number of requests. We need to know how many requests each host is processing, and we need a solution in which hosts share information about how many requests they have processed so far.

Requirements

  • easy integration should be another consideration, so that any company/architecture can use it

Single Host Case

The interviewer may now ask in three directions: the throttling algorithm, OOD, or a distributed throttling solution. Let's explore them one by one.

Throttling Algorithm

The token bucket algorithm is a simple one to implement. We can also refer to Google Guava's RateLimiter class.
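
For reference, here is a minimal usage sketch of Guava's RateLimiter (the rate of 4 permits per second is just an illustrative value):

```java
import com.google.common.util.concurrent.RateLimiter;

public class GuavaRateLimiterExample {
    public static void main(String[] args) {
        // Allow roughly 4 requests per second (illustrative value).
        RateLimiter limiter = RateLimiter.create(4.0);

        for (int i = 0; i < 10; i++) {
            // tryAcquire() returns immediately: true if a permit was available.
            if (limiter.tryAcquire()) {
                System.out.println("request " + i + " allowed");
            } else {
                System.out.println("request " + i + " throttled");
            }
        }
    }
}
```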

Token Bucket Algorithm

Properties for this algorithm(bucket):

  • Bucket has a maximum capacity

  • a count of how many tokens are currently in the bucket

  • when a request comes in, a token is taken out

  • tokens arrive at a constant rate
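
A minimal sketch of these properties in Java (class and field names are my own; refill is computed lazily from the elapsed time instead of using a background thread):

```java
public class TokenBucket {
    private final long maxCapacity;      // maximum number of tokens the bucket can hold
    private final long refillRatePerSec; // tokens added per second, at a constant rate
    private double currentTokens;        // how many tokens are in the bucket right now
    private long lastRefillNanos;        // timestamp of the last refill

    public TokenBucket(long maxCapacity, long refillRatePerSec) {
        this.maxCapacity = maxCapacity;
        this.refillRatePerSec = refillRatePerSec;
        this.currentTokens = maxCapacity;
        this.lastRefillNanos = System.nanoTime();
    }

    // Request comes in -> try to take one token out.
    public synchronized boolean allowRequest() {
        refill();
        if (currentTokens >= 1) {
            currentTokens -= 1;
            return true;   // request allowed
        }
        return false;      // no tokens left, throttle
    }

    // Tokens arrive at a constant rate; compute how many accumulated since the last call.
    private void refill() {
        long now = System.nanoTime();
        double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
        currentTokens = Math.min(maxCapacity, currentTokens + elapsedSeconds * refillRatePerSec);
        lastRefillNanos = now;
    }
}
```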

OOD
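
For the OOD direction, one possible sketch (all names here are my own): an interface for the limiter, the TokenBucket above, and a per-client cache of buckets, which reappears later in the synchronization discussion.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// One possible object design: a small interface plus a per-client bucket cache.
public interface ClientRateLimiter {
    boolean allowRequest(String clientId);
}

class TokenBucketRateLimiter implements ClientRateLimiter {
    // clientId -> that client's bucket (the "token bucket cache")
    private final Map<String, TokenBucket> bucketCache = new ConcurrentHashMap<>();
    private final long maxCapacity;
    private final long refillRatePerSec;

    TokenBucketRateLimiter(long maxCapacity, long refillRatePerSec) {
        this.maxCapacity = maxCapacity;
        this.refillRatePerSec = refillRatePerSec;
    }

    @Override
    public boolean allowRequest(String clientId) {
        // Create the bucket for this client lazily: one bucket per client.
        TokenBucket bucket = bucketCache.computeIfAbsent(
                clientId, id -> new TokenBucket(maxCapacity, refillRatePerSec));
        return bucket.allowRequest();
    }
}
```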

Distributed Throttling For Multiple Hosts

First look at this question

If the service can handle 4 requests/sec (per client), how many tokens should we put in each bucket?

  • Should we put 4/3 in each bucket (assuming 3 hosts)? No. The answer is 4 for each bucket, because all 4 requests could land on the same host (bucket).

  • once we have executed 4 requests, we should throttle the remaining ones

  • allow hosts to talk to each other, sharing how many tokens they have used altogether

  • in the above example, you can see that some hosts may sometimes take more requests than expected (a rough sketch of the sharing step follows)
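
A rough sketch of that sharing idea (not the exact scheme from the video; names and the per-second window are assumptions): each host serves requests from its full 4-token bucket, counts its local usage, and after every communication round subtracts the tokens that other hosts report as used.

```java
// Rough sketch: each host keeps the full per-client bucket, but after every sync round
// it removes the tokens that the other hosts report they have already spent.
public class SharedUsageBucket {
    private final long maxTokensPerWindow;  // e.g. 4 requests/sec per client
    private long tokensLeftInWindow;
    private long tokensUsedLocally;

    public SharedUsageBucket(long maxTokensPerWindow) {
        this.maxTokensPerWindow = maxTokensPerWindow;
        this.tokensLeftInWindow = maxTokensPerWindow;
    }

    public synchronized boolean allowRequest() {
        if (tokensLeftInWindow > 0) {
            tokensLeftInWindow--;
            tokensUsedLocally++;
            return true;
        }
        return false; // throttle
    }

    // Other hosts told us how many tokens they used; remove them from our bucket too.
    public synchronized void applyRemoteUsage(long tokensUsedByOthers) {
        tokensLeftInWindow = Math.max(0, tokensLeftInWindow - tokensUsedByOthers);
    }

    // Start a new one-second window and report how much this host consumed.
    public synchronized long rollWindowAndReportLocalUsage() {
        long used = tokensUsedLocally;
        tokensUsedLocally = 0;
        tokensLeftInWindow = maxTokensPerWindow;
        return used;
    }
}
```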

How do hosts communicate with each other?

  • Option 1: (full mesh) Tell everyone everything

  • every host in the cluster knows every other host in the cluster and shares messages with each of them

How do hosts discover each other?

  • method 1: a 3rd-party service that listens to heartbeats coming from every host. As long as heartbeats arrive, the service keeps the host registered; if they stop, it unregisters the host that is no longer alive. All hosts ask this 3rd-party service for info about all other hosts.

  • method 2: use a VIP. We can use the VIP info to get all the members. Or rely on the customer's config file, which needs to be deployed to all hosts in the cluster. Every time the config changes, it needs to be deployed again.

  • limitation of full message broadcasting: easy to implement but hard to scale

  • Option 2: Gossip protocol

  • random "peer selection": with a given frequency, each machine picks another machine at random and shares data

  • Option 3: Distributed cache

  • the distributed cache cluster is relatively small, and the service cluster can scale out independently

  • Option 4: Coordination service

  • the coordination service helps to choose a leader, which helps to decrease the number of messages broadcast within the cluster

  • the leader asks hosts to send their info; the leader then collects the info and sends it back to each of them

  • each host only needs to chat with the leader

  • Option 5: Random leader selection

  • randomly choose a leader

  • this may result in multiple leaders, but that does not affect the results
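
A rough sketch of the random peer selection step from the gossip option (the peer list, period, and send() call are placeholders):

```java
import java.util.List;
import java.util.Random;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of gossip-style sharing: at a fixed frequency, pick one random peer
// and share token-usage data with it.
public class GossipSharer {
    private final List<String> peers; // addresses of the other hosts in the cluster
    private final Random random = new Random();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public GossipSharer(List<String> peers) {
        this.peers = peers;
    }

    public void start(long periodMillis) {
        scheduler.scheduleAtFixedRate(this::gossipOnce, periodMillis, periodMillis,
                TimeUnit.MILLISECONDS);
    }

    private void gossipOnce() {
        if (peers.isEmpty()) {
            return;
        }
        // Random "peer selection": pick one other host and share our usage data with it.
        String peer = peers.get(random.nextInt(peers.size()));
        send(peer, collectLocalUsage());
    }

    private long collectLocalUsage() {
        // Placeholder: would report tokens consumed locally since the last round.
        return 0;
    }

    private void send(String peer, long usage) {
        // Placeholder: would send the data over TCP or UDP (see the protocol discussion below).
        System.out.println("gossip to " + peer + ": used " + usage + " tokens");
    }
}
```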

What about communication protocols?

  • TCP vs UDP

  • TCP guarantees delivery of data and also guarantees that packets will be delivered in the same order

  • UDP doesn't guarantee that you get all the packets, and order is not guaranteed either; but UDP is faster

How to choose them in this case?

  • It depends on your needs: if you want throttling to be accurate, go with TCP; if you want the service to be fast, choose UDP

Integrating the above artifacts with the service

Option 1: rate limiter runs as part of the service process

  • a rate limiter library is integrated with the service code

  • faster: no need to make inter-process calls

  • more resilient: there are no inter-process calls that can fail

Option 2: rate limiter runs as its own process (daemon)

  • two libraries: the daemon and a client; the client is integrated with the service code

  • programming language agnostic (the rate limiter can be implemented in a different language than the service)

  • the rate limiter uses its own memory space. The limiter may create many buckets in memory without impacting the service process

  • easier to deal with service team paranoia (will this library introduce bugs into our service?)

Both options are good. Which one to use depends on the scenario.

Other Questions

This service is insanely popular with millions of users. Does this mean millions of buckets are stored in memory?

In theory, millions of buckets can be created and stored in memory. For example, if millions of clients send requests within the same second, millions of buckets will be created at that point. However, if the TPS decreases later, those unused buckets can be removed from memory.
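
One way to sketch that cleanup (the idle threshold and names are my own): track when each bucket was last used and periodically evict the ones that have been idle too long.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of removing idle buckets from memory: remember when each bucket was last used
// and periodically evict the ones that have been idle longer than a threshold.
public class ExpiringBucketCache {
    private static final long IDLE_THRESHOLD_MILLIS = 60_000; // illustrative value

    private final Map<String, TokenBucket> buckets = new ConcurrentHashMap<>();
    private final Map<String, Long> lastUsedMillis = new ConcurrentHashMap<>();

    public TokenBucket getBucket(String clientId, long maxCapacity, long refillRatePerSec) {
        lastUsedMillis.put(clientId, System.currentTimeMillis());
        return buckets.computeIfAbsent(clientId,
                id -> new TokenBucket(maxCapacity, refillRatePerSec));
    }

    // Called periodically, e.g. from a background thread.
    public void evictIdleBuckets() {
        long now = System.currentTimeMillis();
        lastUsedMillis.forEach((clientId, lastUsed) -> {
            if (now - lastUsed > IDLE_THRESHOLD_MILLIS) {
                buckets.remove(clientId);
                lastUsedMillis.remove(clientId);
            }
        });
    }
}
```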

Tell me about some failure modes.

  • A daemon can fail, causing other hosts in the cluster to lose visibility of the failed daemon. A host with a failed daemon will continue throttling requests, just without talking to other hosts. Nothing really bad happens.

  • Similarly, several hosts in the cluster may fail to broadcast messages to the group. Fewer requests will be throttled (more are allowed).

It may become a nightmare to manage all these rule configurations, right?

We can create a self-service tool so that customers can create, update, and delete their rules as needed.

Will synchronization become the bottleneck?

  • synchronization in the token bucket: a better way to implement a thread-safe class is, for example, to use atomic references.

  • synchronization in the token bucket cache: if we store too many buckets in the cache and want to delete unused buckets and re-create them when needed, we end up needing synchronization. We can use ConcurrentHashMap, which is thread-safe.
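
A sketch of an atomic-reference-based token bucket: the state lives in an immutable object held by an AtomicReference, and updates use a compare-and-set loop instead of a lock.

```java
import java.util.concurrent.atomic.AtomicReference;

// Lock-free token bucket sketch: the mutable state is an immutable record held by
// an AtomicReference, and allowRequest() retries with compare-and-set on contention.
public class AtomicTokenBucket {
    private record State(double tokens, long lastRefillNanos) {}

    private final long maxCapacity;
    private final long refillRatePerSec;
    private final AtomicReference<State> state;

    public AtomicTokenBucket(long maxCapacity, long refillRatePerSec) {
        this.maxCapacity = maxCapacity;
        this.refillRatePerSec = refillRatePerSec;
        this.state = new AtomicReference<>(new State(maxCapacity, System.nanoTime()));
    }

    public boolean allowRequest() {
        while (true) {
            State current = state.get();
            long now = System.nanoTime();
            double elapsedSeconds = (now - current.lastRefillNanos()) / 1_000_000_000.0;
            double refilled = Math.min(maxCapacity,
                    current.tokens() + elapsedSeconds * refillRatePerSec);
            if (refilled < 1) {
                return false; // no token available, throttle
            }
            State next = new State(refilled - 1, now);
            // Retry if another thread changed the state in the meantime.
            if (state.compareAndSet(current, next)) {
                return true;
            }
        }
    }
}
```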

What should clients do when their requests are throttled?

  • exponential backoff and jitter: retry requests exponentially, increasing the waiting time between retries up to a maximum backoff time. In other words, we retry requests several times but wait a bit longer with every retry attempt. Jitter adds randomness to the retry intervals to spread out the load; without jitter, the backoff algorithm would retry requests at the same time, so jitter helps to separate the retries.
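
A minimal sketch of that client-side retry policy (the base delay, cap, and retry count are illustrative; this is the "full jitter" variant, where the wait is a random time up to the current exponential cap):

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of exponential backoff with jitter on the client side.
public class RetryWithBackoff {
    private static final long BASE_DELAY_MILLIS = 100;     // illustrative values
    private static final long MAX_BACKOFF_MILLIS = 10_000;
    private static final int MAX_RETRIES = 5;

    public static boolean callWithRetries(ThrottledCall call) throws InterruptedException {
        for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
            if (call.execute()) {
                return true; // request accepted
            }
            // Exponential cap: base * 2^attempt, bounded by the maximum backoff.
            long cap = Math.min(MAX_BACKOFF_MILLIS, BASE_DELAY_MILLIS * (1L << attempt));
            // Jitter: wait a random amount up to the cap so clients don't retry in lockstep.
            long sleepMillis = ThreadLocalRandom.current().nextLong(cap + 1);
            Thread.sleep(sleepMillis);
        }
        return false; // give up after the last retry
    }

    // Placeholder for the throttled operation; returns true if the request was accepted.
    public interface ThrottledCall {
        boolean execute();
    }
}
```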

One final look
