Designing Rate Limiter

Ref: https://www.youtube.com/watch?v=FU4WlwfS3G0

Problem Statement

  • A customer suddenly has a traffic spike, and we need to throttle them;

  • this is called the "noisy neighbor problem"

Can we scale up to handle this issue? Can we even use auto-scaling to handle the spike?

Not really. Even auto-scaling takes time. By the time the scaling process completes, it may already be too late.

How about using a max connections limit in the load balancer and a max thread count on the service endpoint? Do we still need throttling?

  • the load balancer can indeed reject additional requests that go beyond the threshold

  • however, it cannot differentiate between kinds of requests. Some requests finish quickly, while others are slow and take much longer. They cannot be treated equally and rejected based on thread count alone.

Is this an algorithm problem? It doesn't seem to be a distributed system design problem.

In the real world, each host processes a different number of requests. We need to know how many requests each host is processing, and we need a solution in which hosts share information about how many requests they have processed so far.

Requirements

  • easy integration should be another consideration, so that any company/architecture can use it

Single Host Case

The interviewer may now ask in three directions: the throttling algorithm, OOD, or a distributed throttling solution. Let's explore them one by one.

Throttling Algorithm

The token bucket algorithm is a simple one to implement. We can also refer to Google Guava's RateLimiter class.
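
For reference, here is a minimal usage sketch of Guava's RateLimiter (the rate of 4 permits per second is just an illustrative value):

```java
import com.google.common.util.concurrent.RateLimiter;

public class GuavaRateLimiterExample {
    public static void main(String[] args) {
        // Allow roughly 4 requests per second (illustrative value).
        RateLimiter limiter = RateLimiter.create(4.0);

        for (int i = 0; i < 10; i++) {
            // tryAcquire() returns immediately: true if a permit was available.
            if (limiter.tryAcquire()) {
                System.out.println("request " + i + " allowed");
            } else {
                System.out.println("request " + i + " throttled");
            }
        }
    }
}
```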

Token Bucket Algorithm

Properties for this algorithm(bucket):

  • Bucket has a maximum capacity

  • a count of how many tokens are currently in the bucket

  • when a request comes in, a token is taken out

  • tokens arrive at a constant rate
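
A minimal sketch of these properties in Java (class and field names are my own; refill is computed lazily from the elapsed time instead of using a background thread):

```java
public class TokenBucket {
    private final long maxCapacity;      // maximum number of tokens the bucket can hold
    private final long refillRatePerSec; // tokens added per second, at a constant rate
    private double currentTokens;        // how many tokens are in the bucket right now
    private long lastRefillNanos;        // timestamp of the last refill

    public TokenBucket(long maxCapacity, long refillRatePerSec) {
        this.maxCapacity = maxCapacity;
        this.refillRatePerSec = refillRatePerSec;
        this.currentTokens = maxCapacity;
        this.lastRefillNanos = System.nanoTime();
    }

    // Request comes in -> try to take one token out.
    public synchronized boolean allowRequest() {
        refill();
        if (currentTokens >= 1) {
            currentTokens -= 1;
            return true;   // request allowed
        }
        return false;      // no tokens left, throttle
    }

    // Tokens arrive at a constant rate; compute how many accumulated since the last call.
    private void refill() {
        long now = System.nanoTime();
        double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
        currentTokens = Math.min(maxCapacity, currentTokens + elapsedSeconds * refillRatePerSec);
        lastRefillNanos = now;
    }
}
```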

OOD
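
For the OOD direction, one possible sketch (all names here are my own): an interface for the limiter, the TokenBucket above, and a per-client cache of buckets, which reappears later in the synchronization discussion.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// One possible object design: a small interface plus a per-client bucket cache.
public interface ClientRateLimiter {
    boolean allowRequest(String clientId);
}

class TokenBucketRateLimiter implements ClientRateLimiter {
    // clientId -> that client's bucket (the "token bucket cache")
    private final Map<String, TokenBucket> bucketCache = new ConcurrentHashMap<>();
    private final long maxCapacity;
    private final long refillRatePerSec;

    TokenBucketRateLimiter(long maxCapacity, long refillRatePerSec) {
        this.maxCapacity = maxCapacity;
        this.refillRatePerSec = refillRatePerSec;
    }

    @Override
    public boolean allowRequest(String clientId) {
        // Create the bucket for this client lazily: one bucket per client.
        TokenBucket bucket = bucketCache.computeIfAbsent(
                clientId, id -> new TokenBucket(maxCapacity, refillRatePerSec));
        return bucket.allowRequest();
    }
}
```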

Distributed Throttling For Multiple Hosts

First look at this question

If the service can handle 4 requests/sec (per client), how many tokens should we put in each bucket?

  • Should we put 4/3 in each bucket (assuming 3 hosts)? No. The answer is 4 for each bucket, because all 4 requests could land on the same host (bucket).

  • once we have executed 4 requests, we should throttle the remaining ones

  • allow hosts to talk to each other, sharing how many tokens they have used altogether

  • in the above example, you can see that some hosts may sometimes take more requests than expected (a rough sketch of the sharing step follows)
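
A rough sketch of that sharing idea (not the exact scheme from the video; names and the per-second window are assumptions): each host serves requests from its full 4-token bucket, counts its local usage, and after every communication round subtracts the tokens that other hosts report as used.

```java
// Rough sketch: each host keeps the full per-client bucket, but after every sync round
// it removes the tokens that the other hosts report they have already spent.
public class SharedUsageBucket {
    private final long maxTokensPerWindow;  // e.g. 4 requests/sec per client
    private long tokensLeftInWindow;
    private long tokensUsedLocally;

    public SharedUsageBucket(long maxTokensPerWindow) {
        this.maxTokensPerWindow = maxTokensPerWindow;
        this.tokensLeftInWindow = maxTokensPerWindow;
    }

    public synchronized boolean allowRequest() {
        if (tokensLeftInWindow > 0) {
            tokensLeftInWindow--;
            tokensUsedLocally++;
            return true;
        }
        return false; // throttle
    }

    // Other hosts told us how many tokens they used; remove them from our bucket too.
    public synchronized void applyRemoteUsage(long tokensUsedByOthers) {
        tokensLeftInWindow = Math.max(0, tokensLeftInWindow - tokensUsedByOthers);
    }

    // Start a new one-second window and report how much this host consumed.
    public synchronized long rollWindowAndReportLocalUsage() {
        long used = tokensUsedLocally;
        tokensUsedLocally = 0;
        tokensLeftInWindow = maxTokensPerWindow;
        return used;
    }
}
```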

How do hosts communicate with each other?

  • Option 1: (full mesh) Tell everyone everything

  • every host in the cluster knows every other host in the cluster and shares messages with each of them

How do hosts discover each other?

  • method 1: a 3rd-party service that listens to heartbeats coming from every host. As long as heartbeats arrive, the service keeps the host registered; if they stop, it unregisters the host that is no longer alive. All hosts ask this 3rd-party service for info about all other hosts.

  • method 2: use a VIP. We can use the VIP info to get all the members. Or rely on the customer's config file, which needs to be deployed to all hosts in the cluster. Every time the config changes, it needs to be deployed again.

  • limitation of full message broadcasting: easy to implement but hard to scale

  • Option 2: Gossip protocol

  • random "peer selection": with a given frequency, each machine picks another machine at random and shares data

  • Option 3: Distributed cache

  • the distributed cache cluster is relatively small, and the service cluster can scale out independently

  • Option 4: Coordination service

  • the coordination service helps to choose a leader, which helps to decrease the number of messages broadcast within the cluster

  • the leader asks hosts to send their info; the leader then collects the info and sends it back to each of them

  • each host only needs to chat with the leader

  • Option 5: Random leader selection

  • randomly choose a leader

  • this may result in multiple leaders, but that does not affect the results
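
A rough sketch of the random peer selection step from the gossip option (the peer list, period, and send() call are placeholders):

```java
import java.util.List;
import java.util.Random;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of gossip-style sharing: at a fixed frequency, pick one random peer
// and share token-usage data with it.
public class GossipSharer {
    private final List<String> peers; // addresses of the other hosts in the cluster
    private final Random random = new Random();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public GossipSharer(List<String> peers) {
        this.peers = peers;
    }

    public void start(long periodMillis) {
        scheduler.scheduleAtFixedRate(this::gossipOnce, periodMillis, periodMillis,
                TimeUnit.MILLISECONDS);
    }

    private void gossipOnce() {
        if (peers.isEmpty()) {
            return;
        }
        // Random "peer selection": pick one other host and share our usage data with it.
        String peer = peers.get(random.nextInt(peers.size()));
        send(peer, collectLocalUsage());
    }

    private long collectLocalUsage() {
        // Placeholder: would report tokens consumed locally since the last round.
        return 0;
    }

    private void send(String peer, long usage) {
        // Placeholder: would send the data over TCP or UDP (see the protocol discussion below).
        System.out.println("gossip to " + peer + ": used " + usage + " tokens");
    }
}
```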

What about communication protocols?

  • TCP vs UDP

  • TCP guarantees delivery of data and also guarantees that packets will be delivered in the same order

  • UDP doesn't guarantee that you get all the packets, and order is not guaranteed either; but UDP is faster

How to choose them in this case?

  • It depends on your needs: if you want throttling to be accurate, go with TCP; if you want the service to be fast, choose UDP

Integrating the above artifacts with the service

Option 1: rate limiter runs as part of the service process

  • a rate limiter library is integrated with the service code

  • faster: no need to make inter-process calls

  • more resilient: there are no inter-process calls that can fail

Option 2: rate limiter runs as its own process (daemon)

  • two libraries: the daemon and a client; the client is integrated with the service code

  • programming language agnostic (the rate limiter can be implemented in a different language than the service)

  • the rate limiter uses its own memory space. The limiter may create many buckets in memory without impacting the service process

  • easier to deal with service team paranoia (will this library introduce bugs into our service?)

Both options are good. Which one to use depends on the scenario.

Other Questions

This service is insanely popular with millions of users. Does this mean millions of buckets are stored in memory?

In theory, millions of buckets can be created and stored in memory. For example, if millions of clients send requests within the same second, millions of buckets will be created at that point. However, if the TPS decreases later, those unused buckets can be removed from memory.
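
One way to sketch that cleanup (the idle threshold and names are my own): track when each bucket was last used and periodically evict the ones that have been idle too long.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of removing idle buckets from memory: remember when each bucket was last used
// and periodically evict the ones that have been idle longer than a threshold.
public class ExpiringBucketCache {
    private static final long IDLE_THRESHOLD_MILLIS = 60_000; // illustrative value

    private final Map<String, TokenBucket> buckets = new ConcurrentHashMap<>();
    private final Map<String, Long> lastUsedMillis = new ConcurrentHashMap<>();

    public TokenBucket getBucket(String clientId, long maxCapacity, long refillRatePerSec) {
        lastUsedMillis.put(clientId, System.currentTimeMillis());
        return buckets.computeIfAbsent(clientId,
                id -> new TokenBucket(maxCapacity, refillRatePerSec));
    }

    // Called periodically, e.g. from a background thread.
    public void evictIdleBuckets() {
        long now = System.currentTimeMillis();
        lastUsedMillis.forEach((clientId, lastUsed) -> {
            if (now - lastUsed > IDLE_THRESHOLD_MILLIS) {
                buckets.remove(clientId);
                lastUsedMillis.remove(clientId);
            }
        });
    }
}
```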

Tell me about some failure modes.

  • A daemon can fail, causing other hosts in the cluster to lose visibility of the failed daemon. A host with a failed daemon will continue throttling requests, just without talking to other hosts. Nothing really bad happens.

  • Similarly, several hosts in the cluster may fail to broadcast messages to the group. Fewer requests will be throttled (more are allowed).

It may become a nightmare to manage all these rule configurations, right?

We can create a self-service tool so that customers can create, update, and delete their rules as needed.

Will synchronization become the bottleneck?

  • synchronization in the token bucket: a better way to implement a thread-safe class is, for example, to use atomic references.

  • synchronization in the token bucket cache: if we store too many buckets in the cache and want to delete unused buckets and re-create them when needed, we end up needing synchronization. We can use ConcurrentHashMap, which is thread-safe.
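
A sketch of an atomic-reference-based token bucket: the state lives in an immutable object held by an AtomicReference, and updates use a compare-and-set loop instead of a lock.

```java
import java.util.concurrent.atomic.AtomicReference;

// Lock-free token bucket sketch: the mutable state is an immutable record held by
// an AtomicReference, and allowRequest() retries with compare-and-set on contention.
public class AtomicTokenBucket {
    private record State(double tokens, long lastRefillNanos) {}

    private final long maxCapacity;
    private final long refillRatePerSec;
    private final AtomicReference<State> state;

    public AtomicTokenBucket(long maxCapacity, long refillRatePerSec) {
        this.maxCapacity = maxCapacity;
        this.refillRatePerSec = refillRatePerSec;
        this.state = new AtomicReference<>(new State(maxCapacity, System.nanoTime()));
    }

    public boolean allowRequest() {
        while (true) {
            State current = state.get();
            long now = System.nanoTime();
            double elapsedSeconds = (now - current.lastRefillNanos()) / 1_000_000_000.0;
            double refilled = Math.min(maxCapacity,
                    current.tokens() + elapsedSeconds * refillRatePerSec);
            if (refilled < 1) {
                return false; // no token available, throttle
            }
            State next = new State(refilled - 1, now);
            // Retry if another thread changed the state in the meantime.
            if (state.compareAndSet(current, next)) {
                return true;
            }
        }
    }
}
```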

What should clients do when their requests are throttled?

  • exponential backoff and jitter: retry requests exponentially, increasing the waiting time between retries up to a maximum backoff time. In other words, we retry requests several times but wait a bit longer with every retry attempt. Jitter adds randomness to the retry intervals to spread out the load; without jitter, the backoff algorithm would retry requests at the same time, so jitter helps to separate the retries.
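
A minimal sketch of that client-side retry policy (the base delay, cap, and retry count are illustrative; this is the "full jitter" variant, where the wait is a random time up to the current exponential cap):

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of exponential backoff with jitter on the client side.
public class RetryWithBackoff {
    private static final long BASE_DELAY_MILLIS = 100;     // illustrative values
    private static final long MAX_BACKOFF_MILLIS = 10_000;
    private static final int MAX_RETRIES = 5;

    public static boolean callWithRetries(ThrottledCall call) throws InterruptedException {
        for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
            if (call.execute()) {
                return true; // request accepted
            }
            // Exponential cap: base * 2^attempt, bounded by the maximum backoff.
            long cap = Math.min(MAX_BACKOFF_MILLIS, BASE_DELAY_MILLIS * (1L << attempt));
            // Jitter: wait a random amount up to the cap so clients don't retry in lockstep.
            long sleepMillis = ThreadLocalRandom.current().nextLong(cap + 1);
            Thread.sleep(sleepMillis);
        }
        return false; // give up after the last retry
    }

    // Placeholder for the throttled operation; returns true if the request was accepted.
    public interface ThrottledCall {
        boolean execute();
    }
}
```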

One final look
