Design Distributed Cache
Ref: original video
Calls to the data store may take a long time to execute or may consume a lot of system resources, so some data should be kept in memory and returned to the client faster.
If the data store is down or slow, it is still desirable to serve requests as usual.
Why is it called a "distributed" cache?
Because the data is too large to fit on a single machine and needs to be stored across several machines.
Refer to LC LRU cache implementation
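For reference, a minimal LRU cache sketch in Java in the spirit of that exercise; it leans on LinkedHashMap's access-order mode, so eviction of the least recently used entry comes almost for free. The class name and capacity are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache: LinkedHashMap with accessOrder=true keeps entries ordered
// by last access, and removeEldestEntry evicts once capacity is exceeded.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true);   // accessOrder = true: get() moves an entry to the tail
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used entry
    }
}
```

Usage: with capacity 2, after put(a), put(b), get(a), put(c), the entry for b is evicted because it is the least recently used.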
Isolation of resources between the service and the cache: the cache and the service no longer share memory and CPU.
Can be used by multiple services, e.g. the same cluster can be shared across several microservices our team owns.
Flexibility in choosing hardware
No extra hardware or operational cost
Scales together with the service
Mod hashing (hash(key) mod number_of_servers) is a poor choice: adding or removing a server remaps almost every key, effectively flushing the cache.
Consistent hashing minimizes the number of keys we need to re-hash: only the keys on the clockwise range claimed by a newly added host move (see the sketch below).
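A minimal consistent-hashing sketch, assuming an MD5-based ring stored in a TreeMap of server points; the virtual-node count and names are illustrative, not the exact scheme from the video.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Consistent-hashing ring: each server is placed on the ring several times
// ("virtual nodes") so keys spread more evenly and a dying server's load is
// split across many neighbors instead of a single one.
public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    public void addServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    public void removeServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(server + "#" + i));
        }
    }

    // Walk clockwise from the key's position to the first server point on the ring.
    public String serverFor(String key) {
        if (ring.isEmpty()) throw new IllegalStateException("no servers on the ring");
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xFF); // first 8 digest bytes
            return h;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```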
every time the list of cache hosts changes, we need to deploy the modified file to every service host
put the file in shared storage
service hosts poll for the file periodically
each service host needs a daemon process that polls the file periodically
drawback: we still need to maintain the file manually, and we don't know the health of the cache servers
discover cache hosts and monitor their health
each cache server registers itself with the configuration service and sends heartbeats to it periodically
as long as heartbeats keep coming, the server stays registered in the system; if heartbeats stop, the configuration service unregisters the cache server, which is no longer alive or has become inaccessible
every cache client grabs the list of registered cache servers from the configuration service
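A toy sketch of the registration/heartbeat idea, assuming a single in-memory registry class and a fixed heartbeat timeout; in a real system this role is played by a configuration service such as ZooKeeper, not by this class.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy registry: cache servers call heartbeat() periodically, cache clients call
// liveServers() to discover the currently registered (alive) cache hosts.
public class CacheHostRegistry {
    private static final long HEARTBEAT_TIMEOUT_MILLIS = 10_000; // assumed timeout
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    // Registration and heartbeat in one step: record when this host last checked in.
    public void heartbeat(String hostAndPort) {
        lastHeartbeat.put(hostAndPort, System.currentTimeMillis());
    }

    // Drop hosts whose heartbeats stopped, then return the remaining live servers.
    public List<String> liveServers() {
        long now = System.currentTimeMillis();
        lastHeartbeat.entrySet().removeIf(e -> now - e.getValue() > HEARTBEAT_TIMEOUT_MILLIS);
        return List.copyOf(lastHeartbeat.keySet());
    }
}
```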
Availability: the discussion usually comes down to replication!
asynchronous replication: favors eventual consistency
synchronous replication: favors strong consistency (at the cost of write latency)
keep replica copies for every shard; this makes hot shards easier to deal with: scale out by adding more read replicas
for leader election we can use a configuration service (e.g. ZooKeeper); it is the source of truth about whether a server is alive, and if the leader goes down it promotes another replica to leader
Redis implements Redis Sentinel for this purpose.
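For illustration, a hedged sketch of leader election on top of ZooKeeper, assuming Apache Curator's LeaderLatch recipe; the connection string, znode path, and replica id are placeholders, not values from the original notes.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ShardLeaderElection {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble; retries with exponential backoff.
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();

        // Every replica of a shard creates a latch on the same path;
        // ZooKeeper grants leadership to exactly one of them at a time.
        try (LeaderLatch latch = new LeaderLatch(zk, "/cache/shard-7/leader", "replica-a")) {
            latch.start();
            latch.await();   // blocks until this replica becomes the leader
            System.out.println("has leadership: " + latch.hasLeadership());
            // ... accept writes for this shard while leadership is held ...
        } finally {
            zk.close();
        }
    }
}
```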
Several things can cause inconsistency:
data is replicated asynchronously
clients may have different lists of cache servers (servers can go down and come back up, so a server may hold values that no other client can read)
We use LRU cache eviction. If the cache is not full, some items may sit in it for a long time and become stale.
To fix this, we can introduce metadata for each cache entry, including a time-to-live (TTL) attribute.
Option 1 (passively expire an item): the item is removed when a client tries to access it and it is found to be expired.
Option 2 (actively expire an item): create a maintenance thread that runs at regular intervals and removes expired items (see the sketch below).
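A sketch combining both options, assuming millisecond TTLs and a single background cleanup thread; the class and method names are illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// TTL handling with both expiration styles from the notes:
// passive (checked on get) and active (background cleanup at a fixed interval).
public class TtlCache<K, V> {
    private record Entry<T>(T value, long expiresAtMillis) {}

    private final Map<K, Entry<V>> store = new ConcurrentHashMap<>();
    private final ScheduledExecutorService cleaner = Executors.newSingleThreadScheduledExecutor();

    public TtlCache(long cleanupIntervalMillis) {
        // Option 2: actively expire items at a regular interval.
        cleaner.scheduleAtFixedRate(this::removeExpired,
                cleanupIntervalMillis, cleanupIntervalMillis, TimeUnit.MILLISECONDS);
    }

    public void put(K key, V value, long ttlMillis) {
        store.put(key, new Entry<>(value, System.currentTimeMillis() + ttlMillis));
    }

    // Option 1: passively expire an item when a client tries to read it.
    public V get(K key) {
        Entry<V> e = store.get(key);
        if (e == null) return null;
        if (System.currentTimeMillis() > e.expiresAtMillis()) {
            store.remove(key);
            return null;
        }
        return e.value();
    }

    private void removeExpired() {
        long now = System.currentTimeMillis();
        store.entrySet().removeIf(entry -> now > entry.getValue().expiresAtMillis());
    }
}
```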
implement support for a local cache inside the cache client
use an LRU cache or a 3rd-party library such as Guava Cache (see the example below)
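One possible local-cache configuration inside the client using Guava Cache; the size limit and expiration window are assumptions, chosen only to keep local copies from drifting too far from the distributed cache.

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.TimeUnit;

public class LocalCacheExample {
    public static void main(String[] args) {
        // Bounded in-process cache; entries expire shortly after write to stay fresh.
        Cache<String, String> localCache = CacheBuilder.newBuilder()
                .maximumSize(10_000)                    // assumed size limit
                .expireAfterWrite(1, TimeUnit.MINUTES)  // assumed freshness window
                .build();

        localCache.put("user:42", "{\"name\":\"Alice\"}");
        System.out.println(localCache.getIfPresent("user:42")); // null after expiry/eviction
    }
}
```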
use a firewall (with only the required ports open) to ensure only approved clients can access the cache
clients may also encrypt data before storing it in the cache and decrypt it on the way out
number of faults while calling the cache, latency, number of hits and misses, CPU and memory utilization on cache hosts, network I/O
logging should be small but useful: capture the details of every request to the cache.
e.g. who accessed the cache and when, what the key was, and which status code was returned (see the sample format below)
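One way such a log line could look; the fields and format here are assumptions for illustration, not a prescribed schema.

```java
public class CacheAccessLog {
    // One compact line per request: who, when, which key, the return status, latency.
    public static void log(String clientId, String key, String status, long latencyMicros) {
        System.out.printf("ts=%d client=%s key=%s status=%s latency_us=%d%n",
                System.currentTimeMillis(), clientId, key, status, latencyMicros);
    }

    public static void main(String[] args) {
        log("checkout-service", "user:42", "HIT", 310);
        log("checkout-service", "user:43", "MISS", 540);
    }
}
```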
responsibilities of a cache client:
maintain a list of cache servers
pick a shard to route a request to
handle a remote call and any potential failures
emit metrics
to make the client simpler:
option 1: introduce a proxy that sits between cache clients and cache servers and is responsible for picking a cache shard, e.g. the twemproxy project created by Twitter.
option 2: make cache servers responsible for picking a shard. The client sends a request to a random cache server; that server applies consistent hashing and redirects the request to the shard that stores the data. This idea is used by Redis Cluster.
major flaws of consistent hashing:
domino effect: when one server dies, all of its load is transferred to the next server on the circle; if that server cannot handle the extra load, it dies too, causing a chain reaction of failures
servers do not split the circle evenly: some points sit close together and others far apart, causing an uneven distribution of keys among the cache servers
To fix them:
add each server to the circle multiple times (virtual nodes), as in the consistent-hashing sketch above; alternatives include the Jump Hash algorithm (see below) or proportional hashing (an algorithm used by Yahoo!)
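For reference, the Jump Consistent Hash function from Lamping and Veach's paper "A Fast, Minimal Memory, Consistent Hash Algorithm"; it maps a 64-bit key to a numbered bucket with minimal remapping when buckets are added, so it suits setups where cache servers can be addressed by index rather than by ring position. The wrapper class and example key are illustrative.

```java
public class JumpHash {
    // Returns a bucket in [0, numBuckets); when numBuckets grows, only ~1/numBuckets
    // of keys move, and no extra memory is needed for a ring.
    public static int bucket(long key, int numBuckets) {
        long b = -1, j = 0;
        while (j < numBuckets) {
            b = j;
            key = key * 2862933555777941757L + 1;   // linear congruential step from the paper
            j = (long) ((b + 1) * ((double) (1L << 31) / (double) ((key >>> 33) + 1)));
        }
        return (int) b;
    }

    public static void main(String[] args) {
        System.out.println(bucket(123456789L, 10)); // stable bucket for this key
    }
}
```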