Design Distributed Cache
Ref: original video
Problem statement
Issues with this setup
calls to the data store may take a long time to execute or consume a lot of system resources. It would be better to store some data in memory and return it to clients faster.
if the data store is down or slow, it would be good to keep serving requests as normal
Why is it called a "distributed" cache?
Because the data is large and needs to be stored across several machines.
Single server case
Cache eviction strategy - LRU
Refer to LC LRU cache implementation
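A minimal single-host LRU cache can be sketched with Python's `OrderedDict`, mirroring the classic LeetCode-style implementation (class and method names are illustrative):

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()  # key -> value; dict order tracks recency

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used
```

Both `get` and `put` are O(1); the ordered dict plays the role of the usual hash-map-plus-doubly-linked-list combination.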
Stepping into the distributed world
Dedicated cache cluster
Isolation of resources between service and cache. Both the cache and the service do not share memory and CPU any more.
Can be used by multiple services. e.g. utilize the same cluster across several microservices our team owns
Flexibility in choosing hardware
Co-located cache
No extra hardware and operation cost
Scale together with the service
Choose a cache host to call (naive approach)
Mod hashing is a poor choice
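A quick sketch of why: under mod hashing (`server = hash(key) % N`), changing the number of hosts remaps most keys, causing a flood of cache misses. The host counts below are illustrative:

```python
# Count how many of 10,000 keys map to a different server when the
# cluster grows from 4 to 5 hosts under simple mod hashing.
N_KEYS = 10_000
moved = sum(1 for k in range(N_KEYS) if k % 4 != k % 5)
print(f"{moved / N_KEYS:.0%} of keys change servers")  # prints "80% of keys change servers"
```

Almost the entire cache effectively empties after a single scaling event, which is exactly what consistent hashing avoids.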
Choose a cache host to call (consistent hashing)
minimizes the number of keys we need to re-hash: only the keys on the clockwise arc taken over by a newly added host are re-hashed
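A minimal consistent-hashing ring might look like the sketch below (the MD5-based hash and class names are assumptions for illustration, not a specific library's API):

```python
import bisect
import hashlib

def _hash(s: str) -> int:
    # Stable hash so key placement doesn't change between runs or processes.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring: each host occupies one point on the circle."""

    def __init__(self, hosts):
        self.points = sorted((_hash(h), h) for h in hosts)
        self._keys = [p for p, _ in self.points]

    def get_host(self, key: str) -> str:
        # First host clockwise from the key's position (wrap around at the end).
        i = bisect.bisect(self._keys, _hash(key)) % len(self.points)
        return self.points[i][1]
```

Adding a host inserts one point on the circle, so only keys on the arc that the new host takes over change owners; everything else stays put.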
Who routes these requests to the cache servers? - Cache client
Maintaining a list of cache servers
Option 1: Use configuration management tools
every time the list changes, we need to deploy the modified file to every service host
Option 2: Use external storage (e.g. S3)
put the file to the shared storage
service hosts poll for the file periodically
each service host needs a daemon process to poll data periodically
the drawback is that we still need to maintain the file manually; we also don't know the health of the cache servers
Option 3: Configuration service (e.g. Zookeeper)
discover cache hosts and monitor their health
each cache server registers itself with the configuration service and sends heartbeats to it periodically
as long as heartbeats come, the server stays registered; if heartbeats stop coming, the configuration service unregisters the cache server that is no longer alive or inaccessible
every cache client grabs the list of registered cache servers from the configuration service
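The register/heartbeat/unregister cycle can be sketched with a toy in-memory stand-in for the configuration service (this is not the ZooKeeper API; class names and the timeout value are illustrative, and the `now` parameter just makes the sketch testable without real clocks):

```python
import time

class ConfigService:
    """Toy stand-in for a configuration service like ZooKeeper:
    cache servers heartbeat periodically; hosts that miss the
    deadline are dropped from the registered list."""

    def __init__(self, heartbeat_timeout: float = 10.0):
        self.timeout = heartbeat_timeout
        self.last_seen = {}  # host -> timestamp of its last heartbeat

    def heartbeat(self, host: str, now: float = None):
        # First heartbeat registers the host; later ones keep it alive.
        self.last_seen[host] = time.time() if now is None else now

    def registered_hosts(self, now: float = None):
        now = time.time() if now is None else now
        # Unregister hosts whose heartbeats stopped coming.
        self.last_seen = {h: t for h, t in self.last_seen.items()
                          if now - t <= self.timeout}
        return sorted(self.last_seen)
```

Cache clients would call `registered_hosts` periodically and rebuild their consistent-hashing ring whenever the list changes.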
Achieving high availability
availability is generally a discussion of replication!
Protocol Option 1: probabilistic protocols (e.g. gossip, epidemic broadcast trees, bimodal multicast)
favor eventual consistency
Protocol Option 2: consensus protocols (e.g. 2-phase or 3-phase commit, Paxos, Raft, chain replication)
favor strong consistency
Master-slave (leader-follower) replication
every shard is replicated, which makes hot shards easy to deal with - scale out by adding more read replicas
for leader election: we can use a configuration service (like ZooKeeper). It's the source of truth about whether a server is alive or not. If the leader is down, it promotes a follower to be the new leader.
Redis also implemented Redis Sentinel for this purpose.
What else is important?
Consistency
Several reasons cause inconsistency:
replicate data asynchronously
clients have different lists of cache servers (cache servers may go down and come back up, so a client with a stale list may write values to a server that no other clients can read)
Data expiration
We use LRU cache eviction. If the cache is not full, some items may sit there for a long time and become stale.
To fix it: we could introduce some metadata for a cache entry and include a time-to-live attribute.
Option 1 (passively expire an item): remove the item when some client tries to access it and it is found to be expired.
Option 2 (actively expire an item): create a maintenance thread that runs at regular intervals and removes expired items
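Both options can be sketched together (a hypothetical API; the `now` parameter exists only to make the sketch testable without waiting on real clocks):

```python
import time

class TTLCache:
    """Sketch of passive + active TTL expiration for cache entries."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expiration timestamp)

    def put(self, key, value, now=None):
        now = time.time() if now is None else now
        self.store[key] = (value, now + self.ttl)

    def get(self, key, now=None):
        # Option 1: passive expiration, checked on access.
        now = time.time() if now is None else now
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            del self.store[key]  # expired: drop it and report a miss
            return None
        return value

    def sweep(self, now=None):
        # Option 2: active expiration, run periodically by a maintenance thread.
        now = time.time() if now is None else now
        for key in [k for k, (_, exp) in self.store.items() if now >= exp]:
            del self.store[key]
```

In practice the two options are combined: passive checks guarantee clients never see stale data, while the periodic sweep reclaims memory for items nobody accesses.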
Local and remote cache
implement support for a local cache inside the cache client
use an LRU cache or a 3rd-party cache like Guava Cache.
Security
use a firewall (with only the required ports open) to ensure only approved clients can access the cache.
clients may also encrypt data before storing it in the cache and decrypt it on the way out.
Monitoring and logging
number of faults while calling the cache, latency, number of hits and misses, CPU and memory utilization on cache hosts, network I/O
logging should be small but useful: capture the details of every request to the cache.
e.g.: who and when accessed the cache; what was the key and return status code
Cache client
responsibilities of a cache client:
maintain a list of cache servers
pick a shard to route a request to
handle a remote call and any potential failures
emit metrics
make it simpler:
option 1: introduce a proxy that sits between cache clients and cache servers and is responsible for picking a cache shard. e.g. the twemproxy project created by Twitter.
option 2: make cache servers responsible for picking a shard. The client sends a request to a random cache server; that server applies consistent hashing and redirects the request to the shard that stores the data. This idea is utilized by Redis Cluster.
Consistent hashing
major flaws:
domino effect: when one server dies, all of its load is transferred to the next server on the circle; if that server cannot handle the extra load, it dies too, which may cause a chain reaction of failures.
cache servers do not split the circle evenly: some reside close together and some far apart, causing uneven distribution of keys among the cache servers.
To fix them:
add each server to the circle multiple times (virtual nodes). Alternatives: the Jump Hash algorithm, or proportional hashing (an algorithm used by Yahoo!).
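Placing each server at many points can be sketched as follows (the vnode count and MD5-based hash are illustrative assumptions); with enough virtual nodes the arcs even out, and a failed host's load spreads across many survivors instead of hitting one neighbor:

```python
import bisect
import hashlib

def _hash(s: str) -> int:
    # Stable hash so placement doesn't change between runs or processes.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class VNodeRing:
    """Consistent-hash ring where each host is placed at many virtual points."""

    def __init__(self, hosts, vnodes=100):
        # Each host occupies `vnodes` points, e.g. "cache-a#0", "cache-a#1", ...
        self.points = sorted(
            (_hash(f"{host}#{i}"), host) for host in hosts for i in range(vnodes)
        )
        self._keys = [p for p, _ in self.points]

    def get_host(self, key: str) -> str:
        # First point clockwise from the key's position (wrap around at the end).
        i = bisect.bisect(self._keys, _hash(key)) % len(self.points)
        return self.points[i][1]
```

With one point per host a single server can own an arbitrarily large arc; with ~100 points per host each host's total share concentrates around 1/N of the circle.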
Summary