Design Distributed Cache
Ref: original video
Calls to the data store may take a long time to execute or may consume a lot of system resources, so some data should be kept in memory and returned to the client faster.
If the data store is down or slow, it is still desirable to serve requests as usual.
Why is it called a "distributed" cache?
Because the data is too large to fit on a single machine and needs to be stored across several machines.
Refer to LC LRU cache implementation
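For reference, a minimal LRU cache sketch in Java in the spirit of that exercise; it leans on LinkedHashMap's access-order mode, so eviction of the least recently used entry comes almost for free. The class name and capacity are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache: LinkedHashMap with accessOrder=true keeps entries ordered
// by last access, and removeEldestEntry evicts once capacity is exceeded.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true);   // accessOrder = true: get() moves an entry to the tail
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used entry
    }
}
```

Usage: with capacity 2, after put(a), put(b), get(a), put(c), the entry for b is evicted because it is the least recently used.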
Isolation of resources between the service and the cache: the cache and the service no longer share memory and CPU.
Can be used by multiple services, e.g. the same cluster can be shared across several microservices our team owns.
Flexibility in choosing hardware
No extra hardware or operational cost
Scales together with the service
Mod hashing (hash(key) mod number_of_servers) is a poor choice: adding or removing a server remaps almost every key, effectively flushing the cache.
Consistent hashing minimizes the number of keys we need to re-hash: only the keys on the clockwise range claimed by a newly added host move (see the sketch below).
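A minimal consistent-hashing sketch, assuming an MD5-based ring stored in a TreeMap of server points; the virtual-node count and names are illustrative, not the exact scheme from the video.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Consistent-hashing ring: each server is placed on the ring several times
// ("virtual nodes") so keys spread more evenly and a dying server's load is
// split across many neighbors instead of a single one.
public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    public void addServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    public void removeServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(server + "#" + i));
        }
    }

    // Walk clockwise from the key's position to the first server point on the ring.
    public String serverFor(String key) {
        if (ring.isEmpty()) throw new IllegalStateException("no servers on the ring");
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xFF); // first 8 digest bytes
            return h;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```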
every time the list of cache hosts changes, we need to deploy the modified file to every service host
put the file in shared storage
service hosts poll for the file periodically
each service host needs a daemon process that polls the file periodically
drawback: we still need to maintain the file manually, and we don't know the health of the cache servers
discover cache hosts and monitor their health
each cache server registers itself with the configuration service and sends heartbeats to it periodically
as long as heartbeats keep coming, the server stays registered in the system; if heartbeats stop, the configuration service unregisters the cache server, which is no longer alive or has become inaccessible
every cache client grabs the list of registered cache servers from the configuration service
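A toy sketch of the registration/heartbeat idea, assuming a single in-memory registry class and a fixed heartbeat timeout; in a real system this role is played by a configuration service such as ZooKeeper, not by this class.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy registry: cache servers call heartbeat() periodically, cache clients call
// liveServers() to discover the currently registered (alive) cache hosts.
public class CacheHostRegistry {
    private static final long HEARTBEAT_TIMEOUT_MILLIS = 10_000; // assumed timeout
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    // Registration and heartbeat in one step: record when this host last checked in.
    public void heartbeat(String hostAndPort) {
        lastHeartbeat.put(hostAndPort, System.currentTimeMillis());
    }

    // Drop hosts whose heartbeats stopped, then return the remaining live servers.
    public List<String> liveServers() {
        long now = System.currentTimeMillis();
        lastHeartbeat.entrySet().removeIf(e -> now - e.getValue() > HEARTBEAT_TIMEOUT_MILLIS);
        return List.copyOf(lastHeartbeat.keySet());
    }
}
```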
Availability: the discussion usually comes down to replication!
asynchronous replication: favors eventual consistency
synchronous replication: favors strong consistency (at the cost of write latency)
keep replica copies for every shard; this makes hot shards easier to deal with: scale out by adding more read replicas
for leader election we can use a configuration service (e.g. ZooKeeper); it is the source of truth about whether a server is alive, and if the leader goes down it promotes another replica to leader
Redis implements Redis Sentinel for this purpose.
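For illustration, a hedged sketch of leader election on top of ZooKeeper, assuming Apache Curator's LeaderLatch recipe; the connection string, znode path, and replica id are placeholders, not values from the original notes.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ShardLeaderElection {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble; retries with exponential backoff.
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();

        // Every replica of a shard creates a latch on the same path;
        // ZooKeeper grants leadership to exactly one of them at a time.
        try (LeaderLatch latch = new LeaderLatch(zk, "/cache/shard-7/leader", "replica-a")) {
            latch.start();
            latch.await();   // blocks until this replica becomes the leader
            System.out.println("has leadership: " + latch.hasLeadership());
            // ... accept writes for this shard while leadership is held ...
        } finally {
            zk.close();
        }
    }
}
```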
Several things can cause inconsistency:
data is replicated asynchronously
clients may have different lists of cache servers (servers can go down and come back up, so a server may hold values that no other client can read)
We use LRU cache eviction. If the cache is not full, some items may sit in it for a long time and become stale.
To fix this, we can introduce metadata for each cache entry, including a time-to-live (TTL) attribute.
Option 1 (passively expire an item): the item is removed when a client tries to access it and it is found to be expired.
Option 2 (actively expire an item): create a maintenance thread that runs at regular intervals and removes expired items (see the sketch below).
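A sketch combining both options, assuming millisecond TTLs and a single background cleanup thread; the class and method names are illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// TTL handling with both expiration styles from the notes:
// passive (checked on get) and active (background cleanup at a fixed interval).
public class TtlCache<K, V> {
    private record Entry<T>(T value, long expiresAtMillis) {}

    private final Map<K, Entry<V>> store = new ConcurrentHashMap<>();
    private final ScheduledExecutorService cleaner = Executors.newSingleThreadScheduledExecutor();

    public TtlCache(long cleanupIntervalMillis) {
        // Option 2: actively expire items at a regular interval.
        cleaner.scheduleAtFixedRate(this::removeExpired,
                cleanupIntervalMillis, cleanupIntervalMillis, TimeUnit.MILLISECONDS);
    }

    public void put(K key, V value, long ttlMillis) {
        store.put(key, new Entry<>(value, System.currentTimeMillis() + ttlMillis));
    }

    // Option 1: passively expire an item when a client tries to read it.
    public V get(K key) {
        Entry<V> e = store.get(key);
        if (e == null) return null;
        if (System.currentTimeMillis() > e.expiresAtMillis()) {
            store.remove(key);
            return null;
        }
        return e.value();
    }

    private void removeExpired() {
        long now = System.currentTimeMillis();
        store.entrySet().removeIf(entry -> now > entry.getValue().expiresAtMillis());
    }
}
```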
implement support for a local cache inside the cache client
use an LRU cache or a 3rd-party library such as Guava Cache (see the example below)
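One possible local-cache configuration inside the client using Guava Cache; the size limit and expiration window are assumptions, chosen only to keep local copies from drifting too far from the distributed cache.

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.TimeUnit;

public class LocalCacheExample {
    public static void main(String[] args) {
        // Bounded in-process cache; entries expire shortly after write to stay fresh.
        Cache<String, String> localCache = CacheBuilder.newBuilder()
                .maximumSize(10_000)                    // assumed size limit
                .expireAfterWrite(1, TimeUnit.MINUTES)  // assumed freshness window
                .build();

        localCache.put("user:42", "{\"name\":\"Alice\"}");
        System.out.println(localCache.getIfPresent("user:42")); // null after expiry/eviction
    }
}
```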
use a firewall (with only the required ports open) to ensure only approved clients can access the cache
clients may also encrypt data before storing it in the cache and decrypt it on the way out
number of faults while calling the cache, latency, number of hits and misses, CPU and memory utilization on cache hosts, network I/O
logging should be small but useful: capture the details of every request to the cache.
e.g. who accessed the cache and when, what the key was, and which status code was returned (see the sample format below)
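One way such a log line could look; the fields and format here are assumptions for illustration, not a prescribed schema.

```java
public class CacheAccessLog {
    // One compact line per request: who, when, which key, the return status, latency.
    public static void log(String clientId, String key, String status, long latencyMicros) {
        System.out.printf("ts=%d client=%s key=%s status=%s latency_us=%d%n",
                System.currentTimeMillis(), clientId, key, status, latencyMicros);
    }

    public static void main(String[] args) {
        log("checkout-service", "user:42", "HIT", 310);
        log("checkout-service", "user:43", "MISS", 540);
    }
}
```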
responsibilities of a cache client:
maintain a list of cache servers
pick a shard to route a request to
handle a remote call and any potential failures
emit metrics
to make the client simpler:
option 1: introduce a proxy that sits between cache clients and cache servers and is responsible for picking a cache shard, e.g. the twemproxy project created by Twitter.
option 2: make cache servers responsible for picking a shard. The client sends a request to a random cache server; that server applies consistent hashing and redirects the request to the shard that stores the data. This idea is used by Redis Cluster.
major flaws of consistent hashing:
domino effect: when one server dies, all of its load is transferred to the next server on the circle; if that server cannot handle the extra load, it dies too, causing a chain reaction of failures
servers do not split the circle evenly: some points sit close together and others far apart, causing an uneven distribution of keys among the cache servers
To fix them:
add each server to the circle multiple times (virtual nodes), as in the consistent-hashing sketch above; alternatives include the Jump Hash algorithm (see below) or proportional hashing (an algorithm used by Yahoo!)
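For reference, the Jump Consistent Hash function from Lamping and Veach's paper "A Fast, Minimal Memory, Consistent Hash Algorithm"; it maps a 64-bit key to a numbered bucket with minimal remapping when buckets are added, so it suits setups where cache servers can be addressed by index rather than by ring position. The wrapper class and example key are illustrative.

```java
public class JumpHash {
    // Returns a bucket in [0, numBuckets); when numBuckets grows, only ~1/numBuckets
    // of keys move, and no extra memory is needed for a ring.
    public static int bucket(long key, int numBuckets) {
        long b = -1, j = 0;
        while (j < numBuckets) {
            b = j;
            key = key * 2862933555777941757L + 1;   // linear congruential step from the paper
            j = (long) ((b + 1) * ((double) (1L << 31) / (double) ((key >>> 33) + 1)));
        }
        return (int) b;
    }

    public static void main(String[] args) {
        System.out.println(bucket(123456789L, 10)); // stable bucket for this key
    }
}
```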