Design Distributed Cache

Ref: original video

Problem statement

Issues with this setup (a service calling the data store directly)

  • calls to the data store may take a long time to execute or consume a lot of system resources; frequently accessed data should be kept in memory and returned to clients faster.

  • if the data store is down or slow, it would be good to keep serving requests as normal

Why is it called a "distributed" cache?

  • Because the data is too large to fit on a single machine and must be stored across several machines.


Single server case

Cache eviction strategy - LRU

  • Refer to the LeetCode LRU cache implementation (hash map + doubly linked list, O(1) get and put); a sketch follows below
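
A minimal sketch of that implementation. Python's OrderedDict stands in for the usual hash map + doubly linked list pair, which is what gives O(1) get and put:

```python
from collections import OrderedDict

class LRUCache:
    """LeetCode-style LRU cache: O(1) get/put, evicts least recently used."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()  # keys kept in access order, oldest first

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark key as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used key
```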


Stepping into the distributed world

Dedicated cache cluster

  • Isolation of resources between service and cache: the cache and the service no longer share memory and CPU.

  • Can be used by multiple services, e.g. the same cluster can be utilized across several microservices our team owns

  • Flexibility in choosing hardware

Co-located cache

  • No extra hardware or operational cost

  • Scales together with the service

Choose a cache host to call (naive approach)

  • Mod hashing (server = hash(key) mod N) is a poor choice: adding or removing a host changes N, so almost every key suddenly maps to a different server and the cache is effectively flushed (illustrated below)
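
A quick demonstration of the problem: growing a hypothetical fleet from 4 to 5 servers under mod hashing remaps the vast majority of keys.

```python
import hashlib

def stable_hash(key: str) -> int:
    # Deterministic hash (Python's built-in hash() is randomized per process).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = [f"user:{i}" for i in range(10_000)]
moved = sum(1 for k in keys if stable_hash(k) % 4 != stable_hash(k) % 5)
print(f"{moved / len(keys):.0%} of keys map to a different server")  # ~80%
```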

Choose a cache host to call (consistent hashing)

  • minimizes the number of keys we need to re-hash: when a host is added, only the keys in the ring segment that the new host takes over move to it; all other keys stay put (a ring sketch follows below)
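
A minimal consistent hashing ring sketch (one point per server; the virtual-nodes refinement appears in the last section below):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hashing: a key belongs to the first server clockwise."""

    def __init__(self, servers):
        self.ring = sorted((self._hash(s), s) for s in servers)
        self.points = [point for point, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_server(self, key: str) -> str:
        # First server point clockwise from the key's hash, wrapping around.
        i = bisect.bisect(self.points, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["cache-1", "cache-2", "cache-3"])
print(ring.get_server("user:42"))  # same key always lands on the same server
```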

Who routes requests to the right cache server? - the cache client

Maintaining a list of cache servers

Option 1: Use configuration management tools

  • the list lives in a configuration file; every time it changes, the modified file needs to be deployed to every service host

Option 2: Use external storage (e.g. S3)

  • put the file in shared storage

  • service hosts poll for the file periodically

  • each service host needs a daemon process to poll the file periodically (a sketch follows below)

  • drawback: we still need to maintain the file manually, and we still don't know the health of the cache servers
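
A sketch of such a daemon, assuming boto3 and a made-up bucket and key name; each service host would run something like this in the background:

```python
import json
import threading
import time

import boto3

s3 = boto3.client("s3")
cache_servers = []  # shared with the cache client on this host

def poll_server_list(bucket="my-config-bucket", key="cache-servers.json",
                     interval=30):
    """Re-read the server list from S3 every `interval` seconds."""
    global cache_servers
    while True:
        obj = s3.get_object(Bucket=bucket, Key=key)
        cache_servers = json.loads(obj["Body"].read())
        time.sleep(interval)

threading.Thread(target=poll_server_list, daemon=True).start()
```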

Option 3: Configuration service (e.g. Zookeeper)

  • discover cache hosts and monitor their health

  • each cache server registers itself with the configuration service and sends heartbeats to it periodically

  • as long as heartbeats keep coming, the server stays registered; if heartbeats stop, the configuration service unregisters the cache server as no longer alive or inaccessible

  • every cache client grabs the list of registered cache servers from the configuration service (a sketch of both sides follows below)
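
A sketch of both sides using the kazoo ZooKeeper client (paths and addresses are made up). An ephemeral znode is deleted automatically when its owner's session, kept alive by heartbeats, expires, which gives exactly the register/unregister behavior described above:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181")
zk.start()  # kazoo keeps the session alive with background heartbeats

# --- On each cache server: register under an ephemeral znode. ---
zk.create("/cache/servers/cache-1", b"10.0.0.5:11211",
          ephemeral=True, makepath=True)

# --- On each cache client: watch the list of live servers. ---
@zk.ChildrenWatch("/cache/servers")
def on_servers_changed(children):
    print("live cache servers:", children)  # rebuild the hash ring here
```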


Achieving high availability

Availability generally comes down to replication!

Protocol Option 1: probabilistic protocols (e.g. gossip, epidemic broadcast trees, bimodal multicast)

  • favor eventual consistency

Protocol Option 2: consensus protocols (e.g. 2- or 3-phase commit, Paxos, Raft, chain replication)

  • favor strong consistency


Master-slave (leader-follower) replication

  • each shard is replicated: one leader takes writes and several followers serve reads. Hot shards are easy to deal with - scale out by adding more read replicas (a routing sketch follows below)

  • for leader election we can use a configuration service (like ZooKeeper). It is the source of truth about whether a server is alive; if the leader is down, it promotes another server to leader.

  • Redis also implements Redis Sentinel for this purpose.
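
A minimal sketch of how a client might route requests within one replicated shard (host names are illustrative):

```python
import random

class Shard:
    """One shard under leader-follower replication."""

    def __init__(self, leader, followers):
        self.leader = leader        # all writes go to the leader
        self.followers = followers  # read replicas

    def node_for_write(self):
        return self.leader

    def node_for_read(self):
        # Spread reads across the leader and all replicas; a hot shard
        # is handled by simply adding more followers to this list.
        return random.choice([self.leader] + self.followers)

shard = Shard("cache-1a", followers=["cache-1b", "cache-1c"])
print(shard.node_for_write(), shard.node_for_read())
```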


What else is important?

Consistency

  • Several reasons cause inconsistency:

    • data is replicated asynchronously

    • clients may have different lists of cache servers (cache servers may go down and come back up, so a server can hold values that no other client can read)

Data expiration

  • We use LRU eviction. If the cache is not full, some items may sit there for a long time and become stale.

  • To fix it, we can add metadata to each cache entry, including a time-to-live (TTL) attribute; a sketch combining both options follows the list below.

    • Option 1 (passively expire an item): remove the item when some client tries to access it and it is found to be expired

    • Option 2 (actively expire an item): create a maintenance thread that runs at regular intervals and removes expired items
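
A sketch combining both options: get() expires items passively, while a background reaper thread expires them actively (scanning every key is fine for a sketch; a production cache like Redis samples keys instead):

```python
import threading
import time

class TTLCache:
    def __init__(self):
        self.data = {}  # key -> (value, expiration timestamp)
        self.lock = threading.Lock()

    def put(self, key, value, ttl_seconds):
        with self.lock:
            self.data[key] = (value, time.time() + ttl_seconds)

    def get(self, key):
        with self.lock:
            item = self.data.get(key)
            if item is None:
                return None
            value, expires_at = item
            if time.time() >= expires_at:  # Option 1: passive expiration
                del self.data[key]
                return None
            return value

    def start_reaper(self, interval=5):
        def reap():  # Option 2: active expiration on a maintenance thread
            while True:
                time.sleep(interval)
                now = time.time()
                with self.lock:
                    expired = [k for k, (_, exp) in self.data.items()
                               if exp <= now]
                    for k in expired:
                        del self.data[k]
        threading.Thread(target=reap, daemon=True).start()
```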

Local and remote cache

  • implement support for a local cache inside the cache client (a sketch follows below)

  • use an LRU cache or a 3rd-party cache such as Guava Cache
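
A sketch of a client-side local cache in front of the remote cache, reusing the LRUCache sketch from earlier (`remote` stands for whatever network client talks to the cache cluster):

```python
class CachingClient:
    def __init__(self, remote, local_capacity=1000):
        self.remote = remote                   # client for the cache cluster
        self.local = LRUCache(local_capacity)  # in-process LRU (sketch above)

    def get(self, key):
        value = self.local.get(key)
        if value is not None:
            return value               # served straight from local memory
        value = self.remote.get(key)   # fall back to the distributed cache
        if value is not None:
            self.local.put(key, value)
        return value
```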

Security

  • use a firewall (with only the required ports open) to ensure only approved clients can access the cache

  • clients may also encrypt data before storing it in the cache and decrypt it on the way out (a sketch follows below)
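
A sketch of that client-side encryption using the cryptography package's Fernet; the cache servers only ever see ciphertext:

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # in practice, load this from a secret store
fernet = Fernet(key)

def cache_put(cache, k, value: bytes):
    cache.put(k, fernet.encrypt(value))  # encrypt on the way in

def cache_get(cache, k):
    token = cache.get(k)
    return fernet.decrypt(token) if token is not None else None  # decrypt out
```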

Monitoring and logging

  • metrics to monitor: number of faults while calling the cache, latency, number of hits and misses, CPU and memory utilization on cache hosts, network I/O

  • logging should be small but useful: capture the details of every request to the cache

    • e.g. who accessed the cache and when, what the key was, and the returned status code

Cache client

  • responsibilities of a cache client:

    • maintain a list of cache servers

    • pick a shard to route a request to

    • handle a remote call and any potential failures

    • emit metrics

  • make it simpler:

    • option 1: introduce a proxy that sits between cache clients and cache servers and is responsible for picking a cache shard, e.g. the twemproxy project created by Twitter

    • option 2: make cache servers responsible for picking a shard. The client sends a request to a random cache server; that server applies consistent hashing and redirects the request to the shard that stores the data. This idea is utilized by Redis Cluster (a sketch follows below).
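
A sketch of option 2's redirect flow, loosely modeled on Redis Cluster's MOVED response (class and method names are made up; HashRing is the consistent hashing sketch from earlier):

```python
class CacheServer:
    def __init__(self, name, ring, store=None):
        self.name = name
        self.ring = ring          # shared HashRing over all cache servers
        self.store = store if store is not None else {}

    def handle_get(self, key):
        owner = self.ring.get_server(key)
        if owner != self.name:
            # Not our shard: point the client at the server that owns the key.
            return ("MOVED", owner)
        return ("OK", self.store.get(key))
```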

Consistent hashing

  • major flaws:

    • domino effect: when a cache server dies, all of its load is transferred to the next server on the circle; if that server cannot handle the extra load, it dies too, potentially causing a chain reaction of failures

    • servers do not split the circle evenly: some points land close together and some far apart, causing uneven distribution of keys among the cache servers

  • To fix them:

    • place each server on the circle multiple times (virtual nodes); a sketch follows below. Alternatives: the Jump Hash algorithm, or proportional hashing (the algorithm used by Yahoo!).
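
A sketch of the virtual-nodes fix, extending the earlier HashRing: hashing each physical server onto the circle many times spreads its arcs evenly, which both balances keys and spreads a dead server's load across many survivors instead of one neighbor. (Jump Hash and proportional hashing are alternative algorithms, not shown here.)

```python
import bisect
import hashlib

class VNodeHashRing:
    def __init__(self, servers, vnodes=100):
        points = []
        for server in servers:
            for i in range(vnodes):
                # Place each server at `vnodes` points: "cache-1#0" .. "#99".
                points.append((self._hash(f"{server}#{i}"), server))
        self.ring = sorted(points)
        self.points = [point for point, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_server(self, key: str) -> str:
        i = bisect.bisect(self.points, self._hash(key)) % len(self.ring)
        return self.ring[i][1]
```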


Summary
