Design Distributed Cache


Ref:

Problem statement

Issues with this setup

  • Calls to the data store may take a long time to execute or use a lot of system resources. It would be better to keep some data in memory and return it to clients faster.

  • If the data store is down or slow, it would be good to keep serving requests as normal.

Why is it called a "distributed" cache?

  • Because the data is too large to fit on a single machine and needs to be stored across several machines.


Single server case

Cache eviction strategy - LRU

  • Refer to the LeetCode LRU cache implementation; a minimal sketch follows.
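
A minimal single-server LRU cache can be sketched in Java on top of LinkedHashMap's access-order mode (the class name and capacity handling below are illustrative; an interview implementation would typically hand-roll a hash map plus a doubly linked list):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU sketch built on LinkedHashMap's access-order mode.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true: get() moves an entry to the tail
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used entry on overflow
    }
}
```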


Stepping into the distributed world

Dedicated cache cluster

  • Isolation of resources between the service and the cache: they no longer share memory and CPU.

  • Can be used by multiple services, e.g. the same cluster can be shared across several microservices our team owns.

  • Flexibility in choosing hardware

Co-located cache

  • No extra hardware or operational cost

  • Scales together with the service

Choose a cache host to call (naive approach)

  • Mod hashing (serverIndex = hash(key) % numberOfServers) is a poor choice: when a server is added or removed, almost every key maps to a different server, invalidating most of the cache at once, as the sketch below shows.
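
A tiny illustrative sketch (the names are ours, not from the original notes) of why mod hashing breaks down when the fleet size changes:

```java
// Illustrative sketch of naive mod hashing.
public class ModHashing {
    static int serverFor(String key, int numServers) {
        return (key.hashCode() & 0x7fffffff) % numServers; // mask keeps the hash non-negative
    }

    public static void main(String[] args) {
        // Growing the fleet from 4 to 5 servers remaps most keys at once:
        for (String key : new String[]{"user:1", "user:2", "user:3", "user:4"}) {
            System.out.println(key + ": server " + serverFor(key, 4)
                    + " -> server " + serverFor(key, 5));
        }
    }
}
```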

Choose a cache host to call (consistent hashing)

  • minimizes the number of keys that need to be re-hashed: when a new host is added, only the keys that fall on its arc of the circle move to it, and every other key stays where it is (see the sketch below)
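
A sketch of the ring, assuming a TreeMap-based implementation and a placeholder hash function (production code would use something like MD5 or MurmurHash):

```java
import java.util.TreeMap;

// Sketch of a consistent hashing ring; hash() is a stand-in.
public class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    private int hash(String s) {
        return s.hashCode() & 0x7fffffff;
    }

    public void addServer(String server) {
        ring.put(hash(server), server);
    }

    public void removeServer(String server) {
        ring.remove(hash(server));
    }

    // Walk clockwise from the key's position to the first server on the ring.
    public String serverFor(String key) {
        if (ring.isEmpty()) return null;
        Integer h = ring.ceilingKey(hash(key));
        return ring.get(h != null ? h : ring.firstKey()); // wrap around the circle
    }
}
```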

Who routes each request to the right cache server? - Cache Client

Maintaining a list of cache servers

Option 1: Use configuration management tools

  • every change requires deploying the modified file to every service host

Option 2: Use external storage (e.g. S3)

  • put the file in the shared storage

  • service hosts poll the file periodically

  • each service host needs a daemon process to poll the data periodically

  • drawbacks: we still need to maintain the file manually, and we don't know the health of the cache servers

Option 3: Configuration service (e.g. Zookeeper)

  • discovers cache hosts and monitors their health

  • each cache server registers itself with the configuration service and sends heartbeats to it periodically

  • as long as heartbeats keep coming, the server stays registered in the system; if heartbeats stop, the configuration service unregisters the cache server, which is no longer alive or has become inaccessible

  • every cache client grabs the list of registered cache servers from the configuration service (a sketch of this option follows)
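
A rough sketch of Option 3 against ZooKeeper's API; the znode path is our assumption. Ephemeral znodes disappear automatically when a server's session (i.e. its heartbeats) stops, which gives exactly the register/unregister behavior described above:

```java
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch of cache-server discovery via ZooKeeper ephemeral znodes.
public class CacheRegistry {
    private static final String ROOT = "/cache-servers"; // assumed persistent parent znode

    // Called by each cache server at startup.
    static void register(ZooKeeper zk, String hostPort) throws Exception {
        zk.create(ROOT + "/" + hostPort, new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    // Called by cache clients; passing true leaves a watch that fires
    // when the membership list changes.
    static List<String> liveServers(ZooKeeper zk) throws Exception {
        return zk.getChildren(ROOT, true);
    }
}
```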


Achieving high availability

Achieving high availability is, in essence, a discussion of replication!

Protocol Option 1: probabilistic protocols (e.g. gossip, epidemic broadcast trees, bimodal multicast)

  • favor eventual consistency

Protocol Option 2: consensus protocols (e.g. 2- or 3-phase commit, Paxos, Raft, chain replication)

  • favor strong consistency


Master-slave (leader-follower) replication

  • replicas exist for every shard, which makes hot shards easy to deal with: scale reads out by adding more read replicas

  • for leader election we can use a configuration service (like ZooKeeper): it is the source of truth about whether a server is alive, and if the leader goes down it promotes a follower to be the new leader

  • Redis also implements Redis Sentinel for this purpose.


What else is important?

Consistency

  • Several things can cause inconsistency:

    • data is replicated asynchronously

    • clients may have different lists of cache servers (cache servers can go down and come back up, so a server may hold values that no other client can read)

Data expiration

  • We use LRU eviction, so if the cache is not full, some items may sit there for a long time and become stale.

  • To fix it, we can add metadata to each cache entry that includes a time-to-live (TTL) attribute. Both options are sketched in code after this list.

    • Option 1 (passively expire an item): remove the item when a client tries to access it and finds it expired.

    • Option 2 (actively expire an item): create a maintenance thread that runs at regular intervals and removes expired items.
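
A minimal sketch of both expiration options, assuming per-entry TTL metadata (class and method names are illustrative, not from the original notes):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of TTL metadata with passive and active expiration.
public class TtlCache<K, V> {
    private static class Entry<V> {
        final V value;
        final long expiresAtMillis;
        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<K, Entry<V>> map = new ConcurrentHashMap<>();

    public void put(K key, V value, long ttlMillis) {
        map.put(key, new Entry<>(value, System.currentTimeMillis() + ttlMillis));
    }

    // Option 1: passive expiration - drop the item when a read finds it stale.
    public V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null) return null;
        if (System.currentTimeMillis() > e.expiresAtMillis) {
            map.remove(key);
            return null;
        }
        return e.value;
    }

    // Option 2: active expiration - call this from a maintenance thread
    // that runs at regular intervals.
    public void evictExpired() {
        long now = System.currentTimeMillis();
        map.values().removeIf(e -> now > e.expiresAtMillis);
    }
}
```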

Local and remote cache

  • implement support for a local cache inside the cache client

  • use an LRU cache or a 3rd-party library such as Guava Cache (see the sketch below)
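
A small example of a local in-process cache using Guava's CacheBuilder (the size and TTL values are arbitrary illustrative choices):

```java
import java.util.concurrent.TimeUnit;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

// Local in-process cache inside the cache client, backed by Guava.
public class LocalCacheExample {
    public static void main(String[] args) {
        Cache<String, String> localCache = CacheBuilder.newBuilder()
                .maximumSize(10_000)                   // size-based eviction
                .expireAfterWrite(1, TimeUnit.MINUTES) // entries go stale after a minute
                .build();

        localCache.put("user:1", "Alice");
        String v = localCache.getIfPresent("user:1"); // null on a miss
        System.out.println(v);
    }
}
```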

Security

  • use a firewall (with only the required ports open) to ensure that only approved clients can access the cache

  • clients may also encrypt data before storing it in the cache and decrypt it on the way out

Monitoring and logging

  • number of faults while calling the cache, latency, number of hits and misses, CPU and memory utilization on cache hosts, network I/O

  • logging should be small but useful: capture the details of every request to the cache.

    • e.g. who accessed the cache and when, what the key was, and what status code was returned

Cache client

  • responsibilities of a cache client:

    • maintain a list of cache servers

    • pick a shard to route a request to

    • handle a remote call and any potential failures

    • emit metrics

  • make it simpler:

    • option 1: introduce a proxy that sits between cache clients and cache servers and is responsible for picking a cache shard, e.g. the twemproxy project created by Twitter.

    • option 2: make cache servers responsible for picking a shard. The client sends a request to a random cache server, and that server applies consistent hashing and redirects the request to the shard that stores the data. This idea is utilized by Redis Cluster. A sketch of the client-side responsibilities follows this list.
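
Pulling these responsibilities together, a cache client might look roughly like the sketch below; RingProvider and Transport are hypothetical interfaces standing in for the configuration service and the network layer, and ConsistentHashRing reuses the ring sketch from earlier on this page:

```java
// Hypothetical sketch of a cache client; not a definitive implementation.
public class CacheClient {
    interface RingProvider { ConsistentHashRing currentRing(); } // fed by the config service
    interface Transport { String get(String server, String key) throws Exception; }

    private final RingProvider ringProvider;
    private final Transport transport;

    public CacheClient(RingProvider ringProvider, Transport transport) {
        this.ringProvider = ringProvider;
        this.transport = transport;
    }

    // Pick a shard, make the remote call, and degrade failures to cache misses.
    public String get(String key) {
        String server = ringProvider.currentRing().serverFor(key); // pick a shard
        try {
            String value = transport.get(server, key);
            // emit hit/miss metrics here
            return value;
        } catch (Exception e) {
            // emit fault metrics here; a failed remote call behaves like a miss
            return null;
        }
    }
}
```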

Consistent hashing

  • major flaws:

    • domino effect: cache servers do not split the circle evenly, so when one server dies, all of its load is transferred to the next server on the ring; if that server cannot handle the extra load, it dies too, and the failures can cascade around the ring.

    • server positions may end up close together or far apart on the circle, causing an uneven distribution of keys among the cache servers.

  • To fix them: see the summary below.


Summary

Add each server to the circle multiple times (virtual nodes), so the load spreads more evenly and a dead server's keys scatter across many successors instead of overwhelming one, as sketched below. Alternatives include Proportional hashing (the algorithm used by Yahoo!) and the Jump Hash algorithm.
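
A sketch of the virtual-node variant of the earlier ring, where each physical server appears at several points on the circle (the replica count is an arbitrary illustrative parameter):

```java
import java.util.TreeMap;

// Sketch of virtual nodes: each physical server is placed on the circle
// `replicas` times, smoothing the key distribution and spreading the load
// of a dead server across many successors.
public class VirtualNodeRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int replicas;

    public VirtualNodeRing(int replicas) {
        this.replicas = replicas;
    }

    private int hash(String s) {
        return s.hashCode() & 0x7fffffff;
    }

    public void addServer(String server) {
        for (int i = 0; i < replicas; i++) {
            ring.put(hash(server + "#" + i), server); // one point per virtual node
        }
    }

    public void removeServer(String server) {
        for (int i = 0; i < replicas; i++) {
            ring.remove(hash(server + "#" + i));
        }
    }

    public String serverFor(String key) {
        if (ring.isEmpty()) return null;
        Integer h = ring.ceilingKey(hash(key));
        return ring.get(h != null ? h : ring.firstKey());
    }
}
```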