Availability vs consistency


CAP Theorem

The CAP Theorem states that, in a distributed system (a collection of interconnected nodes that share data), you can only have two of the following three guarantees across a write/read pair: Consistency, Availability, and Partition Tolerance; one of them must be sacrificed.

In short: of the three (consistency, availability, partition tolerance), you can have at most two.

  • Consistency - A read is guaranteed to return the most recent write for a given client. (Reads return the latest record, never stale data.)

  • Availability - A non-failing node will return a reasonable response within a reasonable amount of time (no error or timeout).

  • Partition Tolerance - The system will continue to function when network partitions occur.

A common misconception is that the network is always reliable; in reality it is not, and network partitions will always occur. Partition tolerance is therefore not optional: the real trade-off is between consistency and availability.

CP - Consistency/Partition Tolerance

When a partition occurs, the system can wait for a response from the partitioned node, which could result in a timeout error, or it can return an error outright, depending on the scenario you desire. Choose Consistency over Availability when your business requirements dictate atomic reads and writes.

AP - Availability/Partition Tolerance

Return the most recent version of the data you have, which could be stale. A system in this state will also accept writes that can be processed later, once the partition is resolved. Choose Availability over Consistency when your business requirements allow some flexibility around when the data in the system synchronizes, i.e., when eventual consistency is acceptable. Availability is also a compelling option when the system needs to keep functioning in spite of external errors (shopping carts, etc.).
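To make the trade-off concrete, here is a minimal sketch of how a single replica might answer reads during a partition under each choice. The class and method names are hypothetical, not any real database's API:

```python
import time

# Minimal sketch (hypothetical names, not a real database client) of how
# a replica might answer reads while partitioned from its leader.

class Replica:
    def __init__(self):
        self.value = "v1"          # last locally known value (may be stale)
        self.partitioned = False   # True while cut off from the leader

    def read_cp(self, timeout_s=0.1):
        """CP: refuse to answer unless the latest value can be confirmed."""
        if self.partitioned:
            time.sleep(timeout_s)  # wait for the leader, then give up
            raise TimeoutError("partitioned: cannot confirm latest write")
        return self.value

    def read_ap(self):
        """AP: always answer, even if the value might be stale."""
        return self.value

replica = Replica()
replica.partitioned = True
print(replica.read_ap())         # "v1": possibly stale, but available
try:
    replica.read_cp()
except TimeoutError as e:
    print("CP read failed:", e)  # consistency preserved via an error
```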

Consistency patterns

Weak consistency

After a write, reads may or may not see it. A best-effort approach is taken.

This approach is seen in systems such as memcached. Weak consistency works well in real-time use cases such as VoIP, video chat, and real-time multiplayer games. For example, if you are on a phone call and lose reception for a few seconds, when you regain the connection you do not hear what was spoken during the connection loss.
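As a toy illustration of this "may or may not see it" behavior (the names below are hypothetical, not the memcached API), consider a cache-aside reader that only refreshes on a miss:

```python
# Toy illustration of weak consistency: the write lands in the
# datastore, but readers hit a cache that is only refreshed on a miss,
# so a later read may or may not observe the write.

datastore = {"greeting": "hello"}
cache = {}

def read(key):
    if key not in cache:              # miss: fetch and remember
        cache[key] = datastore.get(key)
    return cache[key]                 # hit: may be stale

print(read("greeting"))       # "hello" (now cached)
datastore["greeting"] = "hi"  # a new write lands in the datastore
print(read("greeting"))       # still "hello": the read misses the write
```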

Eventual consistency

After a write, reads will eventually see it (typically within milliseconds). Data is replicated asynchronously.

This approach is seen in systems such as DNS and email. Eventual consistency works well in highly available systems.

Strong consistency

After a write, reads will see it. Data is replicated synchronously.

This approach is seen in file systems and RDBMSes. Strong consistency works well in systems that need transactions.
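A minimal sketch contrasting the two replication modes above, assuming toy in-memory replicas (real systems pipeline and batch this work):

```python
import queue

# Toy in-memory replicas; all helper names are hypothetical.
replicas = [{}, {}]              # two follower copies of the data
replication_log = queue.Queue()  # pending asynchronous replication work

def write_strong(key, value):
    """Synchronous replication: apply to every replica before acking,
    so any read after the ack sees the write (strong consistency)."""
    for r in replicas:
        r[key] = value
    return "ack"

def write_eventual(key, value):
    """Asynchronous replication: ack first, replicate in the background
    (eventual consistency)."""
    replication_log.put((key, value))
    return "ack"

def drain_replication_log():
    while not replication_log.empty():
        key, value = replication_log.get()
        for r in replicas:
            r[key] = value

write_eventual("x", 1)
print(replicas[0].get("x"))  # None: the ack raced ahead of replication
drain_replication_log()
print(replicas[0].get("x"))  # 1: reads eventually see the write
```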

Availability patterns

There are two main patterns to support high availability: fail-over and replication.

Fail-over

Active-passive

With active-passive fail-over, heartbeats are sent between the active and the passive server on standby. If the heartbeat is interrupted, the passive server takes over the active's IP address and resumes service.

The length of downtime is determined by whether the passive server is already running in 'hot' standby or whether it needs to start up from 'cold' standby. Only the active server handles traffic.

Active-passive failover can also be referred to as master-slave failover.

A heartbeat signal runs between the active and passive servers; if it is interrupted, the passive server immediately takes over the active server's IP and continues serving.
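A minimal sketch of that heartbeat logic, with hypothetical names; in a real deployment "taking over" means claiming the shared or virtual IP rather than flipping a flag:

```python
import time

HEARTBEAT_TIMEOUT_S = 3.0

class PassiveNode:
    def __init__(self):
        self.last_heartbeat = time.monotonic()
        self.active = False

    def on_heartbeat(self):
        # Called whenever a heartbeat arrives from the active server.
        self.last_heartbeat = time.monotonic()

    def check(self):
        # Promote ourselves if heartbeats have stopped arriving.
        if time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT_S:
            self.active = True  # take over and start serving traffic

node = PassiveNode()
node.check()
print(node.active)            # False: heartbeat is still fresh
node.last_heartbeat -= 10     # simulate a missed heartbeat window
node.check()
print(node.active)            # True: passive node has taken over
```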

Active-active

In active-active, both servers are managing traffic, spreading the load between them.

If the servers are public-facing, the DNS would need to know about the public IPs of both servers. If the servers are internal-facing, application logic would need to know about both servers.

Active-active failover can also be referred to as master-master failover.

In effect, two servers serve traffic at the same time.
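For illustration, the simplest way to spread load across two active servers is to rotate between them; the addresses below are made up:

```python
import itertools

# Sketch of active-active traffic spreading: DNS or client-side logic
# simply rotates between the two public addresses (round-robin).
servers = itertools.cycle(["203.0.113.10", "203.0.113.20"])

for _ in range(4):
    print(next(servers))  # alternates between the two active servers
```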

Disadvantage(s): fail-over

  • Fail-over adds more hardware and additional complexity.

  • There is a potential for loss of data if the active system fails before any newly written data can be replicated to the passive.

Replication

The two main schemes are master-master and master-slave replication; they are discussed in detail in the Database section.

Availability in numbers

Availability is often quantified by uptime (or downtime) as a percentage of time the service is available. Availability is generally measured in the number of 9s: a service with 99.99% availability is described as having four 9s.
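The arithmetic behind the 9s, as a quick sketch:

```python
# Translate an availability percentage into allowed downtime per year.
SECONDS_PER_YEAR = 365 * 24 * 3600

for label, availability in [("three 9s", 0.999), ("four 9s", 0.9999)]:
    downtime_min = (1 - availability) * SECONDS_PER_YEAR / 60
    print(f"{label} ({availability:.2%}): "
          f"{downtime_min:.1f} min of downtime per year")
# three 9s (99.90%): 525.6 min of downtime per year (~8.8 hours)
# four 9s (99.99%): 52.6 min of downtime per year
```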

Availability in parallel vs in sequence

If a service consists of multiple components prone to failure, the service's overall availability depends on whether the components are in sequence or in parallel.

  • In sequence - Overall availability decreases when two components with availability < 100% are in sequence:

Availability (Total) = Availability (Foo) * Availability (Bar)

  • In parallel - Overall availability increases when two components with availability < 100% are in parallel:

Availability (Total) = 1 - (1 - Availability (Foo)) * (1 - Availability (Bar))
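A worked example of both formulas, using two components that are each 99.9% available:

```python
# Composing availability: the same two components, wired two ways.
foo = bar = 0.999

in_sequence = foo * bar
in_parallel = 1 - (1 - foo) * (1 - bar)

print(f"in sequence: {in_sequence:.4%}")  # 99.8001%: lower than either part
print(f"in parallel: {in_parallel:.4%}")  # 99.9999%: higher than either part
```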

Source: https://robertgreiner.com/cap-theorem-revisited/