Availability vs consistency

CAP Theorem

https://robertgreiner.com/cap-theorem-revisited/

The CAP Theorem states that, in a distributed system (a collection of interconnected nodes that share data.), you can only have two out of the following three guarantees across a write/read pair: Consistency, Availability, and Partition Tolerance - one of them must be sacrificed.

简而言之,就是三者(consistency, availability, partition tolerance)只能得其二。

  • Consistency - A read is guaranteed to return the most recent write for a given client. (读到的是最新的 record,而不是 stale data)

  • Availability - A non-failing node will return a reasonable response within a reasonable amount of time (no error or timeout).

  • Partition Tolerance - The system will continue to function when network partitions occur.

其实有一个误区,network 并不是一直都稳定的,这一点我们通常并没有意识到。所以我们要忍受 network partition, 它是一直会存在的。我们只能在 consistency 和 availability 上面去取舍。

CP - Consistency/Partition Tolerance

Wait for a response from the partitioned node which could result in a timeout error. The system can also choose to return an error, depending on the scenario you desire. Choose Consistency over Availability when your business requirements dictate atomic reads and writes.

AP - Availability/Partition Tolerance

Return the most recent version of the data you have, which could be stale. This system state will also accept writes that can be processed later when the partition is resolved. Choose Availability over Consistency when your business requirements allow for some flexibility around when the data in the system synchronizes. Availability is also a compelling option when the system needs to continue to function in spite of external errors (shopping carts, etc.)

AP is a good choice if the business needs allow for eventual consistency or when the system needs to continue working despite external errors.

Consistency patterns

Weak consistency

After a write, reads may or may not see it. A best effort approach is taken.

This approach is seen in systems such as memcached. Weak consistency works well in real time use cases such as VoIP, video chat, and realtime multiplayer games. For example, if you are on a phone call and lose reception for a few seconds, when you regain connection you do not hear what was spoken during connection loss.

Eventual consistency

After a write, reads will eventually see it (typically within milliseconds). Data is replicated asynchronously.

This approach is seen in systems such as DNS and email. Eventual consistency works well in highly available systems.

Strong consistency

After a write, reads will see it. Data is replicated synchronously.

This approach is seen in file systems and RDBMSes. Strong consistency works well in systems that need transactions.

Availability patterns

There are two main patterns to support high availability: fail-over and replication.

Fail-over

  1. Active-passive

With active-passive fail-over, heartbeats are sent between the active and the passive server on standby. If the heartbeat is interrupted, the passive server takes over the active's IP address and resumes service.

The length of downtime is determined by whether the passive server is already running in 'hot' standby or whether it needs to start up from 'cold' standby. Only the active server handles traffic.

Active-passive failover can also be referred to as master-slave failover.

有一个信号在 active 和 passive server 之间通信。如果这个信号中断了,passive 的 server 就会立马接管 active 的 ip 继续服务。

  1. Active-active

In active-active, both servers are managing traffic, spreading the load between them.

If the servers are public-facing, the DNS would need to know about the public IPs of both servers. If the servers are internal-facing, application logic would need to know about both servers.

Active-active failover can also be referred to as master-master failover.

相当于有两个 server 来同时 serve traffic。

  1. Disadvantage

  2. Fail-over adds more hardware and additional complexity.

  3. There is a potential for loss of data if the active system fails before any newly written data can be replicated to the passive.

Replication

主要有 master-master 和 master-slave 两种,会在 database 的部分详谈。

Availability in numbers

Availability is often quantified by uptime (or downtime) as a percentage of time the service is available. Availability is generally measured in number of 9s--a service with 99.99% availability is described as having four 9s.

Availability in parallel vs in sequence

If a service consists of multiple components prone to failure, the service's overall availability depends on whether the components are in sequence or in parallel.

  • In sequence -

    Overall availability decreases when two components with availability < 100% are in sequence:

Availability (Total) = Availability (Foo) * Availability (Bar)

  • In parallel -

    Overall availability increases when two components with availability < 100% are in parallel:

Availability (Total) = 1 - (1 - Availability (Foo)) * (1 - Availability (Bar))

Last updated