Design a scalable file distribution system

This solution can also be extended to code deployment system.

Problem Description

In a multiple region multiple zone data center, there is a need to replicate an ever(meaning: at all times; always.) changing file ~10GB to a growing 10k nodes. Every node could go down. There should not be any data loss.

关于不同 system 保存 duplicates 的意义：
In cassandra/dynamodb, duplicates are used for read/write quorum to increase the availability. Data are in different versions, need to resolve the conflict before use. Individual copy is written on top of Berkeley Database (BDB) Transactional Data Store, therefore no corrupted data.
In file system, duplicates are stored to prevent one part of data is corrupted / lost. We could use checksum or hash of the file to detect if one file is correct or incorrect.
In P2P system, node could show up / disconnect at will. The file is checked correctness from the checksum.

另一个类似的问题：如果存的是 100 个 100 GB 的 file 怎么存？

Short answer is using Amazon S3 to store these files. (S3 size limit is 5T).
If do if yourself, you need to breakdown the files into smaller chunks (GFS use 64MB or 128MB).
- Then 100GB file is broke into ~1000 parts. Each part is replicated and stored separately.
- We could also use zookeeper or (DB + k8s) to keep track of the locations of files.
- Depending on the write pattern, we could either keep track of different parts from different versions or completely rewrite per version.

General Solution

High level architecture is using zookeeper to keep track of the location of master file (要拷贝的 file), version change of master file, membership management of a file sharing group.

And use bittorrent to transfer large files peer to peer to avoid a single node traffic overloaded.

Solution proposed here is DIY, zookeeper is very easy to be used wrong.
Alternative solution could be use k8s(backed by etcd) to maintain the membership. Use S3 as source of truth location. Use p2p network to transfer file.
bittorrent has a seed group management, but the seed group has a limitation of around 2k, also don't know how to use that. I think it might be doable use bittorrent to manage the membership as well.

First Thing First

The master file SHOULD NOT be changed too frequently, otherwise the “write amplification” is proportional to the number of nodes that need to be replicated. If write is done too frequently, the next round of replications could be started while previous rounds are still ongoing.
File should be replicated in case a single point of failure in master file and crash the whole system. (防止 master file down 掉)
There should be no data loss for the changes appended to the file.
Read from replicas could read stale data but not corrupted data.

Architecture

(1) You have a small group of machines (greater than 1, and 3 is a good number) to keep the up to date version of files.

Each write is ensured to write to all of the machines before returning success to the requester.
New write is appended as changelogs to a file. e.g. file_name+new_version_number.
The write order is the total broadcast order, i.e no write loss and all replicas got the changes in the same order. If small data loss is acceptable, no replication needed.

(2) When a certain threshold is met, like after a certain time period or after more than M number of changes, a pump of version is issued.

New append log file is created as file_name+(new_version_number + 1) and all new requests are redirected to this new log file.
The old log is now fully merging with the old file.

(3) Master node of this small group uploads the merged file of the current version to amazon S3.

Once the file upload is completed, register the zookeeper with the newly created version of file. If the master node is dead, promote a slave node as master.
Zookeeper also tracks the group memberships for the broadcast group. Since this is within the Data Center, membership changes should be rare.

(4) Once zookeeper is updated with a new version and its amazon s3 location, a new download command is issued to all group members, group members will use bittorrent to share the files among them, first starting from the seeder.

(5) At any given time, each replica will have a list of the same file of different versions: FileA_1, FileA_2, FileA_3_crdownload, etc. The download in progress file will have an additional appendix, and should not be used to serve read requests. The stale version of files could be cleaned up later by a background process.

PreviousDesign a key-value cache to save the results of the most recent web server queries NextDesign Amazon's sales ranking by category feature

Last updated 3 years ago

Was this helpful?