CtrlK

Design Top K System

Design Top K System

https://www.youtube.com/watch?v=kx-XDoPjoHw

Requirement

Single Host - Hash table/PriorityQueue

Top K Algorithm Implementation

However, this single host solution is not scalable with high throughput

Multiple Host - Hash table

Pro:
- process in memory
- throughput is increased
Con:
- hash table uses huge memory in total

Partition and Merge

Pro:
- increased throughput
- reduce memory usage
Con:
- this solution dropped other info. e.g. 现在我们计算的是 last 1 hour 的 top K，如果我们要计算 last 1 day 呢？是不是就没办法了。

Count-min sketch

一个字母对应多个 hash function，每次字母来了都更新这些 function 的值
因为过程中可能会有其他字母的 collision，所以我们最后找的是同一个字母对应的 minimum of all the counters

Pro:
- fixed size memory
Con:
- sacrifice with accuracy

High-level architecture

API Gateway

Single entry for all clients
aggregate data on the fly or via background process that process logs. Data is flushed based on either time or size.
Serialize data in a compact binary format(e.g. Apache Avro)

Fast Processor

Create count-min sketch and aggregate data for a short period fo time(seconds)
Because memory is no longer a problem, no need to partition the data.
Data replication is nice to have, but not be strictly required.

Storage

SQL, NoSQL
Build the final count-min sketch and store a list of top k elements for a period of time
Data replication is required

Data Partitioner

Parse batches of events into individual events.
Hash partitioning(e.g. video identifier + time window)
deal with hot partition

Detailed Design

Fast path data flow (last 1 min top k)

Slow path data flow (last 1 hr top k)

MapReduce jobs

Data Retrieval

Other Questions

You mentioned we aggregate data on the API Gateway side. What if this is not possible due to CPU or memory constraints on the API Gateway hosts?

No matter how busy these hosts are, there will be some hosts offload these logs from Gateway hosts.
We can run log storage and log parsing on some cluster and rest pipeline stays the same.

Any alternatives to count-min sketch?

Lossy Counting, Space Saving and Sticky Sampling.

How big is the K?

Fast path and slow path, we both need to merge lists to get the final resuts
several thousands should be OK, but tens of thousands may cause performance degradation.

What are the drawbacks of this architecture?

This is Lambda Architecture. The drawback is the complexity of the system.

PreviousDesigning Facebook Messenger NextDesigning Instagram

Last updated 3 years ago

Was this helpful?