Design Top K System
Last updated
https://www.youtube.com/watch?v=kx-XDoPjoHw
However, this single-host solution does not scale under high throughput
Pro:
process in memory
throughput is increased
Con:
hash table uses huge memory in total
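A minimal sketch of the in-memory approach, assuming Python: a plain hash table (`Counter`) holds one entry per distinct key, which is exactly why memory becomes the bottleneck, and a heap selects the top K.

```python
from collections import Counter
import heapq

def top_k(events, k):
    """Count event frequencies in memory, then take the k most frequent.

    The hash table keeps one counter per distinct key, so memory grows
    with the number of distinct keys, not the stream length.
    """
    counts = Counter(events)
    # nlargest scans the distinct keys in O(n log k)
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

top_k(["a", "b", "a", "c", "a", "b"], 2)  # → [("a", 3), ("b", 2)]
```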
Pro:
increased throughput
reduced memory usage
Con:
this solution drops other info. E.g., right now we compute the top K for the last 1 hour; if we want the top K for the last 1 day instead, that is no longer possible.
Each element maps to multiple hash functions; every time an element arrives, we update the counter for each of those functions.
Because other elements may collide along the way, we finally take the minimum of all the counters for the same element.
Pro:
fixed size memory
Con:
sacrifices accuracy
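The two points above can be sketched as a tiny count-min sketch. The `width`/`depth` sizes and the MD5-based per-row hashing are illustrative assumptions, not a tuned implementation:

```python
import hashlib

class CountMinSketch:
    """Fixed-size frequency sketch: depth hash rows of width counters each.

    Every arriving key increments one counter per row; the estimate is
    the minimum across rows, so collisions can only cause over-counting.
    """
    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, key):
        # One independent-ish hash per row, derived by salting with the row id
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{key}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, key, count=1):
        for row, col in self._indexes(key):
            self.table[row][col] += count

    def estimate(self, key):
        # Minimum over all rows bounds the error introduced by collisions
        return min(self.table[row][col] for row, col in self._indexes(key))

cms = CountMinSketch()
for ch in "aababc":
    cms.add(ch)
cms.estimate("a")  # never less than the true count of 3
```

Memory is fixed at `width * depth` counters no matter how many distinct keys arrive, which is the trade the Pro/Con above describes.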
Single entry for all clients
Aggregate data on the fly or via a background process that processes logs. Data is flushed based on either time or size.
Serialize data in a compact binary format (e.g. Apache Avro)
Create a count-min sketch and aggregate data for a short period of time (seconds)
Because memory is no longer a problem, no need to partition the data.
Data replication is nice to have, but not strictly required.
SQL, NoSQL
Build the final count-min sketch and store a list of the top k elements for each period of time
Data replication is required
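One reason count-min sketch fits this pipeline is that sketches are mergeable. A sketch here is represented simply as a list of counter rows (assumed built with the same hash functions and dimensions), and building the final sketch is just an element-wise sum:

```python
def merge_sketches(a, b):
    """Element-wise sum of two count-min sketch tables (lists of counter rows).

    Since each counter is a plain sum over its stream, adding two sketches
    built with identical hash functions yields the sketch of the combined
    stream — no raw events need to be replayed.
    """
    assert len(a) == len(b), "sketches must have the same depth"
    assert all(len(ra) == len(rb) for ra, rb in zip(a, b)), "same width"
    return [[x + y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]

merge_sketches([[1, 0], [0, 2]], [[3, 1], [1, 0]])  # → [[4, 1], [1, 2]]
```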
Parse batches of events into individual events.
Hash partitioning (e.g. video identifier + time window)
deal with hot partitions
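A sketch of such a partitioner, assuming Python; the `salt` parameter is a hypothetical knob for the hot-partition problem, spreading a hot key across several partitions at the cost of merging those partial counts later:

```python
import hashlib

def partition_for(video_id, event_time, window_seconds=60,
                  num_partitions=8, salt=0):
    """Route an event to a partition by hashing (video id, time window).

    Events for the same video in the same window land on the same
    partition. For a hot video, callers can pick salt in 0..s-1 to
    fan its traffic out over up to s partitions.
    """
    window = int(event_time) // window_seconds
    key = f"{video_id}:{window}:{salt}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % num_partitions
```

Two events for the same video inside one window map to the same partition, which is what lets each partition count independently.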
You mentioned we aggregate data on the API Gateway side. What if this is not possible due to CPU or memory constraints on the API Gateway hosts?
No matter how busy the Gateway hosts are, there will be dedicated hosts that offload these logs from them.
We can run log storage and log parsing on a separate cluster, and the rest of the pipeline stays the same.
Any alternatives to count-min sketch?
Lossy Counting, Space Saving and Sticky Sampling.
How big is the K?
For both the fast path and the slow path, we need to merge lists to get the final results.
Several thousand should be OK, but tens of thousands may cause performance degradation.
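A sketch of that merge step, assuming Python and `(key, count)` pairs; counts for the same key are summed before re-selecting the top K, which is why the work grows with K:

```python
import heapq

def merge_top_k(list_a, list_b, k):
    """Merge two (key, count) top-k lists into one.

    Counts for keys appearing in both lists are summed, then the top k
    are re-selected. Cheap for k in the low thousands; tens of thousands
    of entries per list makes this step noticeably more expensive.
    """
    combined = {}
    for key, count in list_a + list_b:
        combined[key] = combined.get(key, 0) + count
    return heapq.nlargest(k, combined.items(), key=lambda kv: kv[1])

merge_top_k([("a", 5), ("b", 3)], [("b", 4), ("c", 2)], 2)  # → [("b", 7), ("a", 5)]
```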
What are the drawbacks of this architecture?
This is a Lambda Architecture. The main drawback is the complexity of the system.