Design Top K System

Design Top K System

https://www.youtube.com/watch?v=kx-XDoPjoHw


Requirement


Single Host - Hash table/PriorityQueue

Top K Algorithm Implementation

However, this single host solution is not scalable with high throughput


Multiple Host - Hash table

  • Pro:

    • process in memory

    • throughput is increased

  • Con:

    • hash table uses huge memory in total

Partition and Merge

  • Pro:

    • increased throughput

    • reduce memory usage

  • Con:

    • this solution dropped other info. e.g. 现在我们计算的是 last 1 hour 的 top K,如果我们要计算 last 1 day 呢?是不是就没办法了。

Count-min sketch

  • 一个字母对应多个 hash function,每次字母来了都更新这些 function 的值

  • 因为过程中可能会有其他字母的 collision,所以我们最后找的是同一个字母对应的 minimum of all the counters

  • Pro:

    • fixed size memory

  • Con:

    • sacrifice with accuracy


High-level architecture

API Gateway

  • Single entry for all clients

  • aggregate data on the fly or via background process that process logs. Data is flushed based on either time or size.

  • Serialize data in a compact binary format(e.g. Apache Avro)

Fast Processor

  • Create count-min sketch and aggregate data for a short period fo time(seconds)

  • Because memory is no longer a problem, no need to partition the data.

  • Data replication is nice to have, but not be strictly required.

Storage

  • SQL, NoSQL

  • Build the final count-min sketch and store a list of top k elements for a period of time

  • Data replication is required

Data Partitioner

  • Parse batches of events into individual events.

  • Hash partitioning(e.g. video identifier + time window)

  • deal with hot partition


Detailed Design

Fast path data flow (last 1 min top k)

Slow path data flow (last 1 hr top k)

MapReduce jobs


Data Retrieval


Other Questions

You mentioned we aggregate data on the API Gateway side. What if this is not possible due to CPU or memory constraints on the API Gateway hosts?

  • No matter how busy these hosts are, there will be some hosts offload these logs from Gateway hosts.

  • We can run log storage and log parsing on some cluster and rest pipeline stays the same.

Any alternatives to count-min sketch?

  • Lossy Counting, Space Saving and Sticky Sampling.

How big is the K?

  • Fast path and slow path, we both need to merge lists to get the final resuts

  • several thousands should be OK, but tens of thousands may cause performance degradation.

What are the drawbacks of this architecture?

  • This is Lambda Architecture. The drawback is the complexity of the system.

Last updated