System Design
  • Introduction
  • Glossary of System Design
    • System Design Basics
    • Key Characteristics of Distributed Systems
    • Scalability - Harvard lecture
      • Scalability for Dummies - Part 1: Clones
      • Scalability for Dummies - Part 2: Database
      • Scalability for Dummies - Part 3: Cache
      • Scalability for Dummies - Part 4: Asynchronism
    • Trade-off
      • CAP Theorem
      • Performance vs scalability
      • Latency vs throughput
      • Availability vs consistency
    • Load Balancing
      • Load balancer
    • Proxies
      • Reverse proxy
    • Cache
      • Caching
    • Asynchronism
    • Processing guarantee in Kafka
    • Database
      • Relational database management system (RDBMS)
      • Redundancy and Replication
      • Data Partitioning
      • Indexes
      • NoSQL
      • SQL vs. NoSQL
      • Consistent Hashing
    • Application layer
    • DNS
    • CDN
    • Communication
      • Long-Polling vs WebSockets vs Server-Sent Events
    • Security
    • Lambda Architecture
  • OOD design
    • Concepts
      • Object-Oriented Basics
      • OO Analysis and Design
      • What is UML?
      • Use Case Diagrams
    • Design a parking lot
  • System Design Cases
    • Overview
    • Design a system that scales to millions of users on AWS
    • Designing a URL Shortening service like TinyURL
      • Design Unique ID Generator
      • Designing Pastebin
      • Design Pastebin.com (or Bit.ly)
    • Design notification system (scott)
      • Designing notification service
    • Designing Chat System
      • Designing Slack
      • Designing Facebook Messenger
    • Design Top K System
    • Designing Instagram
    • Design a newsfeed system
      • Designing Facebook’s Newsfeed
      • Design the data structures for a social network
    • Designing Twitter
      • Design the Twitter timeline and search
      • Designing Twitter Search
    • Design Youtube - Scott
      • Design live commenting
      • Designing Youtube or Netflix
    • Designing a Web Crawler
      • Designing a distributed job scheduler
      • Designing a Web Crawler/Archive (scott)
      • Design a web crawler
    • Designing Dropbox
    • Design Google Doc
    • Design Metrics Aggregation System
      • Design Ads Logging System
    • Design Instacart
    • Design a payment system
      • Airbnb - Avoiding Double Payments in a Distributed Payments System
    • Design Distributed Message Queue
      • Cherami: Uber Engineering’s Durable and Scalable Task Queue in Go
    • Design Distributed Cache
      • Design a key-value cache to save the results of the most recent web server queries
    • Design a scalable file distribution system
    • Design Amazon's sales ranking by category feature
    • Design Mint.com
    • Design Autocomplete System
      • Designing Typeahead Suggestion
    • Designing an API Rate Limiter
      • Designing Rate Limiter
    • Design Google Map
      • Designing Yelp or Nearby Friends
      • Designing Uber backend
    • Designing Ticketmaster
      • Design 12306 - Scott
    • Design AirBnB or a Hotel Booking System
  • Paper Reading
    • MapReduce
  • Other Questions
    • What happened after you input the url in the browser?
Powered by GitBook
On this page
  • Design Top K System
  • Requirement
  • Single Host - Hash table/PriorityQueue
  • Top K Algorithm Implementation
  • Multiple Host - Hash table
  • Partition and Merge
  • Count-min sketch
  • High-level architecture
  • API Gateway
  • Fast Processor
  • Storage
  • Data Partitioner
  • Detailed Design
  • Fast path data flow (last 1 min top k)
  • Slow path data flow (last 1 hr top k)
  • Data Retrieval
  • Other Questions

Was this helpful?

  1. System Design Cases

Design Top K System

PreviousDesigning Facebook MessengerNextDesigning Instagram

Last updated 3 years ago

Was this helpful?

Design Top K System

https://www.youtube.com/watch?v=kx-XDoPjoHw


Requirement


Single Host - Hash table/PriorityQueue

Top K Algorithm Implementation

However, this single host solution is not scalable with high throughput


Multiple Host - Hash table

  • Pro:

    • process in memory

    • throughput is increased

  • Con:

    • hash table uses huge memory in total

Partition and Merge

  • Pro:

    • increased throughput

    • reduce memory usage

  • Con:

    • this solution dropped other info. e.g. 现在我们计算的是 last 1 hour 的 top K,如果我们要计算 last 1 day 呢?是不是就没办法了。

Count-min sketch

  • 一个字母对应多个 hash function,每次字母来了都更新这些 function 的值

  • 因为过程中可能会有其他字母的 collision,所以我们最后找的是同一个字母对应的 minimum of all the counters

  • Pro:

    • fixed size memory

  • Con:

    • sacrifice with accuracy


High-level architecture

API Gateway

  • Single entry for all clients

  • aggregate data on the fly or via background process that process logs. Data is flushed based on either time or size.

  • Serialize data in a compact binary format(e.g. Apache Avro)

Fast Processor

  • Create count-min sketch and aggregate data for a short period fo time(seconds)

  • Because memory is no longer a problem, no need to partition the data.

  • Data replication is nice to have, but not be strictly required.

Storage

  • SQL, NoSQL

  • Build the final count-min sketch and store a list of top k elements for a period of time

  • Data replication is required

Data Partitioner

  • Parse batches of events into individual events.

  • Hash partitioning(e.g. video identifier + time window)

  • deal with hot partition


Detailed Design

Fast path data flow (last 1 min top k)

Slow path data flow (last 1 hr top k)

MapReduce jobs


Data Retrieval


Other Questions

You mentioned we aggregate data on the API Gateway side. What if this is not possible due to CPU or memory constraints on the API Gateway hosts?

  • No matter how busy these hosts are, there will be some hosts offload these logs from Gateway hosts.

  • We can run log storage and log parsing on some cluster and rest pipeline stays the same.

Any alternatives to count-min sketch?

  • Lossy Counting, Space Saving and Sticky Sampling.

How big is the K?

  • Fast path and slow path, we both need to merge lists to get the final resuts

  • several thousands should be OK, but tens of thousands may cause performance degradation.

What are the drawbacks of this architecture?

  • This is Lambda Architecture. The drawback is the complexity of the system.