System Design
  • Introduction
  • Glossary of System Design
    • System Design Basics
    • Key Characteristics of Distributed Systems
    • Scalability - Harvard lecture
      • Scalability for Dummies - Part 1: Clones
      • Scalability for Dummies - Part 2: Database
      • Scalability for Dummies - Part 3: Cache
      • Scalability for Dummies - Part 4: Asynchronism
    • Trade-off
      • CAP Theorem
      • Performance vs scalability
      • Latency vs throughput
      • Availability vs consistency
    • Load Balancing
      • Load balancer
    • Proxies
      • Reverse proxy
    • Cache
      • Caching
    • Asynchronism
    • Processing guarantee in Kafka
    • Database
      • Relational database management system (RDBMS)
      • Redundancy and Replication
      • Data Partitioning
      • Indexes
      • NoSQL
      • SQL vs. NoSQL
      • Consistent Hashing
    • Application layer
    • DNS
    • CDN
    • Communication
      • Long-Polling vs WebSockets vs Server-Sent Events
    • Security
    • Lambda Architecture
  • OOD design
    • Concepts
      • Object-Oriented Basics
      • OO Analysis and Design
      • What is UML?
      • Use Case Diagrams
    • Design a parking lot
  • System Design Cases
    • Overview
    • Design a system that scales to millions of users on AWS
    • Designing a URL Shortening service like TinyURL
      • Design Unique ID Generator
      • Designing Pastebin
      • Design Pastebin.com (or Bit.ly)
    • Design notification system (scott)
      • Designing notification service
    • Designing Chat System
      • Designing Slack
      • Designing Facebook Messenger
    • Design Top K System
    • Designing Instagram
    • Design a newsfeed system
      • Designing Facebook’s Newsfeed
      • Design the data structures for a social network
    • Designing Twitter
      • Design the Twitter timeline and search
      • Designing Twitter Search
    • Design Youtube - Scott
      • Design live commenting
      • Designing Youtube or Netflix
    • Designing a Web Crawler
      • Designing a distributed job scheduler
      • Designing a Web Crawler/Archive (scott)
      • Design a web crawler
    • Designing Dropbox
    • Design Google Doc
    • Design Metrics Aggregation System
      • Design Ads Logging System
    • Design Instacart
    • Design a payment system
      • Airbnb - Avoiding Double Payments in a Distributed Payments System
    • Design Distributed Message Queue
      • Cherami: Uber Engineering’s Durable and Scalable Task Queue in Go
    • Design Distributed Cache
      • Design a key-value cache to save the results of the most recent web server queries
    • Design a scalable file distribution system
    • Design Amazon's sales ranking by category feature
    • Design Mint.com
    • Design Autocomplete System
      • Designing Typeahead Suggestion
    • Designing an API Rate Limiter
      • Designing Rate Limiter
    • Design Google Map
      • Designing Yelp or Nearby Friends
      • Designing Uber backend
    • Designing Ticketmaster
      • Design 12306 - Scott
    • Design AirBnB or a Hotel Booking System
  • Paper Reading
    • MapReduce
  • Other Questions
    • What happened after you input the url in the browser?
Powered by GitBook
On this page
  • Design Metrics Aggregation System
  • Requirement
  • Functional Requirement
  • Non-Functional requirements
  • High-level design
  • Detailed design
  • Query Design
  • Database selection - Where should the metrics data be stored?
  • Middleware - How do we handle the high throughput?
  • How is the metrics data transferred from Kafka to Elasticsearch?
  • Ensure high availability
  • Data partition

Was this helpful?

  1. System Design Cases

Design Metrics Aggregation System

PreviousDesign Google DocNextDesign Ads Logging System

Last updated 3 years ago

Was this helpful?

Design Metrics Aggregation System

Two Sigma 的 , 讲得非常好!

这个 实际上是模仿 Two Sigma 做的。

对底层原理解释得不错。

对上面的进行了总结和讲解,讲得也不错。


Requirement

Functional Requirement

  • add instrumentation for services that runs on hundreds of services in cloud

  • collect stats about servers where application runs

  • visualize & monitor the metrics

Non-Functional requirements

  • Reliable

  • Scalable

  • Low end-to-end latency

  • Flexible for different schema


High-level design


Detailed design

Query Design

  • we can collect query-specific information like what’s shown below

  • by storing the data in its raw form, we are able to do exploratory analysis using dynamic “if that, then what?” questions.

{
    query_time: "20190117 09:00::05",
    user: "user1",
    dataset: "x",
    date_range: {
        begin: "20180203",
        end: "20180204"
    },
    duration: 100,
    bytes: 350000000,
    query_param_1: 1.0,
    query_param_2: "log",
    ...
}

Database selection - Where should the metrics data be stored?

  • Time series database: InfluxDB, graphite

  • Elasticsearch

Solution: Elasticsearch

  • best ELK bundle: elasticsearch + logstash + kibana

  • easy searching and analysis, and it comes with plugins, like Kibana, that make data analysis and visualization easy.


Middleware - How do we handle the high throughput?

Solution: Kafka

  • Elasticsearch wasn’t built to handle 50,000 messages per second (we tested this through internal benchmarks).

  • Use a buffer, which is analogous to a water tank.

  • You can think of a Kafka topic as a partitioned queue where the partitions can be written to and read from simultaneously.


How is the metrics data transferred from Kafka to Elasticsearch?

  • read data from Kafka and then serve to Elasticsearch require a lot fo application code, handling edge cases and errors

  • existing messaging solutions for connecting Kafka and Elasticsearch

Solution: use logstash as a pipeline

  • config logstash

input {
  kafka {
    bootstrap_servers => "kafka-server.host:9095"
    topic_id => "usage-topic"
    codec => "json"
    group_id => "consumer-group-a"
  }
}
filter {
  date {
    match => ["query_time", "UNIX_MS"]
    remove_field => ["query_time"]
  }
}
output {
  elasticsearch {
    index => "metrics-%{+YYYY-MM-dd}"
    hosts => ["metrics.elasticsearch.host:443"]
  }
}

Ensure high availability

  • Use kubernetes to maintain multiple logstash instances

  • Two sigma uses marathon since they maintain it locally


Data partition

Data storage

  • Use Write-ahead-log (WAL) to store logs

  • compact old files using log-structured merge tree(LSMT).

Sharding

  • 不能只用 service name 作为 sharding key,否则会有 hot entity issue。

  • Because most of time we query by time range to list service instances’ API methods (sample picture below), the sharding key can be “time bucket” + “service name”, where the time bucket is a timestamp rounded to some interval.

  • sample query:

SELECT * FROM metric_readings
WHERE service_name = "Service A"
AND time_bucket IN (1411840800, 1411844400)
AND timestamp >= 1411841700
AND timestamp <= 1411845300;
  • 如果依然有 hot entity issue,we can append sequence number such as 1,2,3 … to it.

read this .

monitoring system
video
这篇
这篇
doc