Design Metrics Aggregation System
Design Metrics Aggregation System
Two Sigma 的 monitoring system, 讲得非常好!
这个 video 实际上是模仿 Two Sigma 做的。
这篇对底层原理解释得不错。
这篇 对上面的进行了总结和讲解,讲得也不错。
Requirement
Functional Requirement
add instrumentation for services that runs on hundreds of services in cloud
collect stats about servers where application runs
visualize & monitor the metrics
Non-Functional requirements
Reliable
Scalable
Low end-to-end latency
Flexible for different schema
High-level design
Detailed design
Query Design
we can collect query-specific information like what’s shown below
by storing the data in its raw form, we are able to do exploratory analysis using dynamic “if that, then what?” questions.
Database selection - Where should the metrics data be stored?
Time series database: InfluxDB, graphite
Elasticsearch
Solution: Elasticsearch
best ELK bundle: elasticsearch + logstash + kibana
easy searching and analysis, and it comes with plugins, like Kibana, that make data analysis and visualization easy.
Middleware - How do we handle the high throughput?
Solution: Kafka
Elasticsearch wasn’t built to handle 50,000 messages per second (we tested this through internal benchmarks).
Use a buffer, which is analogous to a water tank.
You can think of a Kafka topic as a partitioned queue where the partitions can be written to and read from simultaneously.
How is the metrics data transferred from Kafka to Elasticsearch?
read data from Kafka and then serve to Elasticsearch require a lot fo application code, handling edge cases and errors
existing messaging solutions for connecting Kafka and Elasticsearch
Solution: use logstash as a pipeline
config logstash
Ensure high availability
Use kubernetes to maintain multiple logstash instances
Two sigma uses marathon since they maintain it locally
Data partition
Data storage
read this doc.
Use Write-ahead-log (WAL) to store logs
compact old files using log-structured merge tree(LSMT).
Sharding
不能只用 service name 作为 sharding key,否则会有 hot entity issue。
Because most of time we query by time range to list service instances’ API methods (sample picture below), the sharding key can be “time bucket” + “service name”, where the
time bucket
is a timestamp rounded to some interval.sample query:
如果依然有 hot entity issue,we can append sequence number such as 1,2,3 … to it.
Last updated