Design Metrics Aggregation System

Two Sigma's monitoring-system talk is excellent.

This video is essentially modeled on Two Sigma's design.

This article explains the underlying principles well.

This article summarizes and walks through the one above, and is also worth reading.


Requirement

Functional Requirement

  • add instrumentation to services that run on hundreds of servers in the cloud

  • collect stats about the servers where the applications run

  • visualize & monitor the metrics

Non-Functional Requirements

  • Reliable

  • Scalable

  • Low end-to-end latency

  • Flexible enough to support different metric schemas


High-level design

Instrumented services publish raw metric events into Kafka; Logstash consumes from Kafka and writes the events into Elasticsearch; Kibana sits on top of Elasticsearch for visualization and monitoring.

Detailed design

Query Design

  • we can collect query-specific information like what’s shown below

  • by storing the data in its raw form, we can do exploratory analysis with ad-hoc "what if?" questions, instead of being limited to pre-aggregated metrics.
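
For instance, a raw event for a single API call might look like the hypothetical record below; the field names are illustrative assumptions, not the original schema.

```python
# Hypothetical raw metric event; field names are illustrative only.
query_event = {
    "timestamp": "2023-05-01T12:34:56Z",   # when the call happened
    "service_name": "order-service",       # which service emitted it
    "host": "prod-host-17",                # server the application ran on
    "api_method": "GET /orders/{id}",      # instrumented endpoint
    "latency_ms": 42,                      # end-to-end latency of the call
    "status_code": 200,                    # response status
}
```

Because the raw events are kept rather than pre-aggregated counters, any of these fields can later be filtered or grouped on.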


Database selection - Where should the metrics data be stored?

  • Time-series database: InfluxDB, Graphite

  • Elasticsearch

Solution: Elasticsearch

  • works best as part of the ELK stack: Elasticsearch + Logstash + Kibana

  • offers easy searching and analysis, and comes with companion tools such as Kibana that make data analysis and visualization easy (see the indexing/query sketch below)
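
As a rough sketch of how metrics could be indexed and queried back (assuming the `elasticsearch-py` 8.x client, an index named `metrics`, and the hypothetical field names from earlier):

```python
from elasticsearch import Elasticsearch

# Assumed connection details and index name; adjust for a real cluster.
es = Elasticsearch("http://localhost:9200")

# Index one raw metric document (refresh=True only so the example is
# immediately searchable; avoid it on a hot production write path).
es.index(
    index="metrics",
    document={
        "timestamp": "2023-05-01T12:34:56Z",
        "service_name": "order-service",
        "api_method": "GET /orders/{id}",
        "latency_ms": 42,
    },
    refresh=True,
)

# Query one service over a time range and aggregate the average latency.
resp = es.search(
    index="metrics",
    size=0,
    query={"bool": {"filter": [
        {"term": {"service_name.keyword": "order-service"}},
        {"range": {"timestamp": {"gte": "2023-05-01T12:00:00Z"}}},
    ]}},
    aggs={"avg_latency": {"avg": {"field": "latency_ms"}}},
)
print(resp["aggregations"]["avg_latency"]["value"])
```

Kibana dashboards drive their visualizations with essentially the same kind of time-range and aggregation queries against Elasticsearch.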


Middleware - How do we handle the high throughput?

Solution: Kafka

  • Elasticsearch was not built to handle 50,000 messages per second (Two Sigma verified this through internal benchmarks).

  • Put a buffer in front of it, which is analogous to a water tank: it absorbs write bursts and drains at a rate the downstream store can handle.

  • You can think of a Kafka topic as a partitioned queue where the partitions can be written to and read from simultaneously (see the producer sketch below).
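
A minimal producer sketch, assuming the `kafka-python` client, a topic named `metrics`, and a broker at `localhost:9092` (all illustrative choices): services publish raw events into the Kafka buffer instead of writing to Elasticsearch directly.

```python
import json
from kafka import KafkaProducer

# Assumed broker address; serialize keys/values as UTF-8 JSON.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "timestamp": "2023-05-01T12:34:56Z",
    "service_name": "order-service",
    "api_method": "GET /orders/{id}",
    "latency_ms": 42,
}

# Events with the same key land in the same partition, so per-service
# ordering is preserved while other partitions are produced to and
# consumed from in parallel.
producer.send("metrics", key=event["service_name"], value=event)
producer.flush()
```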


How is the metrics data transferred from Kafka to Elasticsearch?

  • reading data from Kafka and writing it to Elasticsearch ourselves would require a lot of application code to handle edge cases and errors

  • instead, look at existing solutions for connecting Kafka and Elasticsearch

Solution: use logstash as a pipeline

  • configure Logstash with a Kafka input and an Elasticsearch output (a minimal example follows)
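
A minimal pipeline configuration might look like the sketch below; the broker address, topic name, Elasticsearch host, and index pattern are assumptions for illustration.

```
input {
  kafka {
    bootstrap_servers => "localhost:9092"   # assumed broker
    topics            => ["metrics"]        # assumed topic name
    codec             => "json"
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]      # assumed Elasticsearch endpoint
    index => "metrics-%{+YYYY.MM.dd}"       # daily indices for easier retention
  }
}
```

Logstash then takes care of batching, retries, and back-pressure: exactly the kind of glue code we would otherwise write by hand.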


Ensure high availability

  • Use Kubernetes to maintain multiple Logstash instances, so any instance that dies is replaced automatically (see the sketch below)

  • Two Sigma uses Marathon instead, since they already maintain it in-house
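
As a rough sketch (the image tag, replica count, and ConfigMap name are assumptions), a Kubernetes Deployment keeps a fixed number of Logstash replicas running:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: logstash
spec:
  replicas: 3                      # keep three Logstash instances alive
  selector:
    matchLabels:
      app: logstash
  template:
    metadata:
      labels:
        app: logstash
    spec:
      containers:
        - name: logstash
          image: docker.elastic.co/logstash/logstash:8.13.0   # assumed image/tag
          volumeMounts:
            - name: pipeline
              mountPath: /usr/share/logstash/pipeline          # default pipeline dir
      volumes:
        - name: pipeline
          configMap:
            name: logstash-pipeline                            # assumed ConfigMap
```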


Data partition

Data storage

Read this doc for background on how the storage layer works.

  • Use a write-ahead log (WAL) to durably record every incoming write before it is applied.

  • Compact older files using a log-structured merge tree (LSM tree); see the sketch below.
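
A toy sketch of that write path (a single-node store with made-up file names): each write is appended and fsynced to the WAL before being applied in memory, so the in-memory table can be rebuilt after a crash.

```python
import json
import os

class WriteAheadLog:
    """Toy WAL: every write is appended and fsynced before it is acknowledged."""

    def __init__(self, path):
        self.path = path
        self.file = open(path, "a", encoding="utf-8")

    def append(self, record):
        # One JSON record per line; flush + fsync so it survives a crash.
        self.file.write(json.dumps(record) + "\n")
        self.file.flush()
        os.fsync(self.file.fileno())

    def replay(self):
        # On restart, re-read the log to rebuild the in-memory table
        # before it has been flushed into sorted (LSM) files on disk.
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)

wal = WriteAheadLog("metrics.wal")
wal.append({"ts": 1683000000, "service": "order-service", "latency_ms": 42})
memtable = list(wal.replay())   # state recovered after a restart
print(memtable)
```

Once the in-memory table fills up, it is flushed to an immutable sorted file, and the LSM tree periodically merges (compacts) older files in the background to keep reads cheap.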

Sharding

  • We cannot use the service name alone as the sharding key; otherwise we would run into a hot-entity issue.

  • Because most of the time we query by time range to list a service instance's API methods (sample picture below), the sharding key can be "time bucket" + "service name", where the time bucket is a timestamp rounded down to some fixed interval.

  • sample query:

  • If there is still a hot-entity issue, we can append a sequence number such as 1, 2, 3, … to the key (see the sketch below).
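
A minimal sketch of deriving such a key (the bucket size, the hot-service list, and the suffix count are all assumptions):

```python
import random

TIME_BUCKET_SECONDS = 3600          # assumption: round timestamps to 1-hour buckets
HOT_SERVICES = {"order-service"}    # assumption: services known to be write-heavy
HOT_SUFFIXES = 4                    # spread each hot service across 4 sub-keys

def sharding_key(service_name, timestamp):
    """Shard key = time bucket + service name (+ sequence suffix when hot)."""
    bucket = timestamp - (timestamp % TIME_BUCKET_SECONDS)
    key = f"{bucket}:{service_name}"
    if service_name in HOT_SERVICES:
        # Append a sequence number (1, 2, 3, ...) so a single hot service
        # is spread over several shards instead of one hot shard.
        key = f"{key}:{random.randint(1, HOT_SUFFIXES)}"
    return key

print(sharding_key("order-service", 1683000123))  # e.g. "1683000000:order-service:2"
```

The trade-off is that queries for a hot service must fan out over all of its suffixed shards and merge the results.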
