Design notification system (scott)

Problem

Build a notification system for your company. It should support the following notification types:

  • In-app notifications, like the built-in notifications on Apple / Android devices

  • Email notification

  • Phone notification

  • SMS notification

Integrate with various third-party services such as SendGrid and Twilio.

Support more than one delivery mode:

  • At least once

  • At most once

  • Exactly once [if possible]

A unified interface for other services to use your system. A real-time dashboard that shows the running processes and how many notifications are sent, in progress, and queued.


Business Use Case

MVP

  • Delivery of notifications to various kinds of receivers (apps, email, phone, SMS)

  • Support for different delivery modes (at least once, at most once)

  • Push / pull model for active subscribers / idle subscribers

Bonus

  • Delivery support for exactly once (2PC, transactional)

  • Recurring notification

  • Scheduled notification

  • Images/videos

Non Goal

  • Latency for delivering the notifications

  • Maintain the order of notifications


Constraints

  • High Availability

  • High Scalability

  • Flexibility


Traffic Estimation

Data points: Facebook has 200M daily active users, with 5 notifications per user per day

  • DAU: 200 M

  • QPS: $200\text{M} \times 5 / (24 \times 3600) \approx 10^4$ QPS

  • Peak: $5 \times 10^4$ QPS
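The estimate above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch; the variable names are illustrative, and the peak multiplier of 5 is the one used in these notes rather than a measured value.

```python
# Back-of-the-envelope traffic estimate using the figures quoted above.
DAU = 200_000_000             # daily active users
NOTIFICATIONS_PER_USER = 5    # notifications per user per day
SECONDS_PER_DAY = 24 * 3600
PEAK_FACTOR = 5               # peak-to-average multiplier assumed in the notes

average_qps = DAU * NOTIFICATIONS_PER_USER / SECONDS_PER_DAY
peak_qps = PEAK_FACTOR * average_qps

print(f"average: {average_qps:,.0f} QPS")  # on the order of 10^4
print(f"peak:    {peak_qps:,.0f} QPS")     # on the order of 5 * 10^4
```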


High-level design

  • Kafka/Flink can be used to build the monitoring system

    • Note: logs cannot be used to drive the real-time dashboard, because log data can be lost.
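As a rough illustration of what the dashboard needs to compute, the sketch below folds a stream of status-change events into counts of queued, in-progress, and sent notifications. The event shape `(old_status, new_status)` and the status names are assumptions made for this example; in the design above the events would arrive through Kafka/Flink rather than a Python list.

```python
from collections import Counter


def apply_event(counts, event):
    """Update dashboard counts for one status-change event.

    `event` is a hypothetical (old_status, new_status) pair;
    old_status is None when a notification first enters the system.
    """
    old_status, new_status = event
    if old_status is not None:
        counts[old_status] -= 1
    counts[new_status] += 1
    return counts


counts = Counter()
events = [
    (None, "QUEUED"),
    (None, "QUEUED"),
    ("QUEUED", "IN_PROGRESS"),
    ("IN_PROGRESS", "SENT"),
]
for event in events:
    apply_event(counts, event)
```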


API Design

createTopic(TopicName, ServiceType, Metadata)

  • Example data: Ads_campaign_1234, In_app, Priority, SecurityMetadata

  • Topic - Topic ID, Topic Name, Service Type, Topic MetaData, Messages


send(TopicID, SEND_MODE)

  • SEND_MODE: at_least_once, at_most_once, exactly_once


subscribe(TopicID, SUB_MODE)

  • SUB_MODE, Priority
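A minimal in-memory sketch of the three calls might look like the following. It is not a definitive interface: the `message` and `subscriber_id` parameters, the `"push"` default for the subscription mode, and the use of UUIDs for IDs are assumptions added so the example is self-contained.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum


class SendMode(Enum):
    AT_LEAST_ONCE = "at_least_once"
    AT_MOST_ONCE = "at_most_once"
    EXACTLY_ONCE = "exactly_once"


@dataclass
class Topic:
    topic_id: str
    name: str
    service_type: str
    metadata: dict
    messages: list = field(default_factory=list)
    subscribers: list = field(default_factory=list)


class NotificationService:
    """In-memory stand-in; a real service would persist topics and messages."""

    def __init__(self):
        self._topics = {}

    def create_topic(self, name, service_type, metadata=None):
        topic = Topic(str(uuid.uuid4()), name, service_type, metadata or {})
        self._topics[topic.topic_id] = topic
        return topic.topic_id

    def subscribe(self, topic_id, subscriber_id, sub_mode="push", priority=0):
        # sub_mode is assumed to be "push" or "pull"
        self._topics[topic_id].subscribers.append((subscriber_id, sub_mode, priority))

    def send(self, topic_id, message, send_mode=SendMode.AT_LEAST_ONCE):
        message_id = str(uuid.uuid4())
        self._topics[topic_id].messages.append((message_id, message, send_mode))
        return message_id  # receipt the caller can use to check status later


service = NotificationService()
topic_id = service.create_topic("Ads_campaign_1234", "In_app", {"priority": "high"})
service.subscribe(topic_id, "user_1")
receipt = service.send(topic_id, "hello", SendMode.AT_LEAST_ONCE)
```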


Database Design

Message Storage Table - NoSQL

  • DynamoDB

  • Cassandra - write heavy - Cassandra's log-structured merge tree is well suited to write-heavy workloads. It also has a multi-master architecture and partitions data across all nodes.

Message Storage Table(DynamoDB)

| MessageID (PartitionKey) | Timestamp (sortKey) | topicID | message | senderID |
| --- | --- | --- | --- | --- |
| abc_123 | 897987686 | 223 | "hello word" | 112 |

Metadata Table

| MessageID | Status | SendMode | ServiceType | ReceiverID | timestamps |
| --- | --- | --- | --- | --- | --- |
| abc_123 | PENDING | AT_LEAST_ONCE | Email | 112 | 24253535 |


Detailed Design

Message Status

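  • Message status: PENDING, SENDING, DELIVERED/FAILED (CLICK|UNSUBSCRIBE)

  • Once the data from the publisher has been stored in the database, we can tell the publisher that its notification was received.

  • The advantage is high availability: as soon as the record is saved we acknowledge the publisher, and an async thread later reads the PENDING records from the database.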
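The statuses can be treated as a small state machine. The sketch below is one possible reading: the set of allowed transitions, in particular FAILED back to SENDING for retries, is an assumption and not something the notes spell out.

```python
from enum import Enum


class Status(Enum):
    PENDING = "PENDING"
    SENDING = "SENDING"
    DELIVERED = "DELIVERED"
    FAILED = "FAILED"


# Assumed transitions: a FAILED message may go back to SENDING when retried.
ALLOWED_TRANSITIONS = {
    Status.PENDING: {Status.SENDING},
    Status.SENDING: {Status.DELIVERED, Status.FAILED},
    Status.FAILED: {Status.SENDING},
    Status.DELIVERED: set(),  # terminal
}


def transition(current, new):
    """Return the new status, or raise if the move is not allowed."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new
```

Rejecting illegal moves in one place keeps a late or duplicated worker update from, say, flipping a DELIVERED message back to SENDING.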


Life Cycle (Service_A sends an SMS to User_1)

  1. Service_A calls the API with metadata and the message: send(topicID, at_least_once, message). The message status is set to PENDING. (We cannot acknowledge the client yet, because the server may crash. Only after the message is stored in the DB in step 4 can we confirm receipt to the client.)

  2. The LB routes the message to Kafka/Flink for monitoring.

  3. Call the Metadata Service to get the topic object (JSON) // new topic including topic storage

  4. Store the message (update the message status to SENDING/FAILED) -> return a message receipt to the client; the client can poll with the receipt to check the status. (Only once the message is safely stored can we tell the customer it was received!)

  5. The Sender sends the message.

  6. If the Sender hits a timeout or exception -> retry queue (DLQ) -> (send a Kafka message to the monitoring system, update the message status to SENDING)

  7. Send to SMS/Email/Phone -> the provider sends back an Ack.
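The core rule in this life cycle, persist first and acknowledge second, can be sketched as follows. The dictionary and `queue.Queue` are stand-ins for the database and the worker queue, the function names are hypothetical, and the sketch follows the simpler variant where the record is stored as PENDING and an async worker moves it forward.

```python
import queue
import uuid

message_store = {}              # stand-in for the message/metadata tables
pending_queue = queue.Queue()   # stand-in for the queue read by async workers


def accept(topic_id, send_mode, message):
    """Persist the message, then hand a receipt back to the caller."""
    message_id = str(uuid.uuid4())
    message_store[message_id] = {
        "topic_id": topic_id,
        "send_mode": send_mode,
        "message": message,
        "status": "PENDING",
    }
    pending_queue.put(message_id)
    return message_id  # returned only after the write above has succeeded


def run_worker_once(deliver):
    """Take one pending message and try to deliver it via `deliver`."""
    message_id = pending_queue.get_nowait()
    record = message_store[message_id]
    record["status"] = "SENDING"
    try:
        deliver(record["message"])
        record["status"] = "DELIVERED"
    except Exception:
        record["status"] = "FAILED"  # a real system would enqueue a retry here
    return message_id


def get_status(receipt):
    """What the client polls with its receipt."""
    return message_store[receipt]["status"]
```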


How do we prevent data loss?

We save a notification log in the database. After a worker takes a task from the queue, it also saves a notification log.


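Will the user receive a notification only once?

  • We cannot guarantee it. In practice a user may well receive the same notification multiple times, so the client side also needs a dedupe mechanism.

  • We can dedupe by notificationID. The server side can also filter, but that is mainly to stop spam from reminding the user repeatedly.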
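Client-side dedupe by notificationID can be as simple as remembering which IDs have already been shown. This is a minimal sketch with a hypothetical class name; the set is unbounded here, whereas a real client would likely cap it or expire old IDs.

```python
class NotificationDeduplicator:
    """Remembers notification IDs that were already shown to the user."""

    def __init__(self):
        self._seen = set()  # unbounded in this sketch

    def should_show(self, notification_id):
        """Return True the first time an ID is seen, False on repeats."""
        if notification_id in self._seen:
            return False
        self._seen.add(notification_id)
        return True


dedupe = NotificationDeduplicator()
first = dedupe.should_show("abc_123")      # True: new notification
duplicate = dedupe.should_show("abc_123")  # False: redelivered by a retry
```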


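Use templates to speed things up (Notification Template)

  • Many emails are similar and differ only in the date and the name, for example an offer or a rejection letter. The data is already available, so we only need to fill the template with the personal information.

  • Formatting is less error-prone, and sending is faster.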
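Template filling can be illustrated with Python's standard `string.Template`. The template text and field names below are made up for the example.

```python
from string import Template

# Hypothetical offer-letter template; only the placeholders vary per recipient.
OFFER_TEMPLATE = Template(
    "Dear $name, we are pleased to offer you the $role position starting on $start_date."
)


def render(template, **fields):
    """Fill the template; substitute() raises KeyError if a field is missing."""
    return template.substitute(**fields)


body = render(OFFER_TEMPLATE, name="Ada", role="Engineer", start_date="2024-01-01")
```

Using `substitute` rather than `safe_substitute` means a missing field fails loudly instead of sending an email with a raw `$name` in it.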


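Retry when a message fails to send

  • It is normal for downstream dependencies to fail, for example Firebase goes down and the message is not sent. The task is put back into the queue. Suppose we retry 3 times (a configured max retry number) and it still fails; we then need to notify the producer (the sender) and have the on-call engineer investigate.

  • Backoff retry mechanism, as in SNS: retry quickly at first, then gradually increase the interval between retries.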
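A minimal sketch of retry with exponential backoff, assuming a maximum of 3 retries as in the example above. The function name, the base delay, and the doubling factor are illustrative choices, not the actual SNS policy; the `sleep` parameter is injectable so the logic can be exercised without waiting.

```python
import time


def send_with_retry(send, message, max_retries=3, base_delay=0.5, factor=2.0,
                    sleep=time.sleep):
    """Call send(message); on failure retry up to max_retries times.

    The wait between attempts starts at base_delay seconds and is
    multiplied by factor after each failure. When every attempt fails,
    the last exception is re-raised so the caller can notify the
    producer and alert on-call.
    """
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return send(message)
        except Exception:
            if attempt == max_retries:
                raise
            sleep(delay)
            delay *= factor
```

Production systems usually add random jitter to each delay so that many failed tasks do not all retry at the same instant.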


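Are messages guaranteed to be delivered in order?

  • No. This is the same problem as exactly-once delivery: the network can fail, the user's phone can fail to receive, and with retries in play we cannot guarantee ordering.

  • We can set up multiple queues and use hashing so that messages for the same user land in the same queue whenever possible. That keeps messages in order as far as possible.

    • Also, each queue should be accessed by a single worker. If several workers access the same queue at once, it is easy to mess things up, for example by producing duplicate messages.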
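Routing a user's messages to one queue can be done with a stable hash of the user ID. The sketch below uses SHA-256 rather than Python's built-in `hash()`, because string hashes from `hash()` are randomized per process; the function name and the queue count are assumptions for the example.

```python
import hashlib


def queue_for_user(user_id, num_queues):
    """Map a user ID to a queue index that is stable across processes."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_queues


NUM_QUEUES = 8  # illustrative value
first_index = queue_for_user("user_1", NUM_QUEUES)
second_index = queue_for_user("user_1", NUM_QUEUES)  # same queue as before
```

Note that changing `num_queues` remaps most users, so ordering is only preserved while the queue count stays fixed.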


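How do we design message priority?

We can add a module in front of the queue to prioritize:

  • First priority: OTP (one-time password). Without it the user cannot log in to the game!

  • Transaction notifications, e.g. "Your package has arrived, please sign for it" or "After a 2-hour wait, your table at Little Sheep hot pot is finally ready."

  • Promotion messages, e.g. "Congratulations, our clothes are 3% off this month!"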
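One way to sketch the prioritizing module is a heap keyed by priority class. The class name and the numeric priority values are assumptions; the three categories are the ones listed above.

```python
import heapq
import itertools

# Lower number means higher priority.
PRIORITY = {"otp": 0, "transaction": 1, "promotion": 2}


class PriorityDispatcher:
    """Hands out messages by priority, FIFO within the same priority."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker preserves arrival order

    def push(self, kind, message):
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._counter), message))

    def pop(self):
        return heapq.heappop(self._heap)[2]


dispatcher = PriorityDispatcher()
dispatcher.push("promotion", "3% off this month")
dispatcher.push("otp", "Your code is 123456")
dispatcher.push("transaction", "Your package has arrived")
order = [dispatcher.pop() for _ in range(3)]
```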
