Design notification system (scott)

Problem

Create a system for your company that supports notifications. The notification types include:

In-app notification, like the Apple / Android built-in notifications
Email notification
Phone notification
SMS notification

Integrations with third-party services such as SendGrid and Twilio.

Support more than one delivery mode:

At least once
At most once
Exactly once [if possible]

A unified interface for other services to use your system. A real-time system dashboard that shows the processes and how many notifications are sent, in progress, and queued.

Business Use Case

MVP

Delivery of notifications to various receivers (apps, email, phone, SMS)
Support for different delivery modes (at least once, at most once)
Push / pull model for active subscribers / idle subscribers

Bonus

Delivery support for exactly once (2PC, transactional)
Recurring notification
Scheduled notification
Images/videos

Non Goal

Latency for delivering the notifications
Maintain the order of notifications

Constraints

High Availability
High Scalability
Flexibility

Traffic Estimation

Data points: Facebook has 200 M daily active users, with 5 notifications per user per day

DAU: 200 M
QPS: $200\,\text{M} \times 5 / (24 \times 3600) \approx 10^4$
Peak: $5 \times 10^4$ QPS (about 5x the average)
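
The estimate above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the names are illustrative, and `PEAK_FACTOR = 5` is an assumption inferred from the ratio between the peak and average figures, not a measured value.

```python
# Back-of-the-envelope traffic estimate using the figures quoted above.
DAILY_ACTIVE_USERS = 200_000_000   # 200 M DAU
NOTIFICATIONS_PER_USER = 5         # notifications per user per day
SECONDS_PER_DAY = 24 * 3600
PEAK_FACTOR = 5                    # assumption: peak traffic is ~5x the average


def average_qps(dau: int, per_user: int) -> float:
    """Average notifications per second over one day."""
    return dau * per_user / SECONDS_PER_DAY


avg = average_qps(DAILY_ACTIVE_USERS, NOTIFICATIONS_PER_USER)
peak = avg * PEAK_FACTOR
print(f"average ~ {avg:,.0f} QPS, peak ~ {peak:,.0f} QPS")
```

The exact average is slightly above $10^4$ (roughly $1.16 \times 10^4$), which is why the notes round it to the nearest order of magnitude.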

High-level design

Kafka/Flink can be used to build the monitoring system.
- Note: logs cannot be used to drive the real-time dashboard, because log data can be lost.

API Design

createTopic(TopicName, ServiceType, Metadata)

Example data: Ads_campian_1234, In_app, Priority, SecurityMetadata
Topic - Topic ID, Topic Name, Service Type, Topic MetaData, Messages

send(TopicID, SEND_MODE, Message)

SEND_MODE: at_least_once, at_most_once, exactly_once

subscribe(TopicID, SUB_MODE)

SUB_MODE, Priority
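
A minimal in-memory sketch of what this interface could look like. Everything here is illustrative rather than a definitive design: the class and method names are hypothetical, `send` carries the message body as its third argument, and the `SubMode` values (push / pull) are an assumption based on the push / pull model listed in the MVP.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum


class SendMode(Enum):
    AT_LEAST_ONCE = "at_least_once"
    AT_MOST_ONCE = "at_most_once"
    EXACTLY_ONCE = "exactly_once"


class SubMode(Enum):
    # Assumption: SUB_MODE maps to the push / pull model from the MVP list.
    PUSH = "push"
    PULL = "pull"


@dataclass
class Topic:
    topic_id: str
    topic_name: str
    service_type: str
    metadata: dict
    messages: list = field(default_factory=list)
    subscribers: list = field(default_factory=list)


class NotificationService:
    """In-memory stand-in for the unified interface (hypothetical)."""

    def __init__(self) -> None:
        self._topics: dict[str, Topic] = {}

    def create_topic(self, topic_name: str, service_type: str, metadata: dict) -> str:
        topic_id = uuid.uuid4().hex
        self._topics[topic_id] = Topic(topic_id, topic_name, service_type, metadata)
        return topic_id

    def send(self, topic_id: str, send_mode: SendMode, message: str) -> str:
        """Record the message on the topic and return a receipt (message ID)."""
        message_id = uuid.uuid4().hex
        self._topics[topic_id].messages.append((message_id, send_mode, message))
        return message_id

    def subscribe(self, topic_id: str, sub_mode: SubMode, priority: int = 0) -> None:
        self._topics[topic_id].subscribers.append((sub_mode, priority))


service = NotificationService()
topic_id = service.create_topic("Ads_campian_1234", "In_app", {"priority": 1})
service.subscribe(topic_id, SubMode.PUSH, priority=1)
receipt = service.send(topic_id, SendMode.AT_LEAST_ONCE, "hello")
```

Using enums for the modes keeps callers from passing arbitrary strings, which matters when many services share one interface.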

Database Design

Message Storage Table - NoSQL

DynamoDB
Cassandra - write heavy - Cassandra's log-structured merge tree (LSM tree) is well suited to write-heavy workloads. It also has a multi-master architecture and partitions data across all nodes.

Message Storage Table (DynamoDB)

| MessageID (PartitionKey) | Timestamp (sortKey) | topicID | message | senderID |
| --- | --- | --- | --- | --- |
| abc_123 | 897987686 | 223 | "hello world" | 112 |

Metadata Table

| MessageID | Status | SendMode | ServiceType | ReceiverID | timestamps |
| --- | --- | --- | --- | --- | --- |
| abc_123 | PENDING | AT_LEAST_ONCE | | 112 | 24253535 |

Detailed Design

Message Status

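Message status: PENDING, SENDING, DELIVERED / FAILED (CLICK | UNSUBSCRIBE)
Once the data from the publisher has been saved to the database, we can tell the publisher that its notification has been received.
The advantage is high availability: as soon as the message is saved we acknowledge the publisher, and an async thread later reads the PENDING records from the database.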
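
The persist-then-acknowledge idea can be sketched as follows. This is a simplified in-memory illustration, not the actual storage layer: `MessageStore`, `accept`, and `process_pending` are hypothetical names, the dict stands in for the database, and `process_pending` represents a single pass of the async worker.

```python
import uuid
from enum import Enum


class Status(Enum):
    PENDING = "PENDING"
    SENDING = "SENDING"
    DELIVERED = "DELIVERED"
    FAILED = "FAILED"


class MessageStore:
    """Dict-backed stand-in for the message database (hypothetical)."""

    def __init__(self) -> None:
        self._rows = {}

    def accept(self, message: str) -> str:
        """Persist the message first, then acknowledge by returning a receipt."""
        message_id = uuid.uuid4().hex
        self._rows[message_id] = {"message": message, "status": Status.PENDING}
        return message_id

    def status(self, message_id: str) -> Status:
        """Lets the publisher poll the receipt."""
        return self._rows[message_id]["status"]

    def process_pending(self, deliver) -> None:
        """One pass of the async worker: pick up PENDING rows and try to send them."""
        for row in self._rows.values():
            if row["status"] is not Status.PENDING:
                continue
            row["status"] = Status.SENDING
            try:
                deliver(row["message"])
                row["status"] = Status.DELIVERED
            except Exception:
                row["status"] = Status.FAILED


store = MessageStore()
receipt = store.accept("Your code is 1234")   # publisher is acknowledged here
store.process_pending(lambda message: None)    # worker delivers later
```

Because the publisher only waits for the write, a slow or failing downstream provider does not block the caller.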

Life Cycle (Service_A sends an SMS to User_1)

1. Call the API with metadata and the message: send(topicID, at_least_once, message). The message status is set to PENDING. (We cannot acknowledge the client at this point, because the server might crash. Only after step 4 stores the message in the DB can we confirm receipt to the client.)
2. The LB routes the message to Kafka/Flink for monitoring.
3. Call the Metadata Service to get the topic object (JSON) // new topic including topic storage
4. Store the message (update the message status to SENDING/FAILED) -> return a message receipt to the client; the client can poll the receipt to check status. (Only once the message is stored can we confirm receipt to the customer!)
5. The Sender sends the message.
6. If the Sender hits a timeout or exception -> retry queue (DLQ) -> (publish an event to a Kafka topic for the monitoring system, update the message status to SENDING)
7. Send to SMS/Email/Phone -> the provider sends back an Ack.

How do we prevent data loss?

We save a notification log in the database. After a worker takes a message from the queue, it also saves a notification log.

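Will a user receive a notification only once?

We cannot guarantee that. In practice a user may well receive the same notification multiple times, so we also need a dedupe mechanism on the client side.
We can deduplicate by notificationID. The server side can also add a filter, though that is mainly to stop spam from alerting the user repeatedly.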
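
A minimal sketch of client-side deduplication by notification ID. The class name, the bounded capacity, and the least-recently-seen eviction policy are assumptions for illustration; a real client would also need to persist the seen IDs across restarts.

```python
from collections import OrderedDict


class NotificationDeduper:
    """Remembers recently seen notification IDs, up to a fixed capacity."""

    def __init__(self, capacity: int = 10_000) -> None:
        self._seen = OrderedDict()
        self._capacity = capacity

    def should_display(self, notification_id: str) -> bool:
        """Return True the first time an ID is seen, False for duplicates."""
        if notification_id in self._seen:
            self._seen.move_to_end(notification_id)  # refresh recency
            return False
        self._seen[notification_id] = True
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the least recently seen ID
        return True


deduper = NotificationDeduper()
first = deduper.should_display("abc_123")    # first delivery: show it
second = deduper.should_display("abc_123")   # retry of the same message: drop it
```

Bounding the set keeps memory use constant on the device, at the cost of possibly showing a very old duplicate again.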

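Use templates to speed things up: Notification Template

Emails are often very similar, differing only in details such as the date and the name. Offer letters and rejection letters, for example, use ready-made content, so we only need to fill the template with the recipient's personal information.
Formatting is less error-prone, and rendering is faster.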
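
Template filling can be sketched with the standard library's `string.Template`. The template text and the field names (`name`, `role`, `start_date`) are made up for illustration.

```python
from string import Template

# Hypothetical offer-letter template; only the placeholders vary per recipient.
OFFER_TEMPLATE = Template(
    "Dear $name,\n"
    "We are pleased to offer you the $role position, starting on $start_date."
)


def render(template: Template, **fields: str) -> str:
    """Fill the template; substitute() raises KeyError if a field is missing."""
    return template.substitute(**fields)


body = render(OFFER_TEMPLATE, name="Alice", role="Engineer", start_date="2024-01-01")
```

Failing loudly on a missing field is a deliberate choice here: a half-filled notification is worse than one that is rejected before sending.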

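Retry when sending fails

Problems in downstream dependencies are normal; for example, Firebase goes down and the message is not sent. The task is then put back on the queue. Suppose we retry 3 times (a configured max retry number) and it still fails: we then need to notify the producer (the sender) and have the on-call engineer investigate.
Backoff retry mechanism (as in SNS retries): retry quickly at first, then gradually increase the interval between retries.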
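
A minimal sketch of bounded retries with exponential backoff. The function names, the default delays, and the doubling factor are assumptions for illustration and do not reproduce the actual SNS retry policy; the `sleep` parameter is injectable so the logic can be exercised without waiting.

```python
import time


class DeliveryFailed(Exception):
    """Raised once every retry has been used up; the producer should be told."""


def backoff_delays(max_retries: int = 3, base_delay: float = 0.5, factor: float = 2.0) -> list:
    """Delays before each retry: short at first, then growing."""
    return [base_delay * factor ** attempt for attempt in range(max_retries)]


def send_with_retry(send, message, max_retries=3, base_delay=0.5, factor=2.0, sleep=time.sleep):
    """Try send(message) once, then retry up to max_retries times with backoff."""
    delays = backoff_delays(max_retries, base_delay, factor)
    for attempt in range(max_retries + 1):
        try:
            return send(message)
        except Exception as error:
            if attempt == max_retries:
                raise DeliveryFailed(f"gave up after {max_retries} retries") from error
            sleep(delays[attempt])
```

With the defaults, the retries wait 0.5 s, 1 s, and 2 s. Production systems usually add random jitter to these delays so that many failed tasks do not retry at the same instant.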

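Do we guarantee that messages are delivered in order?

No. This is the same problem as delivering exactly once: the network can fail and the user's phone can fail to receive, so once retries are involved we cannot guarantee ordering.

We can set up multiple queues and use hashing to put messages for the same user into the same queue wherever possible. This keeps messages in order as far as possible.
- In addition, each queue is accessed by a single worker. If multiple workers access the same queue at once, things can easily go wrong, for example duplicate messages.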
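
Routing each user to a fixed queue can be sketched with a stable hash. The function names and the queue count are hypothetical. SHA-256 is used instead of Python's built-in `hash()` because the latter is randomized per process for strings, which would send the same user to different queues on different hosts.

```python
import hashlib


def queue_for_user(user_id: str, num_queues: int) -> int:
    """Map a user ID to a queue index that is stable across processes."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_queues


def route(messages, num_queues: int) -> list:
    """Distribute (user_id, message) pairs; each user's messages stay in one queue."""
    queues = [[] for _ in range(num_queues)]
    for user_id, message in messages:
        queues[queue_for_user(user_id, num_queues)].append((user_id, message))
    return queues


queues = route([("user_1", "a"), ("user_2", "b"), ("user_1", "c")], num_queues=4)
```

Note that changing the number of queues remaps most users, so ordering is only preserved while the queue count stays fixed.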

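How do we design message priority?

We can add a module in front of the queue to prioritize messages.

First priority: OTP (one-time password). Without it, the user cannot log in to the game!
Transaction notifications: "Hello, your package has arrived, please sign for it", or "Your table at Little Sheep (小肥羊) is finally ready after a 2-hour wait."
Promotion messages: "Congratulations, our clothes are 3% off this month!"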
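
A minimal sketch of such a prioritizing module, built on a binary heap. It assumes the three classes listed above are ordered from highest to lowest priority; the class names and numeric levels are illustrative.

```python
import heapq
import itertools
from enum import IntEnum


class Priority(IntEnum):
    OTP = 0          # lowest number is dequeued first
    TRANSACTION = 1
    PROMOTION = 2


class PriorityNotificationQueue:
    """Dequeues by priority; FIFO among messages of the same priority."""

    def __init__(self) -> None:
        self._heap = []
        self._counter = itertools.count()  # tie-breaker that preserves arrival order

    def put(self, priority: Priority, message: str) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), message))

    def get(self) -> str:
        _priority, _order, message = heapq.heappop(self._heap)
        return message

    def __len__(self) -> int:
        return len(self._heap)


pq = PriorityNotificationQueue()
pq.put(Priority.PROMOTION, "3% off this month")
pq.put(Priority.TRANSACTION, "Your package has arrived")
pq.put(Priority.OTP, "Your code is 1234")
next_message = pq.get()  # the OTP, even though it was enqueued last
```

A single strict-priority queue like this can starve promotion messages under sustained load; separate queues per priority with weighted consumption are a common alternative.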
