Design notification system (scott)
Problem
Create a system for your company that supports notifications. The notification types include:
In-app notifications, like the built-in Apple / Android notifications
Email notification
Phone notification
SMS notification
Integration with various third-party services such as SendGrid and Twilio.
Support more than one delivery method:
At least once
At most once
Exactly once [if possible]
A unified interface for other services to use your system. A real-time system dashboard that shows the processes and how many notifications are sent, in progress, and queued.
Business Use Case
MVP
Delivery of notifications to various receivers (apps, email, phone, SMS)
Delivery supports different modes (at least once, at most once)
Push / pull model for active subscribers / idle subscribers
Bonus
Delivery support for exactly once (2PC, transactional)
Recurring notification
Scheduled notification
Images/videos
Non Goal
Latency for delivering the notifications
Maintain the order of notifications
Constraints
High Availability
High Scalability
Flexibility
Traffic Estimation
Data points: Facebook has 200M daily active users, with 5 notifications per user per day
DAU: 200M
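The data points above imply a daily volume and an average send rate. The short calculation below is only a back-of-envelope sketch: it assumes traffic is spread evenly over the day, which the notes do not state, and the variable names are illustrative.

```python
# Back-of-envelope estimate from the data points above.
DAU = 200_000_000                 # daily active users
NOTIFICATIONS_PER_USER = 5        # notifications per user per day
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400

daily_notifications = DAU * NOTIFICATIONS_PER_USER
# Assumption: traffic is uniform over the day; real peaks will be higher.
average_per_second = daily_notifications / SECONDS_PER_DAY

print(f"{daily_notifications:,} notifications/day")
print(f"~{average_per_second:,.0f} notifications/second on average")
```

This works out to 1 billion notifications per day, or roughly 11,574 per second on average.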
High-level design
Kafka/Flink can be used to build the monitoring system.
Note: logs cannot be used to drive the real-time dashboard, because log data can be lost.
API Design
createTopic(TopicName, ServiceType, Metadata)
Example data: Ads_campaign_1234, In_app, Priority, SecurityMetadata
Topic - Topic ID, Topic Name, Service Type, Topic MetaData, Messages
send(TopicID, SEND_MODE, Message)
SEND_MODE: at_least_once, at_most_once, exactly_once
subscribe(TopicID, SUB_MODE)
SUB_MODE, Priority
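To make the three calls concrete, here is a minimal in-memory sketch of the interface. It is not a real implementation: the class name NotificationService, the SendMode enum, the "push" subscription mode, the receipt value, and the message argument to send are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from itertools import count


class SendMode(Enum):
    AT_LEAST_ONCE = "at_least_once"
    AT_MOST_ONCE = "at_most_once"
    EXACTLY_ONCE = "exactly_once"


@dataclass
class Topic:
    topic_id: int
    topic_name: str
    service_type: str
    metadata: dict
    messages: list = field(default_factory=list)
    subscribers: list = field(default_factory=list)


class NotificationService:
    """Hypothetical in-memory stand-in for the unified interface."""

    def __init__(self):
        self._topics = {}
        self._ids = count(1)

    def create_topic(self, topic_name, service_type, metadata):
        topic = Topic(next(self._ids), topic_name, service_type, metadata)
        self._topics[topic.topic_id] = topic
        return topic.topic_id

    def send(self, topic_id, send_mode, message):
        topic = self._topics[topic_id]
        topic.messages.append((send_mode, message))
        # Hypothetical receipt: the message's position within the topic.
        return len(topic.messages)

    def subscribe(self, topic_id, sub_mode, priority=0):
        self._topics[topic_id].subscribers.append((sub_mode, priority))


service = NotificationService()
topic_id = service.create_topic("Ads_campaign_1234", "In_app", {"priority": 1})
receipt = service.send(topic_id, SendMode.AT_LEAST_ONCE, "hello")
service.subscribe(topic_id, "push", priority=1)
```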
Database Design
Message Storage Table - NoSQL
DynamoDB
Cassandra - write heavy - Cassandra's log-structured merge tree is well suited to write-heavy workloads. It also has a multi-master architecture and partitions data across all nodes.
Message Storage Table (DynamoDB)
abc_123 | 897987686 | 223 | "hello word" | 112
Metadata Table
abc_123 | PENDING | AT_LEAST_ONCE | 112 | 24253535
Detailed Design
Message Status
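Message status: PENDING, SENDING, DELIVERED/FAILED (CLICK | UNSUBSCRIBE)
Once the data from the publisher has been stored in the database, we can tell the publisher that its notification was received.
The advantage is high availability: as soon as the record is saved we acknowledge the publisher, and an async thread later reads the PENDING records from the database.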
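The persist-then-acknowledge flow can be sketched as follows. This is a toy model under stated assumptions: a dict stands in for the database, the worker runs as a plain function call instead of a background thread, and the names accept, drain_pending, and deliver are hypothetical.

```python
import itertools

PENDING, SENDING, DELIVERED, FAILED = "PENDING", "SENDING", "DELIVERED", "FAILED"

db = {}                       # stand-in for the message table: id -> record
_ids = itertools.count(1)


def accept(message):
    """Persist the message as PENDING, then acknowledge the publisher."""
    message_id = next(_ids)
    db[message_id] = {"body": message, "status": PENDING}
    return message_id         # the receipt is returned only after the write


def drain_pending(deliver):
    """One pass of the async worker: pick up PENDING records and send them."""
    for record in db.values():
        if record["status"] != PENDING:
            continue
        record["status"] = SENDING
        # deliver() is a hypothetical callable returning True on success.
        record["status"] = DELIVERED if deliver(record["body"]) else FAILED


receipt = accept("hello")     # publisher gets a receipt immediately
drain_pending(lambda body: True)
print(db[receipt]["status"])
```

The publisher's call returns as soon as the record is written, so a slow or failing downstream channel does not block it.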
Life Cycle (Service_A sends an SMS to User_1)
1. Call the API with metadata and the message: send(topicID, at_least_once, message). The message status is labeled PENDING. (We cannot acknowledge the client at this point, because the server may crash. Only after the message is stored in the DB in step 4 can we tell the client it was received.)
2. The LB routes the message to Kafka/Flink for monitoring.
3. Call the Metadata Service to get the topic object (JSON) // new topic including topic storage
4. Store the message (update the msg status to SENDING/FAILED) -> return to the client with a message receipt; the client can poll the receipt to check the status. (Only once the message is safely stored can we tell the customer it was received!)
5. The Sender sends the message.
6. If the Sender hits a timeout or exception -> retry queue (DLQ) -> (publish to a Kafka topic for the monitoring system, update the msg status to SENDING)
7. Send to SMS/Email/Phone -> an Ack is sent back.
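How do we prevent data loss?
We persist a notification log in the database. After a worker takes a task from the queue, it also saves a notification log.
Will a user receive each notification only once?
We cannot guarantee that. In practice a user may well receive the same notification multiple times, so we also need a dedupe mechanism on the client side.
We can dedupe by notificationID. The server side can also add filtering, but that is mainly to prevent spam-like repeated reminders.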
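Client-side dedupe by notificationID could look like the sketch below. It assumes every notification carries a unique ID and keeps seen IDs in memory without eviction; a real client would likely bound or persist this set. The class name NotificationDeduper is hypothetical.

```python
class NotificationDeduper:
    """Remembers notification IDs so a redelivered copy is shown only once."""

    def __init__(self):
        self._seen = set()

    def should_display(self, notification_id):
        if notification_id in self._seen:
            return False          # duplicate caused by a retry; drop it
        self._seen.add(notification_id)
        return True


deduper = NotificationDeduper()
incoming = ["n1", "n2", "n1"]     # "n1" is delivered twice
shown = [n for n in incoming if deduper.should_display(n)]
print(shown)
```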
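Use templates to speed things up: Notification Template
Many emails are similar and differ only in the date and name, for example an offer or a rejection letter. The data is already available, so we only need to fill the personal information into the template.
Formatting is less error-prone, and it is faster.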
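Filling a template with per-recipient fields can be sketched with Python's standard string.Template. The template text, the field names name and date, and the render helper are made up for illustration.

```python
from string import Template

# Hypothetical offer-letter template; only $name and $date vary per recipient.
OFFER_TEMPLATE = Template(
    "Dear $name, we are pleased to extend you an offer. Please reply by $date."
)


def render(template, **fields):
    # substitute() raises KeyError if a placeholder is missing, which
    # catches incomplete data before anything is sent.
    return template.substitute(**fields)


message = render(OFFER_TEMPLATE, name="Alice", date="2024-01-31")
print(message)
```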
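Retry on delivery failure
Problems in downstream dependencies are normal; for example, Firebase goes down and the message is not sent. The task is put back on the queue. Suppose we retry 3 times (a configured max retry number) and it still fails: we then need to notify the producer (the sender), and the on-call engineer should investigate.
Backoff retry mechanism (as in SNS retries): retry quickly at first, then progressively increase the interval between retries.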
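A bounded retry with growing delays might look like the following sketch. The function name send_with_retry, the doubling factor, and the default delays are assumptions, not values taken from SNS; the sleep function is injectable so the example runs without actually waiting.

```python
import time


def send_with_retry(send, max_retries=3, base_delay=1.0, factor=2.0, sleep=time.sleep):
    """Call send(); on failure retry up to max_retries times with growing delays.

    Returns True on success and False once retries are exhausted, at which
    point the caller would notify the producer and alert on-call.
    """
    delay = base_delay
    for attempt in range(max_retries + 1):    # first try + max_retries retries
        try:
            send()
            return True
        except Exception:
            if attempt == max_retries:
                return False
            sleep(delay)
            delay *= factor                   # wait longer before the next retry


def always_down():
    raise ConnectionError("downstream unavailable")


delays = []
ok = send_with_retry(always_down, sleep=delays.append)  # record delays instead of sleeping
print(ok, delays)
```

In practice some random jitter is often added to each delay so that many failed tasks do not retry at the same instant.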
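Are our messages guaranteed to be delivered in order?
No. This is the same problem as delivering exactly once: the network can fail, and the user's phone can fail to receive, so with retries in place we cannot guarantee ordering.
We can set up multiple queues and use hashing to put messages for the same user into the same queue whenever possible. This keeps messages in order as far as possible.
Also, each queue is accessed by a single worker. If multiple workers access the same queue at the same time, it is easy to mess things up, for example with duplicate messages.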
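Routing a user's messages to a fixed queue can be sketched with a stable hash. The queue count of 8 and the function name queue_for_user are hypothetical; hashlib is used instead of Python's built-in hash(), which is randomized per process for strings and so would not route consistently across workers.

```python
import hashlib

NUM_QUEUES = 8  # hypothetical number of queues, one worker per queue


def queue_for_user(user_id, num_queues=NUM_QUEUES):
    """Map a user ID to a queue index that is stable across processes."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_queues


# Every message for the same user lands in the same queue.
first = queue_for_user("user_1")
second = queue_for_user("user_1")
print(first == second)
```

Note that changing the number of queues remaps most users, so ordering is only preserved while the queue count stays fixed.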
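How do we design message priority?
We can add a module in front of the queue to prioritize messages.
First priority: OTP (one-time password). Without it the user cannot log in to the game!
Transaction notifications, e.g. "Your package has arrived, please sign for it" or "After a 2-hour wait, your table at Little Sheep hot pot is finally ready."
Promotion messages, e.g. "Congratulations, our clothes are 3% off this month!"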
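The prioritizing module can be sketched as a heap-backed dispatcher. The three priority levels follow the list above; the class name PriorityDispatcher and the sequence counter used to keep first-in-first-out order within one level are assumptions.

```python
import heapq
from enum import IntEnum
from itertools import count


class Priority(IntEnum):
    OTP = 0            # highest priority
    TRANSACTION = 1
    PROMOTION = 2      # lowest priority


class PriorityDispatcher:
    """Hands out the most urgent message first, FIFO within a priority."""

    def __init__(self):
        self._heap = []
        self._seq = count()   # tie-breaker that preserves arrival order

    def push(self, priority, message):
        heapq.heappush(self._heap, (priority, next(self._seq), message))

    def pop(self):
        return heapq.heappop(self._heap)[2]


dispatcher = PriorityDispatcher()
dispatcher.push(Priority.PROMOTION, "sale this month")
dispatcher.push(Priority.TRANSACTION, "package arrived")
dispatcher.push(Priority.OTP, "your login code")
order = [dispatcher.pop() for _ in range(3)]
print(order)
```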