Design a newsfeed system
Design a newsfeed system
Requirements
Functional requirements
generate newsfeed
subscribe others, publish feed
get/store text, image, video info
user get notification and get published feed
Non-functional requirements
Our system should be able to generate any user’s newsfeed in real-time - maximum latency seen by the end user would be 2s.
A post shouldn’t take more than 5s to make it to a user’s feed assuming a new newsfeed request comes in.
Traffic
DAU: 30 M
each user use 10 times/day
QPS
30 M * 10 / 86400 = 3500 requests/s
Storage
each user keep 500 posts, each post is 1 kb
30 M * 500 * 1kb = 15 TB
if each server is 100 GB -> 150 server
API design
Feed publish
POST /v1/me/publishFeed
Params:
user_id(int)
auth_token(string): request authentication
content(string)
News feed retrieve
GET /v1/me/getFeeds
Params:
user_id(int)
auth_token(string): request authentication
pagination
Database Design
User: 可以是自己和别的用户好友,可以订阅公众号
Entity (e.g. page, group, etc.):(公众号)用户和公众号都可以发文章
FeedItem (or Post): (一篇微信公众号文章,或者一个朋友圈)我们加个表格记录一下是谁发的
然后还可以加 like table,comment table
hot 的内容保存在 memory,disk 为辅助。queue 用来 publish,用户从 queue 里 poll。
High-level Design
Application Server
authentication and rate-limiting
Only users signed in with valid auth_token are allowed to make posts
limits the number of posts a user can make within a certain period
Detailed Design
Newsfeed building
User: a user sends a request to retrieve her news feed. The request looks like this:
/v1/me/feed
.Load balancer: load balancer redirects traffic to web servers.
Web servers: web servers route requests to newsfeed service.
Newsfeed service: news feed service fetches news feed from the cache.
Newsfeed cache: store news feed IDs needed to render the news feed.
Fanout Service
the process of delivering a post to all friends
two types of fanout modes:
fanout on write (also called push model)
fanout on read (also called pull model)
Push Model
news feed is pre-computed during write time
a new post is delivered to friends’ cache immediately after it is published.
Pros:
The news feed is generated in real-time and can be pushed to friends immediately
Fetching news feed is fast because the news feed is pre-computed during write time.
Cons:
If a user has many friends, fetching the friend list and generating news feeds for all of them are slow and time consuming. It is called hotkey problem.
For inactive users or those rarely log in, pre-computing news feeds waste computing resources.
Pull Model
news feed is generated during read time
Recent posts are pulled when a user loads her home page.
Pros:
Data is not pushed to friends so there is no hotkey problem.
For inactive users or those who rarely log in, fanout on read works better because it will not waste computing resources on them.
Cons:
Fetching the news feed is slow as the news feed is not pre-computed
Hybrid Model (Recommend)
push model for the majority of users
pull model for celebrities (who have many friends/followers)
Consistent hashing is a useful technique to mitigate the hotkey problem as it helps to distribute requests/data more evenly.
Fanout lifecyle
Fetch friend IDs from the graph database. Graph databases(like neo4j) are suited for managing friend relationship and friend recommendations.
Get friends info from the user cache. The system then filters out friends based on user settings.
For example, if you mute someone, her posts will not show up on your news feed even though you are still friends.
Another reason why posts may not show is that a user could selectively share information with specific friends or hide it from other people.
Send friends list and new post ID to the message queue.
Fanout workers fetch data from the message queue and store news feed data in the news feed cache.
You can think of the news feed cache as a
<post_id, user_id>
mapping table.Whenever a new post is made, it will be appended to the news feed table.
Store
<post_id, user_id>
in news feed cache.
Newsfeed retrieval deep dive
A user sends a request to retrieve her news feed. The request looks like this:
/v1/me/getFeeds
The load balancer redistributes requests to web servers.
Web servers call the news feed service to fetch news feeds.
News feed service gets a list post IDs from the news feed cache.
A user’s news feed is more than just a list of feed IDs.
It contains username, profile picture, post content, post image, etc.
Thus, the news feed service fetches the complete user and post objects from caches (user cache and post cache) to construct the fully hydrated news feed.
The fully hydrated news feed is returned in JSON format back to the client for rendering.
Cache architecture
Cache is extremely important for a news feed system. We divide the cache tier into 5 layers:
News Feed: It stores IDs of news feeds
Content: It stores every post data. Popular content is stored in hot cache.
Social Graph: It stores user relationship data.
Action: It stores info about whether a user liked a post, replied a post, or took other actions on a post.
Counters: It stores counters for like, reply, follower, following, etc.
Other Talk Points
Scaling the database:
Vertical scaling vs Horizontal scaling
SQL vs NoSQL
Master-slave replication
Read replicas
Consistency models
Database sharding
Other talking points:
Keep web tier stateless
Cache data as much as you can
Support multiple data centers
Lose couple components with message queues
Monitor key metrics. For instance, QPS during peak hours and latency while users refreshing their news feed are interesting to monitor.
Last updated
Was this helpful?