Design a newsfeed system

Design a newsfeed system

Requirements

Functional requirements

  • generate newsfeed

  • subscribe others, publish feed

  • get/store text, image, video info

  • user get notification and get published feed

Non-functional requirements

  • Our system should be able to generate any user’s newsfeed in real-time - maximum latency seen by the end user would be 2s.

  • A post shouldn’t take more than 5s to make it to a user’s feed assuming a new newsfeed request comes in.


Traffic

  • DAU: 30 M

  • each user use 10 times/day

QPS

30 M * 10 / 86400 = 3500 requests/s

Storage

  • each user keep 500 posts, each post is 1 kb

30 M * 500 * 1kb = 15 TB

  • if each server is 100 GB -> 150 server


API design

Feed publish

POST /v1/me/publishFeed

  • Params:

    • user_id(int)

    • auth_token(string): request authentication

    • content(string)

News feed retrieve

GET /v1/me/getFeeds

  • Params:

    • user_id(int)

    • auth_token(string): request authentication

    • pagination


Database Design

  • User: 可以是自己和别的用户好友,可以订阅公众号

  • Entity (e.g. page, group, etc.):(公众号)用户和公众号都可以发文章

  • FeedItem (or Post): (一篇微信公众号文章,或者一个朋友圈)我们加个表格记录一下是谁发的

  • 然后还可以加 like table,comment table

hot 的内容保存在 memory,disk 为辅助。queue 用来 publish,用户从 queue 里 poll。


High-level Design

Application Server

  • authentication and rate-limiting

  • Only users signed in with valid auth_token are allowed to make posts

  • limits the number of posts a user can make within a certain period


Detailed Design

Newsfeed building

  • User: a user sends a request to retrieve her news feed. The request looks like this: /v1/me/feed.

  • Load balancer: load balancer redirects traffic to web servers.

  • Web servers: web servers route requests to newsfeed service.

  • Newsfeed service: news feed service fetches news feed from the cache.

  • Newsfeed cache: store news feed IDs needed to render the news feed.


Fanout Service

  • the process of delivering a post to all friends

  • two types of fanout modes:

    • fanout on write (also called push model)

    • fanout on read (also called pull model)

Push Model

  • news feed is pre-computed during write time

  • a new post is delivered to friends’ cache immediately after it is published.

  • Pros:

    • The news feed is generated in real-time and can be pushed to friends immediately

    • Fetching news feed is fast because the news feed is pre-computed during write time.

  • Cons:

    • If a user has many friends, fetching the friend list and generating news feeds for all of them are slow and time consuming. It is called hotkey problem.

    • For inactive users or those rarely log in, pre-computing news feeds waste computing resources.

Pull Model

  • news feed is generated during read time

  • Recent posts are pulled when a user loads her home page.

  • Pros:

    • Data is not pushed to friends so there is no hotkey problem.

    • For inactive users or those who rarely log in, fanout on read works better because it will not waste computing resources on them.

  • Cons:

    • Fetching the news feed is slow as the news feed is not pre-computed

Hybrid Model (Recommend)

  • push model for the majority of users

  • pull model for celebrities (who have many friends/followers)

  • Consistent hashing is a useful technique to mitigate the hotkey problem as it helps to distribute requests/data more evenly.

Fanout lifecyle

  1. Fetch friend IDs from the graph database. Graph databases(like neo4j) are suited for managing friend relationship and friend recommendations.

  2. Get friends info from the user cache. The system then filters out friends based on user settings.

    • For example, if you mute someone, her posts will not show up on your news feed even though you are still friends.

    • Another reason why posts may not show is that a user could selectively share information with specific friends or hide it from other people.

  3. Send friends list and new post ID to the message queue.

  4. Fanout workers fetch data from the message queue and store news feed data in the news feed cache.

    • You can think of the news feed cache as a <post_id, user_id> mapping table.

    • Whenever a new post is made, it will be appended to the news feed table.

  5. Store <post_id, user_id> in news feed cache.


Newsfeed retrieval deep dive

  1. A user sends a request to retrieve her news feed. The request looks like this: /v1/me/getFeeds

  2. The load balancer redistributes requests to web servers.

  3. Web servers call the news feed service to fetch news feeds.

  4. News feed service gets a list post IDs from the news feed cache.

  5. A user’s news feed is more than just a list of feed IDs.

    • It contains username, profile picture, post content, post image, etc.

    • Thus, the news feed service fetches the complete user and post objects from caches (user cache and post cache) to construct the fully hydrated news feed.

  6. The fully hydrated news feed is returned in JSON format back to the client for rendering.


Cache architecture

Cache is extremely important for a news feed system. We divide the cache tier into 5 layers:

  • News Feed: It stores IDs of news feeds

  • Content: It stores every post data. Popular content is stored in hot cache.

  • Social Graph: It stores user relationship data.

  • Action: It stores info about whether a user liked a post, replied a post, or took other actions on a post.

  • Counters: It stores counters for like, reply, follower, following, etc.


Other Talk Points

Scaling the database:

  • Vertical scaling vs Horizontal scaling

  • SQL vs NoSQL

  • Master-slave replication

  • Read replicas

  • Consistency models

  • Database sharding

Other talking points:

  • Keep web tier stateless

  • Cache data as much as you can

  • Support multiple data centers

  • Lose couple components with message queues

  • Monitor key metrics. For instance, QPS during peak hours and latency while users refreshing their news feed are interesting to monitor.

Last updated