Design Autocomplete System
Design Autocomplete System
Requirement
Functional requirements
Google auto complete suggestions
How many suggestions: 10
Use frequency
How long to track data: 10 days
English
User -> customization
Location -> focus on US market first
Non functional requirements
Low latency
High QPS: 20 characters -> call 20 times
Scalability
Availability
Reliability
Traffic Estimation
Data points
daily search: 5 billion
new search: 5 billion * 0.2 = 1 billion
Estimation
QPS: 5 billion / 86400 = 60,000 queries/s
Storage:
20 characters, ASCII code, 20 bytes
1 billion * 20 bytes = 20 GB / day
10 day * 20 GB = 200 GB -> can put everything in memory
Data Structure - Trie
What is a trie?
A trie is a tree-like data structure.
The root represents an empty string.
Each node stores a character and has 26 children, one for each possible character. To save space, we do not draw empty links.
Each tree node represents a single word or a prefix string.
Basic trie data structure stores characters in nodes. To support sorting by frequency, frequency info needs to be included in nodes. Assume we have the following frequency table.
Steps to get top k most searched queries
Find the prefix. Time complexity: .
Traverse the subtree from the prefix node to get all valid children. A child is valid if it can form a valid query string. Time complexity:
Sort the children and get top k. Time complexity:
Assume k equals to 2 and a user types “tr” in the search box. The algorithm works as follows:
Step 1: Find the prefix node “tr”.
Step 2: Traverse the subtree to get all valid children. In this case, nodes [tree: 10], [true: 35], [try: 29] are valid.
Step 3: Sort the children and get top 2. [true: 35] and [try: 29] are the top 2 queries with prefix “tr”.
The time complexity of this algorithm is the sum of time spent on each step mentioned above:
Optimization for trie
1. Limit the max length of a prefix
Users rarely type a long search query into the search box. Thus, it is safe to say p is a small integer number, say 50. If we limit the length of a prefix, the time complexity for “Find the prefix” can be reduced from O(p) to O(small constant), aka O(1).
2. Cache top search queries at each node
To avoid traversing the whole trie, we store top k most frequently used queries at each node.
In our specific case, only the top 5 search queries are cached.
However, this design requires a lot of space to store top queries at every node.
Let us revisit the time complexity of the algorithm after applying those two optimizations:
Find the prefix node. Time complexity:
Return top k. Since top k queries are cached, the time complexity for this step is
High-level design
A search query is sent to the load balancer.
The load balancer routes the request to API servers.
API servers get trie data from Trie Cache and construct autocomplete suggestions for the client.
In case the data is not in Trie Cache, we replenish data back to the cache. This way, all subsequent requests for the same prefix are returned from the cache. A cache miss can happen when a cache server is out of memory or offline.
Detailed Design
Data gathering service
Analytics Logs
It stores raw data about search queries. Logs are append-only and are not indexed.
Aggregators
The size of analytics logs is usually very large, and data is not in the right format. We need to aggregate data so it can be easily processed by our system.
Depending on the use case, we may aggregate data differently. For real-time applications such as Twitter, we aggregate data in a shorter time interval as real-time results are important. On the other hand, aggregating data less frequently, say once per week, might be good enough for many use cases. During an interview session, verify whether real-time results are important. We assume trie is rebuilt weekly.
Aggregated Data
Workers
Workers are a set of servers that perform asynchronous jobs at regular intervals. They build the trie data structure and store it in Trie DB.
Trie Cache
Trie Cache is a distributed cache system that keeps trie in memory for fast read. It takes a weekly snapshot of the DB.
Query service
Query service requires lightning-fast speed. We propose the following optimizations:
AJAX request. For web applications, browsers usually send AJAX requests to fetch autocomplete results. The main benefit of AJAX is that sending/receiving a request/response does not refresh the whole web page.
Browser caching. For many applications, autocomplete search suggestions may not change much within a short time. Thus, autocomplete suggestions can be saved in browser cache to allow subsequent requests to get results from the cache directly. Google search engine uses the same cache mechanism.
Trie Operations
Create
Trie is created by workers using aggregated data. The source of data is from Analytics Log/DB.
Update
There are two ways to update the trie.
Option 1: Update the trie weekly. Once a new trie is created, the new trie replaces the old one.
Option 2: Update individual trie node directly. We try to avoid this operation because it is slow. However, if the size of the trie is small, it is an acceptable solution. When we update a trie node, its ancestors all the way up to the root must be updated because ancestors store top queries of children. Figure 13-13 shows an example of how the update operation works. On the left side, the search query “beer” has the original value 10. On the right side, it is updated to 30. As you can see, the node and its ancestors have the “beer” value updated to 30.
Delete
We have to remove hateful, violent, sexually explicit, or dangerous autocomplete suggestions. We add a filter layer (Figure 13-14) in front of the Trie Cache to filter out unwanted suggestions. Having a filter layer gives us the flexibility of removing results based on different filter rules. Unwanted suggestions are removed physically from the database asynchronically so the correct data set will be used to build trie in the next update cycle.
Database
Trie DB is the persistent storage. Two options are available to store the data:
Document store
Since a new trie is built weekly, we can periodically take a snapshot of it, serialize it, and store the serialized data in the database. Document stores like MongoDB are good fits for serialized data.
Key-value store
A trie can be represented in a hash table form by applying the following logic:
Every prefix in the trie is mapped to a key in a hash table.
Data on each trie node is mapped to a value in a hash table.
Scaling
Naive way - shard based on first character
If we need two servers for storage, we can store queries starting with ‘a’ to ‘m’ on the first server, and ‘n’ to ‘z’ on the second server.
If we need three servers, we can split queries into ‘a’ to ‘i’, ‘j’ to ‘r’ and ‘s’ to ‘z’.
Con: uneven distribution
Optimized way
analyze historical data distribution pattern and apply smarter sharding logic
The shard map manager maintains a lookup database for identifying where rows should be stored.
For example, if there are a similar number of historical queries for ‘s’ and for ‘u’, ‘v’, ‘w’, ‘x’, ‘y’ and ‘z’ combined, we can maintain two shards: one for ‘s’ and one for ‘u’ to ‘z’.
Other question
How do you extend your design to support multiple languages?
store Unicode characters in trie nodes
What if top search queries in one country are different from others?
might build different tries for different countries
we can store them in CDNs
How can we support the trending (real-time) search queries?
Reduce the working data set by sharding
Change the ranking model and assign more weight to recent search queries
Data may come as streams, so we do not have access to all the data at once. Streaming data means data is generated continuously. Stream processing requires a different set of systems: Apache Hadoop MapReduce, Apache Spark Streaming, Apache Storm, Apache Kafka, etc. Because all those topics require specific domain knowledge, we are not going into detail here.
Last updated