Designing a Web Crawler/Archive (scott)
10 mins: Business (Use Cases) / common sense (this is what separates beginners from veterans; most of your later improvement comes from this part)
10 mins: Constraints
10 mins: Architecture
10 mins: API Design
10 mins: DB choice
10 mins: Data model
The benefit of timeboxing the interview into these parts: if you get stuck on one, you can move on to the next.
Requirements
crawl Amazon/eBay/etc.; the user specifies a domain name to crawl
automatically discover new URLs
crawl periodically, e.g. daily
smart schema detection (adapts to the different kinds of content being crawled)
extract specific information such as product description, price, etc.
Goal
Main Goal
Highly scalable / flexible in scale
Fault-tolerant
idempotent - repeated runs have no side effects, and the result is the same every time; so if the current crawler fails, simply re-running it once is enough
Schema flexible / extensible
Not in Discussion
Server-side OAuth2
Server-side throttle
Constraints
Empirical results driven
From the database perspective, data size is the limitation
gurugamer game replays: 20k; daily replays: 1.7M; these figures can be used to estimate the total data size and forecast it for the foreseeable future
From the whole-system perspective, server-side throttling is the main limitation
Use different accounts
Use different IPs
Multi-thread
Be a good citizen, don’t bring down the server
The same node crawls several sites in parallel
The hard part of building a crawler is identifying and debugging bottlenecks: does the server side throttle? restrict by IP address? use encryption? etc.
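As a concrete example of being a good citizen, each node can rate-limit itself per target domain. A minimal single-process sketch; the one-request-per-second budget is an assumption, not a value from these notes:

```python
import time
from collections import defaultdict

MIN_DELAY_SECONDS = 1.0  # assumed per-domain politeness budget

last_hit: dict[str, float] = defaultdict(float)

def wait_for_slot(domain: str) -> None:
    """Block until it is polite to hit `domain` again."""
    elapsed = time.time() - last_hit[domain]
    if elapsed < MIN_DELAY_SECONDS:
        time.sleep(MIN_DELAY_SECONDS - elapsed)
    last_hit[domain] = time.time()
```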
Core Logic
The basic algorithm executed by any web crawler takes a list of whitelisted domains and seed URLs as its input and repeatedly executes the following steps (a minimal sketch follows the list).
pick a URL from the unvisited URL list.
async crawl the URL
parse document contents and store data, index the contents
look for new URLs within the whitelisted domains; add the new URLs to the list of unvisited URLs.
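A minimal single-threaded sketch of this loop (the real system runs async workers with durable queues, as in the architecture below); the `store` argument is a caller-supplied hook standing in for the store/index step:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, whitelisted_domains, store):
    unvisited = deque(seed_urls)
    visited = set()
    while unvisited:
        url = unvisited.popleft()          # 1. pick a URL from the unvisited list
        if url in visited:
            continue
        visited.add(url)
        html = urlopen(url).read().decode("utf-8", "replace")  # 2. crawl the URL
        store(url, html)                   # 3. parse/store/index via the supplied hook
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:       # 4. discover new whitelisted URLs
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc in whitelisted_domains and absolute not in visited:
                unvisited.append(absolute)
```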
Architecture
Solid lines in the diagram: real-time calls
Dashed lines: async processing
Scheduler
triggers a new crawl periodically or on demand
inserts seed URLs
Crawler
stateless crawler - why emphasize stateless? If a crawler fails, it can simply be restarted at any time; it does not depend on any state
async scalable worker
deep copy vs. shallow copy (define your own crawler according to your own needs)
Parser
Separate the parser from the crawler: separation of concerns. If parsing fails, there is no need to re-crawl the content that was already fetched.
async stateless worker
parse and validate the raw data
write the validated data to a clean store
Notification
notify the user of the current progress
notify about stuck processes, unexpected schemas, etc.
Sweeper
monitor the progress and re-trigger a crawl or parse if there is no response
generate a notification if an unknown schema or exception is encountered
Database Choice
Transactional / easier with rich SQL queries: MySQL / Postgres
Key-value / strict access patterns / flexible schema:
DynamoDB (the cumulative size of attributes per item must fit within the maximum DynamoDB item size of 400 KB)
S3 if the objects are big
SQL scaling requires some extra work: sharding.
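For illustration, a small sketch of routing a record to a shard with the hash(URL, iteration) key used in the data model below; the shard count and hashing scheme are assumptions:

```python
import hashlib

NUM_SHARDS = 64  # hypothetical fixed shard count

def shard_for(url: str, iteration: int) -> int:
    # Python's built-in hash() is process-seeded, so use a
    # deterministic digest for a stable shard assignment.
    key = f"{url}|{iteration}".encode()
    return int(hashlib.sha1(key).hexdigest(), 16) % NUM_SHARDS
```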
Raw Data:
key-value type, read / write
Amazon S3 / DynamoDB / Cassandra / MongoDB
Metadata / clean data:
SQL type
requires durability and transactions
In real production, unless the data volume is extremely large, prefer SQL when SQL can do the job: 1. SQL databases have had a huge amount of optimization work done on them; 2. NoSQL effectively hands part of the DB's work over to the customer. Using SQL improves development efficiency.
API
addScheduledCrawl(domain_names, cron_expression=null)
adds a request to the scheduler to start the crawler
startCrawling(domain_names, iteration, blacklisted_URLs, type {on_demand, schedule})
internal API to execute/trigger crawling
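A sketch of these two calls in Python; the in-memory schedule store, the seed-URL convention, and the returned command dicts are stand-ins for the real scheduler and worker queue:

```python
import uuid
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScheduledCrawl:
    schedule_id: str
    domain_names: list[str]
    cron_expression: Optional[str]  # None means a one-off, on-demand crawl
    state: str = "ON"               # ON / PAUSED / OFF, as in the scheduling table

SCHEDULES: dict[str, ScheduledCrawl] = {}  # in-memory stand-in for the scheduler store

def add_scheduled_crawl(domain_names: list[str],
                        cron_expression: Optional[str] = None) -> str:
    """Public API: register a crawl with the scheduler; returns a schedule_id."""
    schedule_id = str(uuid.uuid4())
    SCHEDULES[schedule_id] = ScheduledCrawl(schedule_id, domain_names, cron_expression)
    return schedule_id

def start_crawling(domain_names: list[str], iteration: int,
                   blacklisted_urls: list[str],
                   type: str = "on_demand") -> list[dict]:
    """Internal API: build the seed-URL commands for one crawl iteration.

    In the real system these would be enqueued for the crawler workers;
    here they are just returned. type is one of {"on_demand", "schedule"}.
    """
    commands = []
    for domain in domain_names:
        seed = f"https://{domain}/"   # assumed seed-URL convention
        if seed not in blacklisted_urls:
            commands.append({"url": seed, "iteration": iteration, "type": type})
    return commands
```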
Data Model
Raw Data Table (S3)
| id   | data      | created_time |
|------|-----------|--------------|
| uuid | HTML/JSON | 45027824     |
Clean Data (SQL)
stores the crawled data for different schemas
| Hash(URL, iteration) (shard key) | type        | URL         | iteration | field_1   |
|----------------------------------|-------------|-------------|-----------|-----------|
| string                           | Game_replay | steam.com/1 | 10        | Hero:Riki |

| Hash(URL, iteration) (shard key) | type | URL       | iteration | headline | main_text   |
|----------------------------------|------|-----------|-----------|----------|-------------|
| string                           | News | cnn.com/1 | 2         | Uber     | What's new? |
Iteration Progress (SQL)
primary key: schedule_id + iteration
| schedule_id | iteration | start_time | urls_found | stuck | crawled | processed |
|-------------|-----------|------------|------------|-------|---------|-----------|
| uuid        | 9         | 25469836   | 1000       | 10    | 9000    | 8600      |
Record Metadata
shard key: id = hash(url, iteration)
| id  | url       | iteration | state  | creation_time | last_update | notified | deleted |
|-----|-----------|-----------|--------|---------------|-------------|----------|---------|
| str | games.net | 9         | QUEUED | 32654965      | 10          | No       | Yes     |
STATE = {QUEUED, CRAWLED, PROCESSED, STUCK}
Metadata Scheduling Table (SQL)
| id   | cron_pattern | user | creation | name | URLs              | state             |
|------|--------------|------|----------|------|-------------------|-------------------|
| uuid | 0 8 *        | yang | 24586972 | news | link to YAML file | {ON, PAUSED, OFF} |
Fault Tolerance
Scheduler (Apache Airflow)
Heavy-load processing is usually stateless: send & forget
Crawler: stateless
Read the URL that needs to be crawled; if already crawled, no-op
Crawl the URL; store the data into the Raw Data table
Update the record state to CRAWLED
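A minimal sketch of this idempotent worker, with in-memory dicts standing in for the metadata and Raw Data tables and the fetch function supplied by the caller:

```python
# In-memory stand-ins for the metadata and raw-data stores; the real
# system would use the tables described in the data model above.
records: dict[tuple[str, int], dict] = {}   # (url, iteration) -> record
raw_data: dict[tuple[str, int], str] = {}

def crawl_worker(url: str, iteration: int, fetch) -> None:
    """Idempotent crawl step: safe to re-run after any failure."""
    record = records.setdefault((url, iteration), {"state": "QUEUED"})
    if record["state"] != "QUEUED":
        return                               # already crawled (or further along): no-op
    raw_data[(url, iteration)] = fetch(url)  # store into the Raw Data table
    record["state"] = "CRAWLED"
```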
Parser: stateless
Read the record that needs to be parsed; if already processed / stuck, no-op
Parse and validate the record:
If invalid, update the record state to STUCK and return
Store the validated record into the clean data store
Update the record state to PROCESSED
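The parser follows the same idempotent pattern. A sketch reusing the records and raw_data stand-ins from the crawler sketch above, with the parse/validate function passed in by the caller:

```python
clean_data: dict[tuple[str, int], dict] = {}  # stand-in for the Clean Data table

def parse_worker(url: str, iteration: int, parse_and_validate) -> None:
    """Idempotent parse step, mirroring the crawler's pattern."""
    record = records.get((url, iteration))
    if record is None or record["state"] in ("PROCESSED", "STUCK"):
        return                                    # nothing to do: no-op
    parsed = parse_and_validate(raw_data[(url, iteration)])
    if parsed is None:                            # invalid data or unknown schema
        record["state"] = "STUCK"
        return
    clean_data[(url, iteration)] = parsed         # write to the clean store
    record["state"] = "PROCESSED"
```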
Sweeper: periodically wake up
Read records in STUCK state and not yet notified; notify developers to take action
Mark the record as notified
Read records that have sat in QUEUED state for a long period of time; re-queue a crawl command
Read records that have sat in CRAWLED state for a long period of time; re-queue a parse command
Update the iteration progress records:
Send summary notifications to users
May need to consolidate if records cross shards
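A sketch of one sweeper pass, reusing the in-memory records from the worker sketches above; the staleness threshold, the queue lists, and the print-based notification are assumptions:

```python
crawl_commands: list[tuple[str, int]] = []   # stand-in for the crawl queue
parse_commands: list[tuple[str, int]] = []   # stand-in for the parse queue
STALE_AFTER = 3600.0                         # assumed threshold (seconds) for "no response"

def sweep(now: float) -> None:
    """One sweeper pass over the records."""
    for key, rec in records.items():
        age = now - rec.get("last_update", now)
        if rec["state"] == "STUCK" and not rec.get("notified"):
            print(f"notify developers: {key} is stuck")   # stand-in for notification
            rec["notified"] = True
        elif rec["state"] == "QUEUED" and age > STALE_AFTER:
            crawl_commands.append(key)                    # re-queue a crawl
        elif rec["state"] == "CRAWLED" and age > STALE_AFTER:
            parse_commands.append(key)                    # re-queue a parse
```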
The components are all infinitely (horizontally) scalable.
Q & A
Q: If the target site has throttling controls, do we need to guarantee that the same scheduled job is not picked up by different workers at the same time? Otherwise, even though the stored results would still be fine, we could easily trigger the site's throttling.
A: With throttling, you definitely need to spread the crawling across nodes.
Q: If there are many crawlers and many jobs need to be triggered per unit of time, could the scheduler become a bottleneck?
A: Yes; a distributed scheduler may be needed.