Designing a Web Crawler/Archive (scott)

Ref: https://www.youtube.com/watch?v=awPOdsnPL-8

  • 10 mins: Business (Use Cases) / common sense (this is what separates novices from veterans; most of your later growth as an engineer happens here)

  • 10 mins: Constraints

  • 10 mins: Architecture

  • 10 mins: API Design

  • 10 mins: DB choice

  • 10 mins: Data model

The benefit of timeboxing the interview into these parts: if you get stuck on one part, you can jump to the next.

Requirements

  • crawl Amazon/eBay/etc.; the user specifies a domain name to crawl

  • automatically discover new URLs

  • periodically crawl: daily

  • smart schema detection (adapts to the different kinds of content being crawled)

  • extract specific fields such as product description, price, etc.

Goal

Main Goal

  • Highly scalable / flexible in scale

  • Fault-tolerant

    • idempotent - repeating the work has no side effects, and every run should produce the same result. So if the current crawler fails, simply re-running it once is enough.

  • Flexible / extensible schema

Not in Discussion

  • Server-side OAuth2

  • Server-side throttle

Constraints

  • Driven by empirical results

  • From the database perspective, data size is the limitation

    • gurugamer game replays: 20k; daily replays: 1.7M. These figures can be used to estimate the total data size and forecast the foreseeable future

  • From the whole-system perspective, server-side throttling is the main limitation

    • Use different accounts

    • Use different IPs

    • Multi-thread

    • Be a good citizen, don’t bring down the server

    • The same node crawls several sites in parallel

The hard part of a crawler is mostly identifying and debugging the bottleneck. Does the server side throttle? Restrict by IP address? Use encryption? etc.
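
One way to "be a good citizen" is to rate-limit requests per host. The sketch below is a minimal illustration assuming a fixed minimum delay per domain; the class name `PoliteFetcher` and the `min_delay` parameter are made up for this example.

```python
import time
import threading
from urllib.parse import urlparse

class PoliteFetcher:
    """Illustrative per-domain rate limiter: never hit the same host more
    often than once every `min_delay` seconds (be a good citizen)."""

    def __init__(self, min_delay: float = 1.0):
        self.min_delay = min_delay
        self.last_hit = {}          # host -> timestamp of the last request
        self.lock = threading.Lock()

    def wait_turn(self, url: str) -> None:
        host = urlparse(url).netloc
        with self.lock:
            elapsed = time.time() - self.last_hit.get(host, 0.0)
            sleep_for = max(0.0, self.min_delay - elapsed)
            self.last_hit[host] = time.time() + sleep_for
        if sleep_for > 0:
            time.sleep(sleep_for)   # back off before the next hit to this host
```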

Core Logic

The basic algorithm executed by any web crawler is to take a list of whitelisted domains and seed URLs as its input and repeatedly execute the following steps (a minimal code sketch follows the list).

  • pick a URL from the unvisited URL list

  • asynchronously crawl the URL

  • parse the document contents, store the data, and index the contents

  • look for new URLs within the whitelisted domains; add them to the list of unvisited URLs
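
A minimal, single-process sketch of that loop. The `fetch`, `extract_links`, and `store` callables are placeholders for a real HTTP client, HTML parser, and storage layer; the real system would run the crawl step asynchronously.

```python
from urllib.parse import urlparse

def crawl(seed_urls, whitelisted_domains, fetch, extract_links, store):
    """Minimal BFS-style crawl loop: visit each unvisited URL once,
    store its contents, and enqueue newly discovered whitelisted URLs."""
    unvisited = list(seed_urls)
    visited = set()
    while unvisited:
        url = unvisited.pop(0)          # pick a URL from the unvisited list
        if url in visited:
            continue
        visited.add(url)
        document = fetch(url)           # crawl the URL (async in the real system)
        store(url, document)            # store / index the raw contents
        for link in extract_links(document):
            if urlparse(link).netloc in whitelisted_domains and link not in visited:
                unvisited.append(link)  # add newly discovered whitelisted URLs
```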

Architecture

  • Solid lines: real-time

  • Dashed lines: async processing

Scheduler

  • periodically or on demand, trigger a new crawl;

  • insert seed URLs

Crawler

  • stateless crawler - why emphasize stateless? If the crawler fails, it can simply be restarted at any time; it does not need to depend on any state.

  • async scalable worker

  • deep copy vs. shallow copy (define your own crawler according to your needs)

Parser

Separate the parser from the crawler: separation of concerns. If parsing fails, there is no need to re-crawl the content that was already fetched.

  • async stateless worker

  • parse and validate the raw data

  • write the validated data to a clean store

Notification

  • notify the user of the current progress

  • notify about stuck processes, unexpected schemas, etc.

Sweeper

  • monitor progress and re-trigger a crawl or parse if there is no response

  • generate a notification if an unknown schema or an exception is encountered.

Database Choice

  • Transactional / easier with rich SQL queries: MySQL / PostgreSQL

  • Key-value / strict access patterns / flexible schema:

    • DynamoDB (the cumulative size of attributes per item must fit within the maximum DynamoDB item size of 400 KB)

    • S3 for big objects

  • SQL scaling requires some extra work: sharding.

  • Raw Data:

    • key-value type, read / write

    • Amazon S3 / DynamoDB / Cassandra / MongoDB

  • Metadata / clean data:

    • SQL type

    • requires durability and transactions

In real production systems, unless the data volume is extremely large, prefer SQL when SQL can do the job: 1. SQL databases already do a great deal of optimization; 2. NoSQL effectively hands part of the database's work back to the customer. Using SQL improves development efficiency.
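
As a rough illustration of the raw-data choice above (DynamoDB for small items, S3 for big objects), the sketch below routes each payload by size against the 400 KB item limit. The table and bucket names are hypothetical, and the size check is a simplification that ignores attribute-name overhead.

```python
import boto3

DYNAMO_ITEM_LIMIT = 400 * 1024   # DynamoDB caps a single item at 400 KB

# Hypothetical resource names, for illustration only.
raw_table = boto3.resource("dynamodb").Table("raw_crawl_data")
s3 = boto3.client("s3")

def store_raw(record_id: str, html: str) -> str:
    """Store one raw crawl result: small payloads go into DynamoDB,
    anything near the 400 KB item limit goes to S3 instead."""
    body = html.encode("utf-8")
    if len(body) < DYNAMO_ITEM_LIMIT:   # simplification: ignores attribute-name overhead
        raw_table.put_item(Item={"id": record_id, "data": html})
        return "dynamodb"
    s3.put_object(Bucket="raw-crawl-data", Key=record_id, Body=body)
    return "s3"
```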

API

addScheduledCrawl(domain_names, cron_expression=null)

  • add a request to the scheduler to start the crawler

startCrawling(domain_names, iteration, blacklisted_URLs, type {on_demand, schedule})

  • internal API to execute/trigger crawling
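
A sketch of the two calls as Python signatures. The return type of addScheduledCrawl, the CrawlType enum, and the default values are assumptions for illustration only.

```python
from enum import Enum
from typing import Optional

class CrawlType(Enum):
    ON_DEMAND = "on_demand"
    SCHEDULE = "schedule"

def addScheduledCrawl(domain_names: list[str],
                      cron_expression: Optional[str] = None) -> str:
    """Public API: register a crawl with the scheduler and return a schedule_id.
    A null cron expression could be treated as a one-off, on-demand crawl."""
    ...

def startCrawling(domain_names: list[str],
                  iteration: int,
                  blacklisted_URLs: list[str],
                  type: CrawlType = CrawlType.SCHEDULE) -> None:
    """Internal API: the scheduler calls this to execute/trigger one crawl iteration."""
    ...
```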

Data Model

Raw Data Table (S3)

id   | data      | CreatedTime
uuid | HTML/JSON | 45027824

Clean Data (SQL)

  • stores the crawled data for the different schemas

Hash(URL, iteration) (shard key) | type        | URL         | iteration | field_1
string                           | Game_replay | steam.com/1 | 10        | Hero:Riki

Hash(URL, iteration) (shard key) | type | URL       | iteration | headline | main_text
string                           | News | cnn.com/1 | 2         | Uber     | What's new?

Iteration Progress (SQL)

  • primary key: schedule_id + iteration

schedule_id | iteration | start_time | urls_found | stuck | crawled | processed
uuid        | 9         | 25469836   | 1000       | 10    | 9000    | 8600

Record Metadata

  • shard key: id = hash(url, iteration)

id  | url       | iteration | state  | creation_time | last_update | notified | deleted
str | games.net | 9         | QUEUED | 32654965      | 10          | No       | Yes

  • STATE = {QUEUED, CRAWLED, PROCESSED, STUCK}
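
The record metadata row and its state machine could be modeled roughly like this (field names follow the table above; the class and enum are illustrative, not a real schema definition):

```python
from dataclasses import dataclass
from enum import Enum

class State(Enum):
    QUEUED = "QUEUED"        # seed inserted, waiting for a crawler
    CRAWLED = "CRAWLED"      # raw data stored, waiting for a parser
    PROCESSED = "PROCESSED"  # clean data stored
    STUCK = "STUCK"          # validation failed / unknown schema

@dataclass
class RecordMetadata:
    id: str                  # shard key: hash(url, iteration)
    url: str
    iteration: int
    state: State
    creation_time: int
    last_update: int
    notified: bool = False
    deleted: bool = False
```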

Metadata Scheduling Table (SQL)

id   | Cron_Pattern | user | creation | name | URLs              | state
uuid | 0 8 *        | yang | 24586972 | news | Link to YAML file | {ON, PAUSED, OFF}

Fault Tolerance

Scheduler (Apache Airflow)

1. Read YAML: what to crawl
2. Insert seed URLs into the record metadata table; no-op if they already exist
    a. Send-and-forget to the crawler
3. Insert a progress row into the iteration progress table; no-op if it already exists
4. On a fault, go back to step 1.

Heavy-load processing is usually stateless, send & forget; a sketch of the scheduler pass follows.
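
Below is a sketch of one idempotent scheduler pass following the steps above. The `db` and `queue` objects and their methods are hypothetical, not a real library API.

```python
def run_scheduler(db, queue, schedule_id, iteration):
    """One idempotent scheduler pass: every step is a no-op if it already ran,
    so after any fault the whole function can simply be retried from the top.
    `db` and `queue` are illustrative objects, not a real library API."""
    config = db.load_yaml_config(schedule_id)           # 1. read YAML: what to crawl
    for url in config["seed_urls"]:
        if not db.record_exists(url, iteration):        # 2. no-op if the seed exists
            db.insert_record(url, iteration, state="QUEUED")
            queue.send_crawl(url, iteration)            # 2a. send-and-forget to a crawler
    if not db.progress_exists(schedule_id, iteration):  # 3. no-op if the row exists
        db.insert_progress(schedule_id, iteration)
    # 4. on a fault, the caller simply re-runs run_scheduler(...) from step 1
```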

Crawler: stateless

  • Read the URL that needs to be crawled; if already crawled, no-op

  • Crawl the URL; store the data into the Raw Data table

  • Update the record to the CRAWLED state
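
A sketch of the stateless crawler step; every dependency is passed in, and all names (`db`, `raw_store`, `fetch`) are illustrative.

```python
def crawl_worker(db, raw_store, fetch, record_id):
    """Stateless crawler step; safe to re-run at any time because every write
    is guarded by the record's state. All object names are illustrative."""
    record = db.get_record(record_id)
    if record["state"] != "QUEUED":      # already crawled (or further along): no-op
        return
    html = fetch(record["url"])          # crawl the URL
    raw_store.put(record_id, html)       # write into the Raw Data table (S3 / DynamoDB)
    db.set_state(record_id, "CRAWLED")   # QUEUED -> CRAWLED
```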

Parser: stateless

  • Read the record that needs to be parsed; if already processed / stuck, no-op

  • Parse and validate the record:

    • If invalid, update the record to STUCK and return

  • Store the validated record in the Clean Data table

  • Update the record state to PROCESSED
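
A matching sketch of the stateless parser step; again, the injected objects and helper names are illustrative.

```python
def parse_worker(db, raw_store, clean_store, parse, validate, record_id):
    """Stateless parser step, idempotent like the crawler.
    The injected objects and functions are illustrative."""
    record = db.get_record(record_id)
    if record["state"] in ("PROCESSED", "STUCK"):  # already handled: no-op
        return
    raw = raw_store.get(record_id)
    parsed = parse(raw)                            # schema-specific extraction
    if not validate(parsed):
        db.set_state(record_id, "STUCK")           # the sweeper will notify a developer
        return
    clean_store.put(record_id, parsed)             # write into the Clean Data table
    db.set_state(record_id, "PROCESSED")           # CRAWLED -> PROCESSED
```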

Sweeper: periodically wake up

  • Read records in the STUCK state that have not yet been notified

    • Notify developers to take action

    • Mark the record as notified

  • Read records that have been QUEUED for a long period of time

    • Re-queue a crawl command

  • Read records that have been in the CRAWLED state for a long period of time

    • Re-queue a parse command

  • Update the iteration progress records:

    • Send a summary notification to users

    • May need to consolidate across shards
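
A sketch of one sweeper pass mirroring the bullets above; `db`, `queue`, and `notify` are hypothetical stand-ins for the metadata store, the work queue, and the notification service.

```python
def sweep(db, queue, notify, stale_after_seconds):
    """One periodic sweeper pass; the queries mirror the bullets above.
    `db`, `queue`, and `notify` are hypothetical stand-ins."""
    for record in db.find_records(state="STUCK", notified=False):
        notify(record)                                        # unknown schema / exception
        db.mark_notified(record["id"])
    for record in db.find_stale(state="QUEUED", older_than=stale_after_seconds):
        queue.send_crawl(record["url"], record["iteration"])  # re-queue a crawl command
    for record in db.find_stale(state="CRAWLED", older_than=stale_after_seconds):
        queue.send_parse(record["id"])                        # re-queue a parse command
    for progress in db.refresh_iteration_progress():          # may consolidate across shards
        notify(progress)                                      # summary notification to users
```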

All of the components can be scaled out horizontally.

Q & A

Q: If the target site has throttling, do we need to guarantee that the same scheduled job is not picked up by different workers at the same time? Otherwise, even though the stored results would still be correct, we could easily trigger the site's throttling.

A: With throttling, you definitely need to spread the crawling across different nodes.

Q: If there are many crawlers and many jobs need to be triggered per unit of time, could the scheduler become a bottleneck?

A: Yes; a distributed scheduler may be needed.
