Designing a Web Crawler/Archive (scott)

Ref: https://www.youtube.com/watch?v=awPOdsnPL-8

  • 10 mins: Business (Use Cases) / common sense (this is what separates novices from veterans; most of your later growth as an engineer happens here)

  • 10 mins: Constraints

  • 10 mins: Architecture

  • 10 mins: API Design

  • 10 mins: DB choice

  • 10 mins: Data model

The benefit of timeboxing the interview into these parts: if you get stuck on one part, you can jump to the next.

Requirements

  • crawl Amazon/eBay/etc.; the user specifies a domain name to crawl

  • automatically discover new URLs

  • periodically crawl: daily

  • smart schema detection (adapts to the different kinds of content being crawled)

  • extract specific fields such as product description, price, etc.

Goal

Main Goal

  • Highly scalable / flexible in scale

  • Fault-tolerant

    • idempotent - repeating the work has no side effects, and every run should produce the same result. So if the current crawler fails, simply re-running it once is enough.

  • Flexible / extensible schema

Not in Discussion

  • Server-side OAuth2

  • Server-side throttle

Constraints

  • Driven by empirical results

  • From the database perspective, data size is the limitation

    • gurugamer game replays: 20k; daily replays: 1.7M. These figures can be used to estimate the total data size and forecast the foreseeable future

  • From the whole-system perspective, server-side throttling is the main limitation

    • Use different accounts

    • Use different IPs

    • Multi-thread

    • Be a good citizen, don’t bring down the server

    • The same node crawls several sites in parallel

The hard part of a crawler is mostly identifying and debugging the bottleneck. Does the server side throttle? Restrict by IP address? Use encryption? etc.
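
One way to "be a good citizen" is to rate-limit requests per host. The sketch below is a minimal illustration assuming a fixed minimum delay per domain; the class name `PoliteFetcher` and the `min_delay` parameter are made up for this example.

```python
import time
import threading
from urllib.parse import urlparse

class PoliteFetcher:
    """Illustrative per-domain rate limiter: never hit the same host more
    often than once every `min_delay` seconds (be a good citizen)."""

    def __init__(self, min_delay: float = 1.0):
        self.min_delay = min_delay
        self.last_hit = {}          # host -> timestamp of the last request
        self.lock = threading.Lock()

    def wait_turn(self, url: str) -> None:
        host = urlparse(url).netloc
        with self.lock:
            elapsed = time.time() - self.last_hit.get(host, 0.0)
            sleep_for = max(0.0, self.min_delay - elapsed)
            self.last_hit[host] = time.time() + sleep_for
        if sleep_for > 0:
            time.sleep(sleep_for)   # back off before the next hit to this host
```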

Core Logic

The basic algorithm executed by any web crawler is to take a list of whitelisted domains and seed URLs as its input and repeatedly execute the following steps (a minimal code sketch follows the list).

  • pick a URL from the unvisited URL list

  • asynchronously crawl the URL

  • parse the document contents, store the data, and index the contents

  • look for new URLs within the whitelisted domains; add them to the list of unvisited URLs
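
A minimal, single-process sketch of that loop. The `fetch`, `extract_links`, and `store` callables are placeholders for a real HTTP client, HTML parser, and storage layer; the real system would run the crawl step asynchronously.

```python
from urllib.parse import urlparse

def crawl(seed_urls, whitelisted_domains, fetch, extract_links, store):
    """Minimal BFS-style crawl loop: visit each unvisited URL once,
    store its contents, and enqueue newly discovered whitelisted URLs."""
    unvisited = list(seed_urls)
    visited = set()
    while unvisited:
        url = unvisited.pop(0)          # pick a URL from the unvisited list
        if url in visited:
            continue
        visited.add(url)
        document = fetch(url)           # crawl the URL (async in the real system)
        store(url, document)            # store / index the raw contents
        for link in extract_links(document):
            if urlparse(link).netloc in whitelisted_domains and link not in visited:
                unvisited.append(link)  # add newly discovered whitelisted URLs
```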

Architecture

  • Solid lines: real-time

  • Dashed lines: async processing

Scheduler

  • periodically or on demand, trigger a new crawl;

  • insert seed URLs

Crawler

  • stateless crawler - why emphasize stateless? If the crawler fails, it can simply be restarted at any time; it does not need to depend on any state.

  • async scalable worker

  • deep copy vs. shallow copy (define your own crawler according to your needs)

Parser

Separate the parser from the crawler: separation of concerns. If parsing fails, there is no need to re-crawl the content that was already fetched.

  • async stateless worker

  • parse and validate the raw data

  • write the validated data to a clean store

Notification

  • notify the user of the current progress

  • notify about stuck processes, unexpected schemas, etc.

Sweeper

  • monitor progress and re-trigger a crawl or parse if there is no response

  • generate a notification if an unknown schema or an exception is encountered.

Database Choice

  • Transactional / easier with rich SQL queries: MySQL / PostgreSQL

  • Key-value / strict access patterns / flexible schema:

    • DynamoDB (the cumulative size of attributes per item must fit within the maximum DynamoDB item size of 400 KB)

    • S3 for big objects

  • SQL scaling requires some extra work: sharding.

  • Raw Data:

    • key-value type, read / write

    • Amazon S3 / DynamoDB / Cassandra / MongoDB

  • Metadata / clean data:

    • SQL type

    • requires durability and transactions

In real production systems, unless the data volume is extremely large, prefer SQL when SQL can do the job: 1. SQL databases already do a great deal of optimization; 2. NoSQL effectively hands part of the database's work back to the customer. Using SQL improves development efficiency.
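
As a rough illustration of the raw-data choice above (DynamoDB for small items, S3 for big objects), the sketch below routes each payload by size against the 400 KB item limit. The table and bucket names are hypothetical, and the size check is a simplification that ignores attribute-name overhead.

```python
import boto3

DYNAMO_ITEM_LIMIT = 400 * 1024   # DynamoDB caps a single item at 400 KB

# Hypothetical resource names, for illustration only.
raw_table = boto3.resource("dynamodb").Table("raw_crawl_data")
s3 = boto3.client("s3")

def store_raw(record_id: str, html: str) -> str:
    """Store one raw crawl result: small payloads go into DynamoDB,
    anything near the 400 KB item limit goes to S3 instead."""
    body = html.encode("utf-8")
    if len(body) < DYNAMO_ITEM_LIMIT:   # simplification: ignores attribute-name overhead
        raw_table.put_item(Item={"id": record_id, "data": html})
        return "dynamodb"
    s3.put_object(Bucket="raw-crawl-data", Key=record_id, Body=body)
    return "s3"
```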

API

addScheduledCrawl(domain_names, cron_expression=null)

  • add a request to the scheduler to start the crawler

startCrawling(domain_names, iteration, blacklisted_URLs, type {on_demand, schedule})

  • internal API to execute/trigger crawling
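
A sketch of the two calls as Python signatures. The return type of addScheduledCrawl, the CrawlType enum, and the default values are assumptions for illustration only.

```python
from enum import Enum
from typing import Optional

class CrawlType(Enum):
    ON_DEMAND = "on_demand"
    SCHEDULE = "schedule"

def addScheduledCrawl(domain_names: list[str],
                      cron_expression: Optional[str] = None) -> str:
    """Public API: register a crawl with the scheduler and return a schedule_id.
    A null cron expression could be treated as a one-off, on-demand crawl."""
    ...

def startCrawling(domain_names: list[str],
                  iteration: int,
                  blacklisted_URLs: list[str],
                  type: CrawlType = CrawlType.SCHEDULE) -> None:
    """Internal API: the scheduler calls this to execute/trigger one crawl iteration."""
    ...
```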

Data Model

Raw Data Table (S3)

id   | data      | CreatedTime
uuid | HTML/JSON | 45027824

Clean Data (SQL)

  • stores the crawled data for the different schemas

Hash(URL, iteration) (shard key) | type        | URL         | iteration | field_1
string                           | Game_replay | steam.com/1 | 10        | Hero:Riki

Hash(URL, iteration) (shard key) | type | URL       | iteration | headline | main_text
string                           | News | cnn.com/1 | 2         | Uber     | What's new?

Iteration Progress (SQL)

  • primary key: schedule_id + iteration

schedule_id | iteration | start_time | urls_found | stuck | crawled | processed
uuid        | 9         | 25469836   | 1000       | 10    | 9000    | 8600

Record Metadata

  • shard key: id = hash(url, iteration)

id  | url       | iteration | state  | creation_time | last_update | notified | deleted
str | games.net | 9         | QUEUED | 32654965      | 10          | No       | Yes

  • STATE = {QUEUED, CRAWLED, PROCESSED, STUCK}
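
The record metadata row and its state machine could be modeled roughly like this (field names follow the table above; the class and enum are illustrative, not a real schema definition):

```python
from dataclasses import dataclass
from enum import Enum

class State(Enum):
    QUEUED = "QUEUED"        # seed inserted, waiting for a crawler
    CRAWLED = "CRAWLED"      # raw data stored, waiting for a parser
    PROCESSED = "PROCESSED"  # clean data stored
    STUCK = "STUCK"          # validation failed / unknown schema

@dataclass
class RecordMetadata:
    id: str                  # shard key: hash(url, iteration)
    url: str
    iteration: int
    state: State
    creation_time: int
    last_update: int
    notified: bool = False
    deleted: bool = False
```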

Metadata Scheduling Table (SQL)

id   | Cron_Pattern | user | creation | name | URLs              | state
uuid | 0 8 *        | yang | 24586972 | news | Link to YAML file | {ON, PAUSED, OFF}

Fault Tolerance

Scheduler (Apache Airflow)

1. Read YAML: what to crawl
2. Insert seed URLs into the record metadata table; no-op if they already exist
    a. Send-and-forget to the crawler
3. Insert a progress row into the iteration progress table; no-op if it already exists
4. On a fault, go back to step 1.

Heavy-load processing is usually stateless, send & forget; a sketch of the scheduler pass follows.
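
Below is a sketch of one idempotent scheduler pass following the steps above. The `db` and `queue` objects and their methods are hypothetical, not a real library API.

```python
def run_scheduler(db, queue, schedule_id, iteration):
    """One idempotent scheduler pass: every step is a no-op if it already ran,
    so after any fault the whole function can simply be retried from the top.
    `db` and `queue` are illustrative objects, not a real library API."""
    config = db.load_yaml_config(schedule_id)           # 1. read YAML: what to crawl
    for url in config["seed_urls"]:
        if not db.record_exists(url, iteration):        # 2. no-op if the seed exists
            db.insert_record(url, iteration, state="QUEUED")
            queue.send_crawl(url, iteration)            # 2a. send-and-forget to a crawler
    if not db.progress_exists(schedule_id, iteration):  # 3. no-op if the row exists
        db.insert_progress(schedule_id, iteration)
    # 4. on a fault, the caller simply re-runs run_scheduler(...) from step 1
```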

Crawler: stateless

  • Read the URL that needs to be crawled; if already crawled, no-op

  • Crawl the URL; store the data into the Raw Data table

  • Update the record to the CRAWLED state
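
A sketch of the stateless crawler step; every dependency is passed in, and all names (`db`, `raw_store`, `fetch`) are illustrative.

```python
def crawl_worker(db, raw_store, fetch, record_id):
    """Stateless crawler step; safe to re-run at any time because every write
    is guarded by the record's state. All object names are illustrative."""
    record = db.get_record(record_id)
    if record["state"] != "QUEUED":      # already crawled (or further along): no-op
        return
    html = fetch(record["url"])          # crawl the URL
    raw_store.put(record_id, html)       # write into the Raw Data table (S3 / DynamoDB)
    db.set_state(record_id, "CRAWLED")   # QUEUED -> CRAWLED
```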

Parser: stateless

  • Read the record that needs to be parsed; if already processed / stuck, no-op

  • Parse and validate the record:

    • If invalid, update the record to STUCK and return

  • Store the validated record in the Clean Data table

  • Update the record state to PROCESSED
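
A matching sketch of the stateless parser step; again, the injected objects and helper names are illustrative.

```python
def parse_worker(db, raw_store, clean_store, parse, validate, record_id):
    """Stateless parser step, idempotent like the crawler.
    The injected objects and functions are illustrative."""
    record = db.get_record(record_id)
    if record["state"] in ("PROCESSED", "STUCK"):  # already handled: no-op
        return
    raw = raw_store.get(record_id)
    parsed = parse(raw)                            # schema-specific extraction
    if not validate(parsed):
        db.set_state(record_id, "STUCK")           # the sweeper will notify a developer
        return
    clean_store.put(record_id, parsed)             # write into the Clean Data table
    db.set_state(record_id, "PROCESSED")           # CRAWLED -> PROCESSED
```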

Sweeper: periodically wake up

  • Read records in the STUCK state that have not yet been notified

    • Notify developers to take action

    • Mark the record as notified

  • Read records that have been QUEUED for a long period of time

    • Re-queue a crawl command

  • Read records that have been in the CRAWLED state for a long period of time

    • Re-queue a parse command

  • Update the iteration progress records:

    • Send a summary notification to users

    • May need to consolidate across shards
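
A sketch of one sweeper pass mirroring the bullets above; `db`, `queue`, and `notify` are hypothetical stand-ins for the metadata store, the work queue, and the notification service.

```python
def sweep(db, queue, notify, stale_after_seconds):
    """One periodic sweeper pass; the queries mirror the bullets above.
    `db`, `queue`, and `notify` are hypothetical stand-ins."""
    for record in db.find_records(state="STUCK", notified=False):
        notify(record)                                        # unknown schema / exception
        db.mark_notified(record["id"])
    for record in db.find_stale(state="QUEUED", older_than=stale_after_seconds):
        queue.send_crawl(record["url"], record["iteration"])  # re-queue a crawl command
    for record in db.find_stale(state="CRAWLED", older_than=stale_after_seconds):
        queue.send_parse(record["id"])                        # re-queue a parse command
    for progress in db.refresh_iteration_progress():          # may consolidate across shards
        notify(progress)                                      # summary notification to users
```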

All of the components can be scaled out horizontally.

Q & A

Q: If the target site has throttling, do we need to guarantee that the same scheduled job is not picked up by different workers at the same time? Otherwise, even though the stored results would still be correct, we could easily trigger the site's throttling.

A: With throttling, you definitely need to spread the crawling across different nodes.

Q: If there are many crawlers and many jobs need to be triggered per unit of time, could the scheduler become a bottleneck?

A: Yes; a distributed scheduler may be needed.
