# Designing a Web Crawler/Archive (scott)

> Ref: <https://www.youtube.com/watch?v=awPOdsnPL-8>

* 10 mins: **Business (Use Cases)/common sense** (this is what separates beginners from veterans; most of your future improvement comes from this part)
* 10 mins: Constraints
* 10 mins: Architecture
* 10 mins: API Design
* 10 mins: DB choice
* 10 mins: Data model

> The benefit of timeboxing the interview into these parts: if you get stuck on one part, you can jump to the next.

## Requirements

* crawl Amazon/eBay/etc.; the user specifies a domain name to crawl
* automatically discover new URLs
* periodically crawl: daily
* **smart schema detection** (adapts to the different kinds of content being crawled)
* extract certain information like product description/price etc

## Goal

### Main Goal

* Highly scalable / flexible in scale
* Fault-tolerant
  * idempotent: repeating an operation has no side effects and every run should produce the same result, so if the current crawler fails, simply re-running it once is enough.
* Schema flexible / extensible

### Not in Discussion

* Server-side OAuth2
* Server-side throttle

## Constraints

* Driven by empirical results
* From the database perspective, data size is the main limitation
  * e.g. gurugamer game replay: 20k; daily replays: 1.7M; such figures can be used to estimate the total data size and forecast growth for the foreseeable future
* From the whole-system perspective, server-side throttling is the main limitation
  * Use different accounts
  * Use different IPs
  * Multi-thread
  * Be a good citizen, don't bring down the server (a client-side politeness sketch follows this list)
  * The same node crawls several sites in parallel
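
A minimal sketch of that client-side politeness, assuming we simply space out requests per domain; the class name and delay value are illustrative, not part of the original design:

```python
import time
from collections import defaultdict
from threading import Lock
from urllib.parse import urlparse


class DomainRateLimiter:
    """Client-side politeness: at most one request per `delay_seconds` per domain."""

    def __init__(self, delay_seconds: float = 1.0):
        self.delay = delay_seconds
        self.next_allowed = defaultdict(float)  # domain -> earliest allowed timestamp
        self.lock = Lock()

    def wait(self, url: str) -> None:
        domain = urlparse(url).netloc
        with self.lock:
            now = time.monotonic()
            sleep_for = max(0.0, self.next_allowed[domain] - now)
            # reserve the next slot so concurrent threads space themselves out
            self.next_allowed[domain] = now + sleep_for + self.delay
        if sleep_for > 0:
            time.sleep(sleep_for)


limiter = DomainRateLimiter(delay_seconds=2.0)
limiter.wait("https://example.com/page/1")  # first hit on the domain: no wait
limiter.wait("https://example.com/page/2")  # same domain: waits ~2 seconds
```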

> The hard part of a crawler is mostly identifying and debugging the bottleneck. Does the server side throttle? Restrict IP addresses? Use encryption? etc.

## Core Logic

The basic algorithm executed by any web crawler is to take a list of whitelisted domains and **seed URLs** as its input and repeatedly execute the following steps.

* pick a URL from the unvisited URL list.
* async crawl the URL
* parse the document contents, store the data, and index the contents
* look for new URLs within the whitelisted domains; add them to the list of unvisited URLs (a sketch of the full loop follows)
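
A single-threaded Python sketch of this loop; `fetch`, `parse_links`, and `store` are hypothetical stand-ins for the real crawler, parser, and data store:

```python
from collections import deque
from urllib.parse import urlparse


def crawl(seed_urls, whitelisted_domains, fetch, parse_links, store):
    """Single-threaded version of the core loop; the real system runs it async."""
    unvisited = deque(seed_urls)
    visited = set()

    while unvisited:
        url = unvisited.popleft()            # pick a URL from the unvisited list
        if url in visited:
            continue
        visited.add(url)

        document = fetch(url)                # crawl the URL
        store(url, document)                 # store raw contents for parsing / indexing

        for link in parse_links(document):   # discover new URLs
            if urlparse(link).netloc in whitelisted_domains and link not in visited:
                unvisited.append(link)       # only follow links in whitelisted domains
```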

## Architecture

![](https://2407442552-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-Lpv9LvHzpublmUWisvz%2Fsync%2Fd782154f2903700e5a4934e3e0f1a14fff968286.png?generation=1631603199083410\&alt=media)

* Solid lines: real-time
* Dashed lines: async processing

### Scheduler

* periodically or on demand, trigger a new crawl
* insert seed URLs

### Crawler

* **stateless** crawler: why emphasize stateless? If the crawler fails, it can be restarted at any time without depending on any state.
* async scalable worker
* deep copy vs shallow copy (define your own crawler according to your own requirements)

### Parser

> Separate the parser from the crawler: separation of concerns. If parsing fails, there is no need to re-crawl the content that was already fetched.

* async **stateless** worker
* parse and validate the raw data
* write the validated data to a clean store

### Notification

* notify the user of the current progress
* notify about stalled processes, unexpected schemas, etc.

### Sweeper

* monitor the progress and re-trigger a crawl or parsing if no response
* generate a notification if an unknown schema or exception is met.

## Database Choice

* **Transactional / easier with rich SQL queries**: MySQL / Postgres
* **Key-value / strict access patterns / flexible schema**:
  * DynamoDB (the cumulative size of attributes per item must fit within the maximum DynamoDB item size of 400 KB)
  * S3 for large objects
* SQL scaling requires some extra work: **sharding**.
* Raw Data:
  * key-value type, read / write
  * Amazon S3 / DynamoDB / Cassandra / MongoDB
* Metadata / clean data:
  * SQL type
  * require durability and transaction

> In real production, unless the data volume is extremely large, prefer SQL when you can: 1. SQL engines have had a huge amount of optimization work put into them; 2. NoSQL effectively hands part of the database's work back to the customer. Using SQL improves development efficiency.

## API

`addScheduledCrawl(domain_names, cron_expression=null)`

* add a request to scheduler to start crawler

`startCrawling(domain_names, iteration, blacklisted_URLs, type {on_demand, schedule})`

* internal API to execute/trigger crawling
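
A hedged sketch of what these two calls might look like; the parameter names follow the signatures above, while the return value and storage steps are assumptions:

```python
import uuid


def add_scheduled_crawl(domain_names, cron_expression=None) -> str:
    """Public API: register a crawl with the scheduler.
    A null cron_expression means a one-off, on-demand crawl."""
    schedule_id = str(uuid.uuid4())
    # persist (schedule_id, domain_names, cron_expression) to the scheduling table here
    return schedule_id


def start_crawling(domain_names, iteration, blacklisted_urls, type) -> None:
    """Internal API: trigger one crawl iteration.
    `type` is one of {"on_demand", "schedule"}."""
    # insert seed URLs for this iteration into the record metadata table,
    # then send-and-forget crawl commands to the crawler workers
    ...
```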

## Data Model

### Raw Data Table (S3)

| id   | data      | CreatedTime |
| ---- | --------- | ----------- |
| uuid | HTML/Json | 45027824    |
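
If S3 is used, one way to store a raw record is one object per uuid; a minimal boto3 sketch, where the bucket name and key layout are assumptions:

```python
import json
import time
import uuid

import boto3

s3 = boto3.client("s3")
RAW_BUCKET = "crawler-raw-data"  # hypothetical bucket name


def store_raw(html_or_json: str) -> str:
    """Write one raw record as a single S3 object; returns the generated id."""
    record_id = str(uuid.uuid4())
    body = json.dumps({"data": html_or_json, "created_time": int(time.time())})
    s3.put_object(Bucket=RAW_BUCKET, Key=f"raw/{record_id}", Body=body.encode("utf-8"))
    return record_id
```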

### Clean Data (SQL)

* stores crawled data with different schemas

| Hash(URL, iteration) (shard key) | type         | URL         | iteration | field\_1  |
| -------------------------------- | ------------ | ----------- | --------- | --------- |
| string                           | Game\_replay | steam.com/1 | 10        | Hero:Riki |

| Hash(URL, iteration) (shard key) | type | URL       | iteration | headline | main\_text  |
| -------------------------------- | ---- | --------- | --------- | -------- | ----------- |
| string                           | News | cnn.com/1 | 2         | Uber     | What's new? |

### Iteration Progress (SQL)

* primary key: schedule\_id + iteration

| schedule\_id | iteration | start\_time | urls\_found | stuck | crawled | processed |
| ------------ | --------- | ----------- | ----------- | ----- | ------- | --------- |
| uuid         | 9         | 25469836    | 1000        | 10    | 9000    | 8600      |

### Record Metadata

* shard key: id = hash(url, iteration)

| id  | url       | iteration | state  | creation\_time | last\_update | notified | deleted |
| --- | --------- | --------- | ------ | -------------- | ------------ | -------- | ------- |
| str | games.net | 9         | QUEUED | 32654965       | 10           | No       | Yes     |

* STATE = {QUEUED, CRAWLED, PROCESSED, STUCK}
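
The state enum and the `hash(url, iteration)` shard key could be sketched as follows; the exact hash function is an assumption:

```python
import hashlib
from enum import Enum


class RecordState(Enum):
    QUEUED = "QUEUED"
    CRAWLED = "CRAWLED"
    PROCESSED = "PROCESSED"
    STUCK = "STUCK"


def shard_key(url: str, iteration: int) -> str:
    """id = hash(url, iteration); the same input always maps to the same row/shard,
    which keeps retries idempotent."""
    return hashlib.sha256(f"{url}#{iteration}".encode("utf-8")).hexdigest()


print(shard_key("games.net", 9))  # deterministic shard key for the example row above
```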

### Metadata Scheduling Table (SQL)

| id   | Cron\_Pattern | user | creation | name | URLs              | state             |
| ---- | ------------- | ---- | -------- | ---- | ----------------- | ----------------- |
| uuid | 0 8 \* \* \*  | yang | 24586972 | news | Link to YAML file | {ON, PAUSED, OFF} |

## Fault Tolerance

### Scheduler (Apache Airflow)

```
1. Read YAML: what to crawl
2. Insert seeds into the record metadata table; no-op if they already exist
    a. Send-and-forget crawl commands to the crawler
3. Insert a row into the iteration progress table; no-op if it already exists
4. On any fault, go back to step 1.
```
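
A Python sketch of that scheduler pass, with the tables and the crawler queue abstracted as in-memory objects and hypothetical field names:

```python
def run_scheduler(schedule, records_metadata, iteration_progress, crawler_queue, load_yaml):
    """One idempotent scheduler pass: safe to rerun from the top after any fault."""
    config = load_yaml(schedule["urls"])                    # 1. read YAML: what to crawl

    for url in config["seed_urls"]:
        key = (url, schedule["iteration"])
        if key not in records_metadata:                     # 2. no-op if the seed already exists
            records_metadata[key] = {"state": "QUEUED"}
            crawler_queue.put(key)                          # 2a. send-and-forget to the crawler

    progress_key = (schedule["id"], schedule["iteration"])
    iteration_progress.setdefault(progress_key, {           # 3. no-op if the progress row exists
        "urls_found": len(config["seed_urls"]),
        "crawled": 0,
        "processed": 0,
    })
    # 4. on any fault, simply rerun this function from step 1
```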

Heavy-load processing is usually **stateless** and **send & forget**.

### Crawler: stateless

* Read the URL that needs to be crawled; if already crawled, no-op
* Crawl the URL; store the data into the Raw Data table
* Update the record to the `CRAWLED` state
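
A sketch of the stateless crawler step under the same in-memory abstractions; `fetch` is a hypothetical HTTP helper:

```python
def crawl_one(record_key, records_metadata, raw_data, fetch):
    """Stateless crawler step: retrying it at any time has no extra side effects."""
    record = records_metadata[record_key]
    if record["state"] != "QUEUED":       # already crawled (or further along): no-op
        return

    url, _iteration = record_key
    raw_data[record_key] = fetch(url)     # store raw HTML/JSON into the raw data table
    record["state"] = "CRAWLED"           # mark the record as crawled
```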

### Parser: stateless

* Read the record that needs to be parsed; if already processed / stuck, no-op
* Parse and validate the record:
  * If invalid, update the record to `STUCK` and return
* Store the validated record in the clean data store
* Update the record state to `PROCESSED`
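
A matching sketch of the stateless parser step; `parse_and_validate` is a hypothetical helper that returns a clean row or `None`:

```python
def parse_one(record_key, records_metadata, raw_data, clean_data, parse_and_validate):
    """Stateless parser step; re-running it on a finished record is a no-op."""
    record = records_metadata[record_key]
    if record["state"] in ("PROCESSED", "STUCK"):   # already handled: no-op
        return

    clean_row = parse_and_validate(raw_data[record_key])
    if clean_row is None:                           # invalid data / unknown schema
        record["state"] = "STUCK"
        return

    clean_data[record_key] = clean_row              # write to the clean data store
    record["state"] = "PROCESSED"
```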

### Sweeper: periodically wake up

* Read the records in `STUCK` state that have not yet been notified
  * Notify developers to take action
  * Mark the record `notified`
* Read the records that have stayed in `QUEUED` state for a long period of time
  * Re-queue a crawl command
* Read the records that have stayed in `CRAWLED` state for a long period of time
  * Re-queue a parse command
* Update the iteration progress records:
  * Send a summary notification to users
  * May need to consolidate counts across shards
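
A sketch of one sweeper pass under the same abstractions; the 30-minute timeout is an assumed threshold, not part of the original design:

```python
import time

STUCK_TIMEOUT = 30 * 60  # assumed: 30 minutes without progress counts as stalled


def sweep(records_metadata, crawler_queue, parser_queue, notify):
    """One periodic sweep: surface stuck records and re-queue stalled work."""
    now = time.time()
    for key, record in records_metadata.items():
        if record["state"] == "STUCK" and not record.get("notified"):
            notify(f"record {key} is stuck: unknown schema or exception")
            record["notified"] = True                # avoid duplicate notifications
        elif record["state"] == "QUEUED" and now - record["last_update"] > STUCK_TIMEOUT:
            crawler_queue.put(key)                   # crawl never finished: re-queue it
        elif record["state"] == "CRAWLED" and now - record["last_update"] > STUCK_TIMEOUT:
            parser_queue.put(key)                    # parse never finished: re-queue it
```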

All of these components are effectively infinitely scalable.

## Q & A

> If the target site has throttling, do we need to guarantee that the same scheduled job is not picked up by different workers at the same time? Otherwise, even though the stored results would still be correct, we could easily trigger the site's throttling.

A: With throttling, you definitely need to spread the crawling across nodes.

> If there are many crawlers and a large number of jobs need to be triggered per unit of time, will the scheduler become a bottleneck?

A: Yes, it can; a distributed scheduler may be needed.
