Design a Web Crawler¶

What We're Building¶

A web crawler (also called a spider or bot) systematically browses the internet to discover and download web pages. It starts with a set of seed URLs, downloads their content, extracts new URLs from those pages, and continues recursively.

Simple example:

Start with: https://example.com
  → Download page
  → Extract links: /about, /products, /blog
  → Add to queue: example.com/about, example.com/products, example.com/blog
  → Download each, extract more links
  → Continue...

Why Web Crawlers Matter¶

Use Case	Description
Search engines	Google, Bing crawl billions of pages to index
SEO tools	Ahrefs, Moz crawl to analyze backlinks
Price monitoring	Track competitor prices across e-commerce
Content aggregation	News aggregators collect articles
Archiving	Wayback Machine preserves the web
Security scanning	Find vulnerabilities across websites
Research	Collect datasets for ML, analytics

The Scale Challenge¶

Crawler	Pages Crawled	Notes
Googlebot	Billions/day	Most comprehensive
Bingbot	Hundreds of millions/day	#2 search engine
Common Crawl	~3 billion pages/month	Open dataset

At this scale, a crawler must handle: - Billions of URLs to track - Millions of concurrent connections - Petabytes of storage - Politeness (don't overload servers) - Freshness (re-crawl changed pages) - Deduplication (same content, different URLs)

Step 1: Requirements Clarification¶

Questions to Ask¶

Question	Why It Matters
What's the purpose?	Search engine vs. monitoring vs. archiving
Scale?	1000 pages vs. billions
Domains to crawl?	Single site vs. entire internet
Freshness requirements?	Real-time vs. weekly vs. never
Content types?	HTML only vs. PDFs, images, etc.
Politeness rules?	Respect robots.txt? Rate limits?

Functional Requirements¶

Requirement	Priority	Description
Crawl web pages from seed URLs	Must have	Core functionality
Extract and follow links	Must have	Discover new pages
Handle robots.txt	Must have	Respect site preferences
Store crawled content	Must have	For indexing/processing
Avoid duplicate crawling	Must have	Don't waste resources
Prioritize important pages	Nice to have	PageRank, freshness
Handle dynamic content (JS)	Nice to have	Single Page Apps

Non-Functional Requirements¶

Requirement	Target	Rationale
Throughput	1,000+ pages/second	Crawl internet in reasonable time
Scalability	Horizontal	Add machines to crawl faster
Politeness	Respect robots.txt, rate limits	Don't get blocked, be ethical
Robustness	Handle failures gracefully	Web is messy
Freshness	Re-crawl based on change frequency	Keep index current

Scope for This Design¶

Let's design a general-purpose web crawler that can: - Crawl the entire web (billions of pages) - Process 1,000 pages/second - Run continuously with periodic re-crawling - Respect robots.txt and politeness

Technology Selection & Tradeoffs¶

Choosing components is not about picking the “best” technology in isolation—it is about matching throughput, operational cost, failure modes, and team skills to the crawler’s workload. Below are common options interviewers expect you to compare.

URL frontier¶

Option	Strengths	Weaknesses	Best when
Redis (sorted sets, streams, lists)	Low latency, familiar ops, good for per-host scoring and short queues	Memory-bound; cross-region consistency is hard; not a durable log by default	Medium scale, need fast dequeue + scoring; combine with persistence (AOF/RDB) or replay source
Kafka (partitioned log)	Durable, replayable, horizontal scale, natural backpressure	Higher latency than Redis; ordering is per-partition; consumer lag is an ops concern	Large-scale distributed crawl; need audit trail and reprocessing of URL batches
Custom in-process priority queue (per worker + coordination)	Maximum control over politeness heuristics	Does not distribute by itself; single-machine limits	Prototypes, single-DC vertical scale, or as a local buffer in front of a distributed queue
Database-backed (e.g., PostgreSQL/Cassandra with indexed `next_crawl_time`)	Strong durability; easy queries for “what to crawl next”	Can become a write/read hotspot; needs careful sharding and batch dequeue	Crawl freshness driven by schedules; smaller QPS or when queue logic maps cleanly to SQL

Why it matters: The frontier is the system’s throttle and fairness layer. Kafka favors durability and replay; Redis favors speed; DB-backed favors queryable schedules at the cost of hot rows.

Content storage¶

Option	Strengths	Weaknesses	Best when
S3-compatible object storage	Infinite scale, cheap cold storage, lifecycle policies, multi-AZ durability	Not a low-latency random-update store; listing at huge prefix scale needs design	Default choice for raw HTML blobs and WARC-style archives
HDFS	Tight integration with Hadoop/Spark batch analytics	Heavier ops; cloud-native teams often prefer object storage	On-prem big-data stacks; batch pipelines already on Hadoop
Cassandra	High write throughput, wide-column model for metadata	Wrong tool for multi-MB blobs; operational complexity	Metadata and crawl state, not primary blob store
MongoDB	Flexible documents, good for moderate metadata	Large binary payloads are expensive; sharding ops for TB+ blobs	Smaller crawls or document-centric secondary indices—not typical for petabyte HTML

Why it matters: Blobs (HTML) belong in cheap, durable object storage; metadata (URL hash, timestamps, status) belongs in a row-oriented or wide-column store optimized for lookups and writes.

URL deduplication¶

Option	Strengths	Weaknesses	Best when
Bloom filter	Tiny memory per URL for huge sets; extremely fast	False positives (may skip never-seen URLs); no delete without rebuild/rehashing	First-line negative cache in front of a definitive store
Redis SET (or HyperLogLog for counts only)	Exact membership for keys in RAM; simple API	Memory grows with unique URLs; cross-region sync is non-trivial	Medium corpora, or per-shard exact sets
RocksDB (or other LSM KV)	Disk-backed exact set at scale; good write throughput; range scans	Tunable compaction/ops; not a probabilistic filter	Authoritative “seen URL” store at billion+ scale

Why it matters: Interviewers want the two-tier pattern: Bloom (or similar) for cheap “probably seen”, plus RocksDB/DB for “definitely seen” to eliminate false positives on enqueue.

DNS resolution¶

Option	Strengths	Weaknesses	Best when
Local cache (per worker or shared Redis)	Fast repeat lookups; reduces resolver load	Stale TTL handling; cold start on new hosts	Always—every fetch path should cache by `(host, ttl)`
Dedicated DNS resolver pool (recursive resolvers you operate)	Rate limiting, monitoring, retry policy centralized	More infra; must respect resolver policies and avoid becoming abusive	High QPS crawl; need observability and fairness across workers

Why it matters: DNS is easy to underestimate. A shared cache avoids thundering herds; dedicated resolvers isolate failures and apply global rate limits to lookups.

HTML parsing¶

Option	Strengths	Weaknesses	Best when
Lightweight parser (jsoup, BeautifulSoup, lxml)	Fast, low CPU/memory; great for static HTML	No JS execution	Default for link extraction and main content on server-rendered pages
Headless browser (Playwright, Puppeteer)	Renders SPAs, executes JS	10–100× cost per page; harder to scale	Known JS-heavy hosts or after HTTP fetch returns shell HTML

Why it matters: Use the cheapest tool that satisfies coverage; reserve headless for a fraction of URLs to protect throughput and cost.

Our choice¶

Area	Choice	Rationale
Frontier	Kafka (partitioned by host hash) + Redis (hot scoring / delays)	Durable backlog and replay for failures; Redis for low-latency politeness metadata if needed
Content	S3 (gzip)	Cost and durability for petabyte-scale blobs; standard for this problem
Dedup	Bloom (cluster-wide) + RocksDB (per shard, exact)	Minimize RAM while bounding false positives with a ground-truth store
DNS	Shared cache + dedicated resolver tier at high QPS	Cache hits dominate; resolvers isolate abuse and ease monitoring
Parsing	Lightweight parser default + headless on allowlist	Maximizes pages/sec while still covering SPAs where required

CAP Theorem Analysis¶

CAP (Consistency, Availability, Partition tolerance) is a useful mental model, not a literal database knob—especially for a batch-leaning crawler where “consistency” means no wasted work and correct politeness, not linearizable reads everywhere.

How to frame it in an interview¶

Partition tolerance (P) is non-negotiable in a distributed crawler: networks split, brokers restart, regions fail.
You then choose per subsystem whether you favor stronger consistency (dedup, crawl commitments) or liveness (keeping the frontier moving).

Per-component view¶

Component	CAP leaning	Consistency concern	Availability / liveness
URL deduplication (authoritative)	CP	Must not permanently wrongfully skip URLs (mitigate Bloom FP with DB check)	Brief unavailability may stall enqueue; prefer retry over wrong skip
Frontier queue	AP	Duplicate deliveries are acceptable idempotently	Must keep dequeuing under failures (at-least-once is OK)
Politeness / rate state	AP with convergence	Per-host clocks can temporarily violate delay; should self-correct	Crawl should not stop globally if one shard’s delay cache is stale
Content store (S3)	AP (eventual listing/consistency models vary)	Uniqueness via content hash + key idempotency	Writes succeed independently per object
Crawl metadata DB	Tunable (often CP within quorum)	Same URL should not diverge into contradictory “never crawled” vs “stored”	Reads can be eventually consistent for dashboards

Why dedup “must be consistent” but frontier “can be AP”: Duplicate frontier work wastes bandwidth but is safe if dedup and storage are idempotent. Wrong dedup loses coverage—harder to fix than wasting a fetch.

Diagram¶

flowchart LR subgraph CP["CP-leaning (ground truth)"] Dedup[(Authoritative dedup\nRocksDB / DB)] Meta[(Crawl metadata\ncommitted state)] end subgraph AP["AP-leaning (keep moving)"] Frontier[[Frontier\nKafka / Redis]] Polite[Politeness cache] end Frontier -->|"at-least-once URL"| Workers Workers --> Dedup Workers --> Meta Polite -.->|"best-effort delay"| Workers

Under a partition, prefer: frontier may duplicate, workers may re-fetch occasionally, but dedup + idempotent writes prevent permanent inconsistency in “what the corpus contains.”

SLA and SLO Definitions¶

SLAs are contracts (often with customers); SLOs are internal targets measured by SLIs. For a crawler, SLOs align coverage, freshness, ethics, and cost.

Service level indicators (SLIs) and objectives (SLOs)¶

Dimension	SLI (what we measure)	Example SLO	Notes
Crawl throughput	Successful `200` HTML fetches / sec (or / min sustained)	≥ 400 pages/sec monthly 99th percentile of 5-min windows	Exclude intentionally skipped (robots, non-HTML); align with capacity plan
Crawl freshness	Time between successful crawls of same URL (p50 / p95)	p95 ≤ 14 days for “news” class; p95 ≤ 90 days for static class	Requires URL tiering; single global number is a red flag in interviews
Politeness	Requests per host per second (max over 1-min window)	≤ 1 rps/host default unless `robots` allows more	Measure per crawler User-Agent aggregate if multiple shards hit same host
Dedup accuracy	Ratio of false “already seen” decisions that skip new URLs (sampled audit)	≤ 0.01% bad skips after Bloom + DB verification	Bloom alone cannot hit this—audit matters
Content durability	Objects in blob store with successful periodic HEAD/verify or replication checks	≥ 99.999999999% annual durability (provider SLA) for stored objects	Pair with checksum on write; lifecycle to detect corruption

Error budget policy¶

Element	Policy
Budget	If monthly error budget is consumed (e.g., freshness SLO misses > 5% of URLs in a tier), freeze new features (e.g., new JS rendering modes) until recovery.
Burn alerts	Fast burn: budget will exhaust in days; page on-call, scale workers, shed low-priority tiers. Slow burn: investigate scheduler drift, bad seeds, or dedup regression.
Tradeoffs	Spending budget on higher throughput may consume politeness budget—explicitly cap QPS before raising concurrency.

Interview tip: Tie each SLO to a dashboard and an alert; mention tiered freshness so you do not promise one number for the whole web.

Database Schema¶

Schemas separate what must be queryable (scheduling, status, graph analytics) from blobs (stored in object storage). Types are illustrative—adjust for your SQL/NoSQL choice.

`url_frontier` (work queue + scheduling)¶

Column	Type	Description
`url_id`	`UUID` or `BYTEA` (hash)	Stable id; hash of canonical URL if UUID not used
`url`	`TEXT`	Canonical URL
`priority`	`DOUBLE` or `BIGINT`	Crawl priority (PageRank, freshness score, tier)
`state`	`ENUM`	e.g. `pending`, `scheduled`, `leased`, `fetched`, `failed`, `skipped_robots`
`next_crawl_time`	`TIMESTAMPTZ`	When the URL is eligible again (freshness / retry backoff)
`lease_owner`	`TEXT` (nullable)	Worker or session id if using lease-based dequeue
`lease_expires_at`	`TIMESTAMPTZ` (nullable)	For stuck work reclaim
`discovered_from`	`TEXT` (nullable)	Provenance for debugging (optional)

Indexes: (state, next_crawl_time), (priority DESC, next_crawl_time) for batch selection; shard key often hash(url) or host.

`crawled_pages` (fetch outcome + pointer to blob)¶

Column	Type	Description
`url_id`	same as frontier	FK or matching hash
`url`	`TEXT`	Final URL after redirects
`content_hash`	`BYTEA` (SHA-256)	Deduplication and integrity
`crawl_timestamp`	`TIMESTAMPTZ`	End of successful fetch
`http_status`	`SMALLINT`	e.g. 200, 404, 429
`response_headers`	`JSONB`	Subset: `Content-Type`, `ETag`, `Last-Modified`
`storage_key`	`TEXT`	S3 key or equivalent
`content_length`	`BIGINT`	Bytes stored (compressed or raw, document clearly)

Indexes: content_hash (for duplicate detection), crawl_timestamp (for incremental exports).

`link_graph` (optional, for ranking / debugging)¶

Column	Type	Description
`source_url_id`	`UUID` / hash	Outlink source
`target_url_id`	`UUID` / hash	Outlink target
`anchor_text`	`TEXT` (nullable)	Useful for ranking; can be omitted at huge scale
`discovered_at`	`TIMESTAMPTZ`	When edge was seen

Primary key / index: (source_url_id, target_url_id); optional inverted index on target_url_id for backlink queries (expensive at scale—often sampled).

API Design¶

External APIs control the crawl (seeds, politeness) and observe it (status, results). All writes should be authenticated; rate-limit public endpoints.

Overview¶

Endpoint	Method	Purpose
`/v1/seeds`	`POST`	Submit seed URLs (and optional crawl tier / max depth)
`/v1/crawls/{crawl_job_id}`	`GET`	Job status: queued, running, completed, failed counts
`/v1/hosts/{host}/politeness`	`PUT`	Set rps, crawl-delay override, or pause for a host
`/v1/results`	`GET`	Query crawl results by URL or time range (paged)

`POST /v1/seeds`¶

Body (JSON):

{
  "seeds": ["https://example.com/", "https://news.example.com/"],
  "job_name": "monthly-recrawl",
  "priority": "normal",
  "respect_robots": true
}

Response 202 Accepted: { "crawl_job_id": "...", "queued_urls": 2 } — enqueue is async.

Why: Seeds can be large; accept quickly, process in the frontier.

`GET /v1/crawls/{crawl_job_id}`¶

Response 200:

{
  "crawl_job_id": "...",
  "state": "running",
  "submitted_at": "2026-04-05T12:00:00Z",
  "stats": {
    "fetched_ok": 1200000,
    "failed": 8900,
    "skipped_robots": 45000,
    "queue_depth_estimate": 50000000
  }
}

Why: Operators need aggregate progress, not per-URL polling.

`PUT /v1/hosts/{host}/politeness`¶

Body:

{
  "max_rps": 0.5,
  "crawl_delay_ms": 2000,
  "paused": false,
  "effective_until": "2026-04-12T00:00:00Z"
}

Response 204 No Content on success.

Why: Incidents (429 storms, site owner contact) require fast policy changes without redeploying workers.

`GET /v1/results`¶

Query params: url (exact), since, until, page_size, page_token.

Response: Paged list of { "url", "content_hash", "crawl_timestamp", "http_status", "storage_key" }.

Why: Downstream indexers consume metadata + pointer; bulk export may use separate batch APIs or data lake.

Error model (consistent)¶

HTTP	When
`400`	Invalid URL, bad host, conflicting politeness
`401` / `403`	Auth failures
`404`	Unknown `crawl_job_id` or host rule
`429`	Client exceeded API rate limit (distinct from origin `429`)
`503`	Frontier overload; client should retry with backoff

Step 2: Back-of-Envelope Estimation¶

Scale Assumptions¶

Target: Crawl 1 billion web pages per month
Average page size: 500 KB (HTML + headers)
Crawl window: 30 days continuous
Re-crawl frequency: 30% of pages re-crawled per month

Traffic Estimation¶

Pages per day     = 1B / 30 = ~33 million pages/day
Pages per second  = 33M / 86,400 ≈ 380 pages/sec → round to 400 pages/sec
Peak (3x)         = 1,200 pages/sec

Bandwidth Estimation¶

Average page size       = 500 KB
Bandwidth (avg)         = 400 pages/sec × 500 KB = 200 MB/sec ≈ 1.6 Gbps
Bandwidth (peak)        = 1,200 × 500 KB = 600 MB/sec ≈ 4.8 Gbps

Storage Estimation¶

Raw HTML storage/month  = 1B × 500 KB = 500 TB/month
Compressed (5:1 ratio)  = 100 TB/month
Yearly storage          = 1.2 PB/year
Metadata per page       = ~1 KB (URL, timestamp, hash, status)
Metadata total          = 1B × 1 KB = 1 TB
URL frontier size       = 10B URLs × 200 bytes = 2 TB

Infrastructure Sizing¶

Component	Estimate	Implication
Crawl workers	400 pages/sec ÷ 10 pages/sec/worker = 40 workers	40-50 VMs or containers
Bandwidth	1.6 Gbps average	Dedicated network, multiple ISP egress
Storage	100 TB/month compressed	Distributed storage (S3/HDFS)
URL frontier	2 TB	Distributed queue (Kafka or Redis cluster)
DNS cache	100M domains × 100 bytes = 10 GB	In-memory DNS resolver cache

Note

These estimates show why web crawling requires distributed architecture — no single machine can handle the bandwidth, storage, or compute requirements.

Step 3: High-Level Architecture¶

Core Components¶

flowchart TB subgraph Input["Input"] Seeds[Seed URLs] end subgraph Frontier["URL Frontier"] Queue[Priority Queue] Scheduler[Politeness Scheduler] end subgraph Fetcher["Fetcher Cluster"] F1[Fetcher 1] F2[Fetcher 2] FN[Fetcher N] DNS[DNS Resolver] Robots[robots.txt Cache] end subgraph Parser["Parser Cluster"] P1[Parser 1] P2[Parser 2] PN[Parser N] end subgraph Storage["Storage"] Content[(Content Store)] URLStore[(URL Store)] Dedup[Dedup Service] end Seeds --> Queue Queue --> Scheduler Scheduler --> F1 Scheduler --> F2 Scheduler --> FN F1 --> DNS F2 --> DNS FN --> DNS F1 --> Robots F2 --> Robots FN --> Robots F1 --> P1 F2 --> P2 FN --> PN P1 --> Content P2 --> Content PN --> Content P1 --> Dedup P2 --> Dedup PN --> Dedup Dedup --> Queue

Crawl Workflow¶

sequenceDiagram participant Frontier as URL Frontier participant Scheduler as Politeness Scheduler participant Fetcher participant DNS as DNS Resolver participant Web as Web Server participant Parser participant Storage as Content Store participant Dedup as Deduplication Frontier->>Scheduler: Get next URL Scheduler->>Scheduler: Check politeness (can we crawl this host now?) Scheduler->>Fetcher: URL to fetch Fetcher->>DNS: Resolve domain DNS-->>Fetcher: IP address Fetcher->>Web: HTTP GET Web-->>Fetcher: HTML content Fetcher->>Parser: Parse content Parser->>Parser: Extract text, links Parser->>Storage: Store content Parser->>Dedup: Check new URLs Dedup->>Dedup: Filter seen URLs Dedup->>Frontier: Add new URLs

Step 4: Component Deep Dive¶

4.1 URL Frontier¶

The frontier manages the queue of URLs to crawl. It's not a simple FIFO queue—it must handle: - Prioritization: Important pages first - Politeness: Don't hammer same host - Freshness: Re-crawl changed pages

Architecture:

flowchart TB subgraph Frontier2["URL Frontier"] Input[New URLs] Prioritizer[Prioritizer] subgraph PerHostQueues["Per-Host Queues"] Q1[example.com queue] Q2[other.com queue] QN[... queues] end Selector[Host Selector Round Robin] Output[Next URL] end Input --> Prioritizer Prioritizer --> Q1 Prioritizer --> Q2 Prioritizer --> QN Q1 --> Selector Q2 --> Selector QN --> Selector Selector --> Output

Implementation:

PythonJavaGo

from dataclasses import dataclass
from typing import Dict, List, Optional
import heapq
import time
from urllib.parse import urlparse

@dataclass
class URLEntry:
    url: str
    priority: float
    depth: int
    discovered_at: float

    def __lt__(self, other):
        return self.priority > other.priority  # Higher priority first

class URLFrontier:
    def __init__(self, politeness_delay: float = 1.0):
        self.host_queues: Dict[str, List[URLEntry]] = {}
        self.host_last_access: Dict[str, float] = {}
        self.politeness_delay = politeness_delay
        self.hosts_ready: List[tuple] = []  # (next_access_time, host)

    def add_url(self, url: str, priority: float = 1.0, depth: int = 0):
        """Add URL to appropriate host queue."""
        host = urlparse(url).netloc

        if host not in self.host_queues:
            self.host_queues[host] = []
            # New host is immediately ready
            heapq.heappush(self.hosts_ready, (0, host))

        entry = URLEntry(
            url=url,
            priority=priority,
            depth=depth,
            discovered_at=time.time()
        )
        heapq.heappush(self.host_queues[host], entry)

    def get_next_url(self) -> Optional[str]:
        """Get next URL respecting politeness."""
        now = time.time()

        while self.hosts_ready:
            next_time, host = heapq.heappop(self.hosts_ready)

            if next_time > now:
                # No host is ready yet
                heapq.heappush(self.hosts_ready, (next_time, host))
                return None

            if host in self.host_queues and self.host_queues[host]:
                entry = heapq.heappop(self.host_queues[host])

                # Schedule next access for this host
                next_access = now + self.politeness_delay
                heapq.heappush(self.hosts_ready, (next_access, host))

                self.host_last_access[host] = now
                return entry.url

        return None

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * URL Frontier backed by a priority queue with politeness enforcement.
 * In production, replace with Kafka-backed distributed frontier.
 */
public class URLFrontier {
    private final PriorityBlockingQueue<CrawlTask> queue;
    private final ConcurrentHashMap<String, Long> domainLastAccess;
    private final long politeDelayMs;

    public record CrawlTask(String url, int priority, int depth)
            implements Comparable<CrawlTask> {
        @Override
        public int compareTo(CrawlTask other) {
            return Integer.compare(this.priority, other.priority);
        }
    }

    public URLFrontier(long politeDelayMs) {
        this.queue = new PriorityBlockingQueue<>();
        this.domainLastAccess = new ConcurrentHashMap<>();
        this.politeDelayMs = politeDelayMs;
    }

    public void add(String url, int priority, int depth) {
        queue.offer(new CrawlTask(url, priority, depth));
    }

    public CrawlTask next() throws InterruptedException {
        while (true) {
            CrawlTask task = queue.take();
            String domain = URI.create(task.url()).getHost();
            long lastAccess = domainLastAccess.getOrDefault(domain, 0L);
            long elapsed = System.currentTimeMillis() - lastAccess;

            if (elapsed < politeDelayMs) {
                Thread.sleep(politeDelayMs - elapsed);
            }
            domainLastAccess.put(domain, System.currentTimeMillis());
            return task;
        }
    }

    public int size() { return queue.size(); }
}

/**
 * Crawler worker that fetches pages, extracts links, and publishes results.
 */
public class CrawlerWorker implements Runnable {
    private static final Pattern LINK_PATTERN =
        Pattern.compile("href=[\"'](https?://[^\"'\\s>]+)[\"']", Pattern.CASE_INSENSITIVE);

    private final URLFrontier frontier;
    private final HttpClient httpClient;
    private final Set<String> visited;
    private final int maxDepth;
    private volatile boolean running = true;

    public CrawlerWorker(URLFrontier frontier, Set<String> visited, int maxDepth) {
        this.frontier = frontier;
        this.visited = visited;
        this.maxDepth = maxDepth;
        this.httpClient = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();
    }

    @Override
    public void run() {
        while (running) {
            try {
                URLFrontier.CrawlTask task = frontier.next();
                if (task.depth() > maxDepth) continue;
                if (!visited.add(task.url())) continue;

                HttpResponse<String> response = httpClient.send(
                    HttpRequest.newBuilder()
                        .uri(URI.create(task.url()))
                        .timeout(Duration.ofSeconds(15))
                        .header("User-Agent", "SystemDesignBot/1.0")
                        .GET()
                        .build(),
                    HttpResponse.BodyHandlers.ofString()
                );

                if (response.statusCode() == 200) {
                    String body = response.body();
                    processPage(task.url(), body);
                    extractLinks(body, task.depth()).forEach(link ->
                        frontier.add(link, task.priority() + 1, task.depth() + 1));
                }
            } catch (Exception e) {
                // log and continue — one page failure shouldn't stop the crawler
            }
        }
    }

    private List<String> extractLinks(String html, int currentDepth) {
        List<String> links = new ArrayList<>();
        Matcher matcher = LINK_PATTERN.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }

    private void processPage(String url, String content) {
        // store content, update index, emit to pipeline
    }

    public void shutdown() { running = false; }
}

package crawler

import (
    "context"
    "fmt"
    "io"
    "net/http"
    "net/url"
    "regexp"
    "sync"
    "time"
)

var linkPattern = regexp.MustCompile(`href=["'](https?://[^"'\s>]+)["']`)

type CrawlResult struct {
    URL        string
    StatusCode int
    Body       string
    Links      []string
    Duration   time.Duration
    Error      error
}

type Crawler struct {
    client         *http.Client
    visited        sync.Map
    frontier       chan string
    results        chan CrawlResult
    politeDelay    time.Duration
    domainLastHit  sync.Map
    maxDepth       int
    workerCount    int
}

func NewCrawler(workerCount int, politeDelay time.Duration, maxDepth int) *Crawler {
    return &Crawler{
        client: &http.Client{
            Timeout: 15 * time.Second,
            CheckRedirect: func(req *http.Request, via []*http.Request) error {
                if len(via) >= 5 {
                    return fmt.Errorf("too many redirects")
                }
                return nil
            },
        },
        frontier:    make(chan string, 100000),
        results:     make(chan CrawlResult, 10000),
        politeDelay: politeDelay,
        maxDepth:    maxDepth,
        workerCount: workerCount,
    }
}

func (c *Crawler) Start(ctx context.Context, seedURLs []string) <-chan CrawlResult {
    for _, u := range seedURLs {
        c.frontier <- u
    }

    var wg sync.WaitGroup
    for i := 0; i < c.workerCount; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            c.worker(ctx)
        }()
    }

    go func() {
        wg.Wait()
        close(c.results)
    }()

    return c.results
}

func (c *Crawler) worker(ctx context.Context) {
    for {
        select {
        case <-ctx.Done():
            return
        case rawURL, ok := <-c.frontier:
            if !ok {
                return
            }
            if _, loaded := c.visited.LoadOrStore(rawURL, true); loaded {
                continue
            }
            c.enforcePoliteness(rawURL)
            result := c.fetch(rawURL)
            c.results <- result

            if result.Error == nil {
                for _, link := range result.Links {
                    if _, seen := c.visited.Load(link); !seen {
                        select {
                        case c.frontier <- link:
                        default: // frontier full, drop URL
                        }
                    }
                }
            }
        }
    }
}

func (c *Crawler) fetch(rawURL string) CrawlResult {
    start := time.Now()

    req, err := http.NewRequest("GET", rawURL, nil)
    if err != nil {
        return CrawlResult{URL: rawURL, Error: err, Duration: time.Since(start)}
    }
    req.Header.Set("User-Agent", "SystemDesignBot/1.0")

    resp, err := c.client.Do(req)
    if err != nil {
        return CrawlResult{URL: rawURL, Error: err, Duration: time.Since(start)}
    }
    defer resp.Body.Close()

    body, _ := io.ReadAll(io.LimitReader(resp.Body, 5*1024*1024)) // 5MB max
    links := extractLinks(string(body), rawURL)

    return CrawlResult{
        URL:        rawURL,
        StatusCode: resp.StatusCode,
        Body:       string(body),
        Links:      links,
        Duration:   time.Since(start),
    }
}

func (c *Crawler) enforcePoliteness(rawURL string) {
    parsed, err := url.Parse(rawURL)
    if err != nil {
        return
    }
    domain := parsed.Host

    if lastHit, ok := c.domainLastHit.Load(domain); ok {
        elapsed := time.Since(lastHit.(time.Time))
        if elapsed < c.politeDelay {
            time.Sleep(c.politeDelay - elapsed)
        }
    }
    c.domainLastHit.Store(domain, time.Now())
}

func extractLinks(html, baseURL string) []string {
    matches := linkPattern.FindAllStringSubmatch(html, -1)
    var links []string
    for _, match := range matches {
        if len(match) > 1 {
            links = append(links, match[1])
        }
    }
    return links
}

4.2 Politeness and robots.txt¶

robots.txt tells crawlers what they can/can't access:

# Example robots.txt
User-agent: *
Disallow: /private/
Disallow: /admin/
Crawl-delay: 10

User-agent: Googlebot
Allow: /
Crawl-delay: 1

Implementation:

import urllib.robotparser
from functools import lru_cache
import asyncio
import aiohttp

class RobotsCache:
    def __init__(self, user_agent: str = "MyCrawler/1.0"):
        self.user_agent = user_agent
        self.cache: Dict[str, urllib.robotparser.RobotFileParser] = {}
        self.cache_ttl: Dict[str, float] = {}
        self.ttl = 3600  # Cache robots.txt for 1 hour

    async def fetch_robots(self, host: str) -> urllib.robotparser.RobotFileParser:
        """Fetch and parse robots.txt for a host."""
        robots_url = f"https://{host}/robots.txt"

        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(robots_url)

        try:
            async with aiohttp.ClientSession() as session:
                async with session.get(robots_url, timeout=10) as response:
                    if response.status == 200:
                        content = await response.text()
                        rp.parse(content.splitlines())
                    else:
                        # No robots.txt = allow all
                        rp.parse([])
        except Exception:
            # Error fetching = allow all (fail open)
            rp.parse([])

        return rp

    async def can_fetch(self, url: str) -> bool:
        """Check if we're allowed to fetch this URL."""
        host = urlparse(url).netloc

        # Check cache
        now = time.time()
        if host in self.cache and self.cache_ttl.get(host, 0) > now:
            rp = self.cache[host]
        else:
            rp = await self.fetch_robots(host)
            self.cache[host] = rp
            self.cache_ttl[host] = now + self.ttl

        return rp.can_fetch(self.user_agent, url)

    def get_crawl_delay(self, host: str) -> float:
        """Get crawl delay for host."""
        if host in self.cache:
            delay = self.cache[host].crawl_delay(self.user_agent)
            return delay if delay else 1.0
        return 1.0  # Default 1 second

4.3 Fetcher¶

The fetcher downloads web pages. It must handle: - Concurrent connections: Maximize throughput - Timeouts: Don't wait forever - Retries: Handle transient failures - HTTP headers: Set User-Agent, handle redirects

Implementation:

import aiohttp
import asyncio
from dataclasses import dataclass
from typing import Optional
import hashlib

@dataclass
class FetchResult:
    url: str
    final_url: str  # After redirects
    status_code: int
    content_type: str
    content: bytes
    content_hash: str
    headers: dict
    fetch_time: float

class AsyncFetcher:
    def __init__(self, 
                 user_agent: str = "MyCrawler/1.0",
                 timeout: int = 30,
                 max_size: int = 10 * 1024 * 1024):  # 10MB
        self.user_agent = user_agent
        self.timeout = aiohttp.ClientTimeout(total=timeout)
        self.max_size = max_size

        self.headers = {
            "User-Agent": user_agent,
            "Accept": "text/html,application/xhtml+xml",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate",
        }

    async def fetch(self, url: str) -> Optional[FetchResult]:
        """Fetch a URL and return content."""
        start_time = time.time()

        try:
            async with aiohttp.ClientSession(
                timeout=self.timeout,
                headers=self.headers
            ) as session:
                async with session.get(
                    url, 
                    allow_redirects=True,
                    max_redirects=5
                ) as response:
                    # Check content type
                    content_type = response.headers.get("Content-Type", "")
                    if not self._should_process(content_type):
                        return None

                    # Check content length
                    content_length = response.headers.get("Content-Length")
                    if content_length and int(content_length) > self.max_size:
                        return None

                    # Read content with size limit
                    content = await response.content.read(self.max_size)

                    return FetchResult(
                        url=url,
                        final_url=str(response.url),
                        status_code=response.status,
                        content_type=content_type,
                        content=content,
                        content_hash=hashlib.sha256(content).hexdigest(),
                        headers=dict(response.headers),
                        fetch_time=time.time() - start_time
                    )

        except asyncio.TimeoutError:
            logger.warning(f"Timeout fetching {url}")
            return None
        except aiohttp.ClientError as e:
            logger.warning(f"Error fetching {url}: {e}")
            return None

    def _should_process(self, content_type: str) -> bool:
        """Check if we should process this content type."""
        allowed = ["text/html", "application/xhtml+xml"]
        return any(ct in content_type for ct in allowed)

4.4 Parser¶

The parser extracts information from HTML: - Links: For further crawling - Text content: For indexing - Metadata: Title, description, etc.

Implementation:

from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
from dataclasses import dataclass
from typing import List, Set

@dataclass
class ParseResult:
    url: str
    title: str
    text: str
    links: List[str]
    metadata: dict

class HTMLParser:
    def __init__(self, allowed_domains: Set[str] = None):
        self.allowed_domains = allowed_domains

    def parse(self, url: str, html_content: bytes) -> ParseResult:
        """Parse HTML and extract content and links."""
        try:
            soup = BeautifulSoup(html_content, 'lxml')
        except Exception:
            soup = BeautifulSoup(html_content, 'html.parser')

        # Remove script and style elements
        for element in soup(['script', 'style', 'nav', 'footer', 'header']):
            element.decompose()

        # Extract title
        title = ""
        if soup.title:
            title = soup.title.string or ""

        # Extract text
        text = soup.get_text(separator=' ', strip=True)

        # Extract links
        links = self._extract_links(url, soup)

        # Extract metadata
        metadata = self._extract_metadata(soup)

        return ParseResult(
            url=url,
            title=title.strip(),
            text=text,
            links=links,
            metadata=metadata
        )

    def _extract_links(self, base_url: str, soup: BeautifulSoup) -> List[str]:
        """Extract and normalize links."""
        links = []

        for a_tag in soup.find_all('a', href=True):
            href = a_tag['href']

            # Skip certain links
            if href.startswith(('#', 'javascript:', 'mailto:', 'tel:')):
                continue

            # Convert to absolute URL
            absolute_url = urljoin(base_url, href)

            # Normalize
            normalized = self._normalize_url(absolute_url)
            if normalized:
                links.append(normalized)

        return list(set(links))  # Deduplicate

    def _normalize_url(self, url: str) -> Optional[str]:
        """Normalize URL for consistency."""
        try:
            parsed = urlparse(url)

            # Only HTTP(S)
            if parsed.scheme not in ('http', 'https'):
                return None

            # Check domain filter
            if self.allowed_domains:
                if parsed.netloc not in self.allowed_domains:
                    return None

            # Remove fragment
            normalized = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"

            # Add query if present
            if parsed.query:
                normalized += f"?{parsed.query}"

            # Remove trailing slash for consistency
            normalized = normalized.rstrip('/')

            return normalized
        except Exception:
            return None

    def _extract_metadata(self, soup: BeautifulSoup) -> dict:
        """Extract meta tags."""
        metadata = {}

        for meta in soup.find_all('meta'):
            name = meta.get('name', meta.get('property', ''))
            content = meta.get('content', '')

            if name and content:
                metadata[name] = content

        return metadata

4.5 Deduplication¶

Avoid crawling the same content twice: - URL deduplication: Same URL seen before - Content deduplication: Same content, different URL

URL Bloom Filter:

from bitarray import bitarray
import mmh3

class BloomFilter:
    """Memory-efficient probabilistic set membership."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Calculate optimal size and hash count
        self.size = self._optimal_size(expected_items, false_positive_rate)
        self.hash_count = self._optimal_hash_count(self.size, expected_items)
        self.bit_array = bitarray(self.size)
        self.bit_array.setall(0)

    def _optimal_size(self, n: int, p: float) -> int:
        """Calculate optimal bit array size."""
        import math
        return int(-n * math.log(p) / (math.log(2) ** 2))

    def _optimal_hash_count(self, m: int, n: int) -> int:
        """Calculate optimal number of hash functions."""
        import math
        return max(1, int(m / n * math.log(2)))

    def _get_hash_values(self, item: str) -> List[int]:
        """Generate hash values for item."""
        hashes = []
        for i in range(self.hash_count):
            hash_val = mmh3.hash(item, i) % self.size
            hashes.append(hash_val)
        return hashes

    def add(self, item: str):
        """Add item to bloom filter."""
        for hash_val in self._get_hash_values(item):
            self.bit_array[hash_val] = 1

    def might_contain(self, item: str) -> bool:
        """Check if item might be in set (can have false positives)."""
        return all(
            self.bit_array[hash_val] 
            for hash_val in self._get_hash_values(item)
        )

class URLDeduplicator:
    def __init__(self, expected_urls: int = 1_000_000_000):
        # For 1 billion URLs with 1% FP rate = ~1.2GB memory
        self.bloom = BloomFilter(expected_urls, 0.01)
        self.seen_count = 0

    def is_new(self, url: str) -> bool:
        """Check if URL is new (not seen before)."""
        normalized = self._normalize(url)

        if self.bloom.might_contain(normalized):
            return False

        self.bloom.add(normalized)
        self.seen_count += 1
        return True

    def _normalize(self, url: str) -> str:
        """Normalize URL for deduplication."""
        # Remove common tracking parameters
        parsed = urlparse(url)
        # ... normalization logic
        return url.lower()

Content Fingerprinting (SimHash):

For detecting near-duplicate content:

import simhash

class ContentDeduplicator:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.fingerprints: Dict[int, str] = {}

    def get_fingerprint(self, text: str) -> int:
        """Generate simhash fingerprint for text."""
        return simhash.Simhash(text).value

    def is_duplicate(self, text: str) -> Optional[str]:
        """Check if content is duplicate of existing page."""
        fp = self.get_fingerprint(text)

        for existing_fp, url in self.fingerprints.items():
            distance = self._hamming_distance(fp, existing_fp)
            if distance <= self.threshold:
                return url  # Return URL of duplicate

        return None

    def add(self, text: str, url: str):
        """Add content fingerprint."""
        fp = self.get_fingerprint(text)
        self.fingerprints[fp] = url

    def _hamming_distance(self, a: int, b: int) -> int:
        """Count differing bits between two integers."""
        return bin(a ^ b).count('1')

4.6 DNS Resolution¶

DNS resolution can be a bottleneck. Cache results:

import socket
import time
from typing import Dict, Tuple
import asyncio

class DNSCache:
    def __init__(self, ttl: int = 3600):
        self.cache: Dict[str, Tuple[str, float]] = {}
        self.ttl = ttl

    async def resolve(self, hostname: str) -> Optional[str]:
        """Resolve hostname to IP, with caching."""
        now = time.time()

        # Check cache
        if hostname in self.cache:
            ip, expiry = self.cache[hostname]
            if expiry > now:
                return ip

        # Resolve
        try:
            loop = asyncio.get_event_loop()
            result = await loop.getaddrinfo(
                hostname, None, 
                family=socket.AF_INET,
                type=socket.SOCK_STREAM
            )
            ip = result[0][4][0]

            # Cache
            self.cache[hostname] = (ip, now + self.ttl)
            return ip
        except socket.gaierror:
            return None

Step 5: Distributed Architecture¶

For crawling billions of pages, we need a distributed system.

Architecture¶

flowchart TB subgraph Coordinator["Coordinator Cluster"] C1[Coordinator 1] C2[Coordinator 2] ZK[(ZooKeeper)] end subgraph Frontier3["URL Frontier"] Kafka[(Kafka)] Redis[(Redis)] end subgraph CrawlWorkers["Crawler Workers"] W1[Worker 1 Fetcher + Parser] W2[Worker 2 Fetcher + Parser] W3[Worker N Fetcher + Parser] end subgraph StorageLayer["Storage Layer"] S3[(S3/HDFS Content)] Cassandra[(Cassandra URL Metadata)] end C1 --> ZK C2 --> ZK C1 --> Kafka C2 --> Kafka Kafka --> W1 Kafka --> W2 Kafka --> W3 W1 --> Redis W2 --> Redis W3 --> Redis W1 --> S3 W2 --> S3 W3 --> S3 W1 --> Cassandra W2 --> Cassandra W3 --> Cassandra W1 --> Kafka W2 --> Kafka W3 --> Kafka

URL Distribution Strategy¶

Use consistent hashing to assign URLs to workers:

import hashlib

class URLDistributor:
    def __init__(self, num_workers: int):
        self.num_workers = num_workers

    def get_worker(self, url: str) -> int:
        """Determine which worker should handle this URL."""
        # Hash by domain for politeness locality
        domain = urlparse(url).netloc
        hash_val = int(hashlib.md5(domain.encode()).hexdigest(), 16)
        return hash_val % self.num_workers

Why hash by domain? - Same domain → same worker - Easier to enforce politeness (one worker per domain) - Better connection reuse

Kafka for URL Queue¶

from kafka import KafkaProducer, KafkaConsumer

class DistributedFrontier:
    def __init__(self, num_partitions: int):
        self.producer = KafkaProducer(
            bootstrap_servers=['kafka:9092'],
            value_serializer=lambda v: json.dumps(v).encode()
        )

        self.consumer = KafkaConsumer(
            'url-frontier',
            group_id='crawlers',
            bootstrap_servers=['kafka:9092'],
            value_deserializer=lambda m: json.loads(m.decode())
        )

        self.num_partitions = num_partitions

    def add_url(self, url: str, priority: float):
        """Add URL to frontier."""
        domain = urlparse(url).netloc
        partition = self._get_partition(domain)

        self.producer.send(
            'url-frontier',
            key=domain.encode(),
            value={'url': url, 'priority': priority},
            partition=partition
        )

    def _get_partition(self, domain: str) -> int:
        """Assign domain to partition."""
        return hash(domain) % self.num_partitions

Handling Failures¶

class CrawlWorker:
    def __init__(self):
        self.fetcher = AsyncFetcher()
        self.parser = HTMLParser()
        self.retry_queue = []
        self.max_retries = 3

    async def process_url(self, url: str, retry_count: int = 0):
        """Process URL with retry logic."""
        try:
            result = await self.fetcher.fetch(url)

            if result is None:
                return

            if result.status_code == 200:
                await self.handle_success(url, result)
            elif result.status_code in (301, 302):
                await self.handle_redirect(url, result)
            elif result.status_code == 429:
                await self.handle_rate_limit(url)
            elif result.status_code >= 500:
                await self.handle_server_error(url, retry_count)

        except Exception as e:
            logger.error(f"Error processing {url}: {e}")
            if retry_count < self.max_retries:
                await self.schedule_retry(url, retry_count + 1)

    async def handle_server_error(self, url: str, retry_count: int):
        """Retry with exponential backoff."""
        if retry_count < self.max_retries:
            delay = 2 ** retry_count  # 1, 2, 4 seconds
            await asyncio.sleep(delay)
            await self.schedule_retry(url, retry_count + 1)

Step 6: Freshness and Re-crawling¶

Not all pages change at the same rate. Prioritize re-crawling accordingly.

Change Detection¶

from datetime import datetime, timedelta

class FreshnessManager:
    def __init__(self):
        self.change_history: Dict[str, List[datetime]] = {}

    def record_change(self, url: str, changed: bool):
        """Record whether page changed since last crawl."""
        if url not in self.change_history:
            self.change_history[url] = []

        if changed:
            self.change_history[url].append(datetime.now())

    def estimate_change_frequency(self, url: str) -> timedelta:
        """Estimate how often this page changes."""
        history = self.change_history.get(url, [])

        if len(history) < 2:
            return timedelta(days=7)  # Default: weekly

        # Calculate average time between changes
        intervals = [
            history[i] - history[i-1] 
            for i in range(1, len(history))
        ]
        avg_interval = sum(intervals, timedelta()) / len(intervals)

        return avg_interval

    def calculate_recrawl_priority(self, url: str, last_crawl: datetime) -> float:
        """Calculate priority for re-crawling."""
        expected_change = self.estimate_change_frequency(url)
        time_since_crawl = datetime.now() - last_crawl

        # Higher priority if we're overdue for a crawl
        return time_since_crawl / expected_change

ETags and Conditional Requests¶

Use HTTP headers to detect changes without downloading:

async def fetch_if_modified(self, url: str, etag: str = None, 
                            last_modified: str = None) -> Optional[FetchResult]:
    """Fetch only if content changed."""
    headers = dict(self.headers)

    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified

    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=headers) as response:
            if response.status == 304:
                # Not modified
                return None

            # Content changed, process as usual
            return await self._process_response(url, response)

Step 7: Handling JavaScript-Rendered Content¶

Modern sites use JavaScript to render content. Options:

Option 1: Headless Browser¶

from playwright.async_api import async_playwright

class JavaScriptRenderer:
    def __init__(self):
        self.playwright = None
        self.browser = None

    async def start(self):
        """Initialize headless browser."""
        self.playwright = await async_playwright().start()
        self.browser = await self.playwright.chromium.launch(
            headless=True,
            args=['--no-sandbox', '--disable-dev-shm-usage']
        )

    async def render(self, url: str, wait_time: int = 5) -> str:
        """Render page with JavaScript."""
        page = await self.browser.new_page()

        try:
            await page.goto(url, wait_until='networkidle')
            await page.wait_for_timeout(wait_time * 1000)

            content = await page.content()
            return content
        finally:
            await page.close()

    async def stop(self):
        """Cleanup."""
        if self.browser:
            await self.browser.close()
        if self.playwright:
            await self.playwright.stop()

Trade-offs:

Approach	Speed	Resource Usage	Coverage
HTTP fetch	Fast	Low	Static content only
Headless browser	Slow (5-10s)	High	Full JS rendering

Recommendation: Use HTTP fetch by default, headless browser for known JS-heavy sites.

Option 2: Prerendering Services¶

Use services like Prerender.io or Rendertron
They cache rendered pages
Reduces load on your infrastructure

Step 8: Storage Design¶

Content Storage¶

Store raw HTML in object storage:

import boto3
import gzip

class ContentStore:
    def __init__(self, bucket: str):
        self.s3 = boto3.client('s3')
        self.bucket = bucket

    def store(self, url: str, content: bytes, metadata: dict) -> str:
        """Store crawled content."""
        # Generate key from URL hash
        url_hash = hashlib.sha256(url.encode()).hexdigest()
        key = f"crawl/{url_hash[:2]}/{url_hash[2:4]}/{url_hash}"

        # Compress content
        compressed = gzip.compress(content)

        self.s3.put_object(
            Bucket=self.bucket,
            Key=key,
            Body=compressed,
            Metadata={
                'url': url,
                'content-encoding': 'gzip',
                **metadata
            }
        )

        return key

    def retrieve(self, key: str) -> bytes:
        """Retrieve stored content."""
        response = self.s3.get_object(Bucket=self.bucket, Key=key)
        compressed = response['Body'].read()
        return gzip.decompress(compressed)

URL Metadata Storage¶

Use Cassandra for URL metadata (billions of rows):

from cassandra.cluster import Cluster

class URLMetadataStore:
    def __init__(self, keyspace: str):
        self.cluster = Cluster(['cassandra1', 'cassandra2', 'cassandra3'])
        self.session = self.cluster.connect(keyspace)

        # Prepared statements
        self.insert_stmt = self.session.prepare("""
            INSERT INTO urls (url_hash, url, last_crawl, content_hash, 
                              etag, status_code, crawl_count)
            VALUES (?, ?, ?, ?, ?, ?, ?)
        """)

        self.get_stmt = self.session.prepare("""
            SELECT * FROM urls WHERE url_hash = ?
        """)

    def save(self, url: str, crawl_result: dict):
        """Save URL metadata."""
        url_hash = hashlib.sha256(url.encode()).hexdigest()

        self.session.execute(self.insert_stmt, (
            url_hash,
            url,
            datetime.now(),
            crawl_result.get('content_hash'),
            crawl_result.get('etag'),
            crawl_result.get('status_code'),
            crawl_result.get('crawl_count', 1)
        ))

    def get(self, url: str) -> Optional[dict]:
        """Get URL metadata."""
        url_hash = hashlib.sha256(url.encode()).hexdigest()
        row = self.session.execute(self.get_stmt, (url_hash,)).one()
        return dict(row._asdict()) if row else None

Step 9: Multi-Language Implementations¶

Python, Java, and Go reference implementations for the URL frontier and crawler worker (including a concurrent Go crawler) live under §4.1 URL Frontier in Implementation.

Step 10: Monitoring and Metrics¶

Key Metrics¶

Metric	Description	Alert Threshold
`pages_crawled_per_second`	Throughput	Drop > 20%
`crawl_error_rate`	Failed fetches	> 5%
`queue_depth`	URLs waiting	Growing consistently
`avg_fetch_latency`	Time to download	> 5s
`robots_blocked_rate`	Blocked by robots.txt	-
`duplicate_rate`	Already-seen URLs	-

Implementation¶

import prometheus_client as prom

# Metrics
pages_crawled = prom.Counter(
    'crawler_pages_total',
    'Total pages crawled',
    ['status']
)

fetch_latency = prom.Histogram(
    'crawler_fetch_latency_seconds',
    'Time to fetch pages',
    buckets=[0.1, 0.5, 1, 2, 5, 10, 30]
)

queue_depth = prom.Gauge(
    'crawler_queue_depth',
    'URLs in queue'
)

# Usage
@fetch_latency.time()
async def fetch_page(url: str):
    result = await fetcher.fetch(url)
    pages_crawled.labels(status=result.status_code).inc()
    return result

Interview Checklist¶

Sample Interview Dialogue¶

Interviewer: "Design a web crawler."

You: "Interesting! Let me clarify a few things. What's the purpose—are we building a search engine crawler or something more focused? And what's the scale—thousands of pages or billions?"

Interviewer: "Let's say we're building a general web crawler for a search engine, targeting billions of pages."

You: "Got it. At that scale, we need a distributed system. Let me walk through the main components.

The core is the URL Frontier—it's not just a queue but a prioritized, politeness-aware system. We need to ensure we don't overwhelm any single website while maximizing overall throughput.

For politeness, we respect robots.txt and maintain per-host rate limits. I'd organize the frontier as per-host queues with a round-robin selector, so we spread requests across many hosts.

For scale, we'd partition the work by domain—hash each domain to a worker. This keeps politeness enforcement local to each worker and enables connection reuse.

Deduplication is critical at this scale. I'd use a Bloom filter for URL deduplication—for a billion URLs with 1% false positive rate, that's about 1.2GB of memory. For content deduplication, we'd use SimHash to detect near-duplicates.

Want me to dive deeper into any component?"

Summary¶

Component	Technology	Rationale
URL Frontier	Kafka + Redis	Distributed queue with prioritization
Politeness	Per-host queues	Respect rate limits
Fetching	aiohttp	Async, high concurrency
Deduplication	Bloom filter + SimHash	Memory-efficient
Content Storage	S3/HDFS	Scalable object storage
Metadata	Cassandra	Billions of rows, fast writes
Coordination	ZooKeeper	Leader election, config

A web crawler at scale is a fascinating distributed systems problem. The key challenges are politeness, deduplication, and freshness—and each requires careful design to handle billions of pages efficiently.

Design a Web Crawler¶

What We're Building¶

Why Web Crawlers Matter¶

The Scale Challenge¶

Step 1: Requirements Clarification¶

Questions to Ask¶

Functional Requirements¶

Non-Functional Requirements¶

Scope for This Design¶

Technology Selection & Tradeoffs¶

URL frontier¶

Content storage¶

URL deduplication¶

DNS resolution¶

HTML parsing¶

Our choice¶

CAP Theorem Analysis¶

How to frame it in an interview¶

Per-component view¶

Diagram¶

SLA and SLO Definitions¶

Service level indicators (SLIs) and objectives (SLOs)¶

Error budget policy¶

Database Schema¶

url_frontier (work queue + scheduling)¶

crawled_pages (fetch outcome + pointer to blob)¶

link_graph (optional, for ranking / debugging)¶

API Design¶

Overview¶

POST /v1/seeds¶

GET /v1/crawls/{crawl_job_id}¶

PUT /v1/hosts/{host}/politeness¶

GET /v1/results¶

Error model (consistent)¶

Step 2: Back-of-Envelope Estimation¶

Scale Assumptions¶

Traffic Estimation¶

Bandwidth Estimation¶

Storage Estimation¶

Infrastructure Sizing¶

Step 3: High-Level Architecture¶

Core Components¶

Crawl Workflow¶

Step 4: Component Deep Dive¶

4.1 URL Frontier¶

4.2 Politeness and robots.txt¶

4.3 Fetcher¶

4.4 Parser¶

4.5 Deduplication¶

4.6 DNS Resolution¶

Step 5: Distributed Architecture¶

Architecture¶

URL Distribution Strategy¶

Kafka for URL Queue¶

Handling Failures¶

Step 6: Freshness and Re-crawling¶

Change Detection¶

ETags and Conditional Requests¶

Step 7: Handling JavaScript-Rendered Content¶

Option 1: Headless Browser¶

Option 2: Prerendering Services¶

Step 8: Storage Design¶

Content Storage¶

URL Metadata Storage¶

Step 9: Multi-Language Implementations¶

Step 10: Monitoring and Metrics¶

Key Metrics¶

Implementation¶

Interview Checklist¶

Sample Interview Dialogue¶

Summary¶

`url_frontier` (work queue + scheduling)¶

`crawled_pages` (fetch outcome + pointer to blob)¶

`link_graph` (optional, for ranking / debugging)¶

`POST /v1/seeds`¶

`GET /v1/crawls/{crawl_job_id}`¶

`PUT /v1/hosts/{host}/politeness`¶

`GET /v1/results`¶