
System Design – Building Large Scale Web Crawler

November 29, 2025 by varunshrivastava

This article walks through how to approach a Web Crawler system design problem in interviews, including scale estimation, architectural choices, and trade-offs.

This system design problem comes up quite a lot in interviews, and for a reason: this seemingly simple, high-level problem comes in many different flavours and has the potential to go deeper and deeper, getting harder as it goes. On the surface, a web crawler looks simple: given a bunch of seed URLs, the crawler visits each URL and uses breadth-first traversal to explore all the other URLs up to a max depth of N. But in reality it is much more nuanced than this.

A web crawler, also known as a robot or spider, is used by search engines to discover new content on the web. That content is not necessarily a web page; it could be an image, audio, a PDF, or some other file.

A web crawler usually starts with a few seed URLs, collects more URLs by visiting those pages, and so on.

The complexity of a web crawler depends on the scale we intend to support. It could be quite simple for a school-level project that takes a few hours to build, or it could be at a gigantic scale, like Google or Microsoft, requiring continuous development by a dedicated engineering team. In this article we will discuss how to build a distributed, fault-tolerant web crawler.

Table of Contents

  • Problem Statement
  • Approach
    • My Approach
    • Back of the Envelope Calculation
    • Propose High-Level Design and Get a Buy In
    • Dive Deep into Individual Component Design
      • URL Job Processor Flow
      • Jobs DB Schema Design
    • Conclusion

Problem Statement

Design a service which receives input as list of URLs, scrapes those URLs for links to other pages and references to
images (i.e. the src component of img tags), then returns a mapping of page URLs to a list of image URLs.

  • Your service does not need to download and store the images.
  • Your service should follow links to other pages from the original submitted pages, and return the images on those 2/3/nth
    level pages as if they were on the first level page.

The API contract is defined as:

  • POSTing to /jobs with a body of a JSON array of URLs to start scraping from (e.g. ["https://google.com", "https://www.statuspage.io"]) should return a job identifier of some kind.
  • GETing /jobs/:job_id/status with the returned job identifier should return a JSON object of the format of {"completed": x, "in_progress": y} where x is the number of original URLs which have been completely crawled and y is the number of original URLs which are still being crawled.
  • GETing /jobs/:job_id/results with the returned job identifier should return a JSON object returning a mapping of original URL to all reachable images from that original URL, in the format of:
{
    "https://google.com": [
      "https://google.com/images/logo_sm_2.gif",
      "https://google.com/images/warning.gif"
    ],
    "https://www.statuspage.io": [
      "https://statuspage.io/logo.png",
      "https://statuspage.io/other-logo.png"
    ]
}
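Behind these three endpoints sits some simple job bookkeeping. Here is a minimal in-memory sketch of that state (Python; `JobStore` and its method names are illustrative, not part of the original contract):

```python
import uuid

class JobStore:
    """In-memory bookkeeping behind /jobs, /jobs/:id/status, /jobs/:id/results."""

    def __init__(self):
        # job_id -> {original_url: {"done": bool, "images": [...]}}
        self.jobs = {}

    def submit(self, urls):
        """POST /jobs: register the original URLs, return a job identifier."""
        job_id = str(uuid.uuid4())
        self.jobs[job_id] = {u: {"done": False, "images": []} for u in urls}
        return job_id

    def record(self, job_id, url, images):
        """Called by a worker when an original URL is fully crawled."""
        entry = self.jobs[job_id][url]
        entry["images"].extend(images)
        entry["done"] = True

    def status(self, job_id):
        """GET /jobs/:id/status."""
        states = self.jobs[job_id].values()
        done = sum(1 for s in states if s["done"])
        return {"completed": done, "in_progress": len(states) - done}

    def results(self, job_id):
        """GET /jobs/:id/results."""
        return {u: s["images"] for u, s in self.jobs[job_id].items()}

store = JobStore()
jid = store.submit(["https://google.com", "https://www.statuspage.io"])
store.record(jid, "https://google.com", ["https://google.com/images/logo_sm_2.gif"])
print(store.status(jid))  # {'completed': 1, 'in_progress': 1}
```

This is of course a single-process toy; the rest of the article is about making exactly this bookkeeping distributed and fault tolerant.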

Approach

When you face this kind of problem statement in an interview, you should not jump to a solution straightaway. On the surface it looks like a web crawler problem, but it is not only a web crawler: there are nuances around the API for submitting URLs, fetching the status of a job, and retrieving its results.

All of these requirements point towards a job queuing and processing problem rather than just a web crawler.

Always take your time and start by discussing the non-functional requirements. This not only gives you some time to think about the problem, but also helps you calm your nerves while your brain works towards a solution.

I always try to come up with some random numbers to realize the scale of the problem and then go to a high-level design.

My Approach

I follow a six-step framework for system design problems:

flowchart LR
    RealizeScale["Realize the Scale"] --> DefineFeatures["Define High Level Features"]
    DefineFeatures --> HLD["Propose High Level Design and Get Buy-In"]
    HLD --> CoverFeatures["Cover Each Listed Feature"]
    CoverFeatures --> DiveDeep["Dive Deep"]
    DiveDeep --> PenDownTradeOffs["Pen Down Trade-Offs"]

Let me explain each step:

  1. Realize The Scale
    • Make sure you know at what scale you are designing this system
    • Come up with some numbers
    • Do back-of-the-envelope calculations
    • More often than not, these calculations come up later when you are discussing design trade-offs with your interviewer
  2. Define High Level Features
    • Always create a list of features that you will be tackling in the interview
    • Writing down the features helps you and the interviewer to be on the same page
    • Make sure your features are concrete and unambiguous.
    • Discuss with your interviewer to prioritise the features for the interview
    • Remember, you cannot cover everything
  3. Propose High Level Design and Get Buy In
    • Always start with the high level design first that covers or touches all your features broadly
    • After drafting your first HLD, make sure to get the buy in from your interviewer
    • Make sure you are driving the interview in the right direction as the interviewer expects
    • Do not dive straight into the nitty-gritty; take your interviewer's opinion on which sub-sections of your HLD to expand
  4. Cover Each Listed Feature
    • Make sure you have covered all the features broadly in your first HLD
    • This will give you more room to dive deep into specific sub-sections
  5. Dive Deep
    • You are the driver of the interview but take directions from your interviewer
    • Ask for their opinion as to which sub-section they want to explore first
    • Try to expand the sub-section in the same HLD; it's easier to continue your chain of thought when everything is in one place
    • Build up your solution step-by-step; do not leave anything ambiguous or unexplained.
  6. Pen Down Trade Offs
    • I always write the trade-offs right there on the diagram to make it evident
    • It often helps your interviewer to follow along easily as well and if they are not happy, they can point and challenge your design
    • This opens up opportunities to discuss both sides of the design
    • It takes a few seconds to write the trade-offs while you are explaining your design which can help you and the interviewer a lot
    • It always goes in your favour when everything is written on the board

I hope this framework will help you as well.

So, let’s begin with the back-of-the-envelope calculation.

Back of the Envelope Calculation

  • Assume 1 billion web pages are downloaded every month
  • QPS: 10^9 / (30 × 24 × 3600) ≈ 400 pages per second
  • Peak QPS: 2 × QPS = 800
  • Assume the average web page size is 500 KB
  • 1 billion pages × 500 KB = 500 TB storage per month

You don’t have to spend a lot of time on this. The above numbers are sufficient to realize the scale we are designing for.
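These numbers are quick to sanity-check in code:

```python
# Back-of-the-envelope check for the crawler's scale
pages_per_month = 1_000_000_000             # assume 1 billion pages/month
seconds_per_month = 30 * 24 * 3600

qps = pages_per_month / seconds_per_month
peak_qps = 2 * qps

avg_page_kb = 500                           # assumed average page size
storage_tb_per_month = pages_per_month * avg_page_kb / 10**9  # KB -> TB

print(round(qps))              # 386, i.e. roughly 400 pages per second
print(round(peak_qps))         # 772, i.e. roughly 800 at peak
print(storage_tb_per_month)    # 500.0 TB per month
```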

Propose High-Level Design and Get a Buy In

Once the requirements and features are clear, let’s begin with the high level design. Remember, keep it simple and explain your thought process to the interviewer. For this problem start with user and try to complete the end-to-end flow.

---
title: High Level Design
---
flowchart TD
    subgraph "Navigate to the website"
        direction LR
        user["User"] -->|"bemyaficionado.com"| dns["DNS"]
        dns -->|10.10.138.23| user
    end
    subgraph "Load Balancer"
        direction TB
        user --> lb["Load Balancer"]
        lb --> job_service_write["Job Service (write)"]
        lb --> job_service_read["Job Service (read)"]
    end
    subgraph "Submit Job"
        job_service_write --> |Publish Job| jobs_queue[["Jobs Queue"]]
        jobs_queue -->|ACK| job_service_write
        job_service_write -->|Job ID| lb
        jobs_queue -->|Poll Job| jobs_consumer["Job Consumer"]
    end

    job_service_read -->|get job status by Job ID| jobs_db_replicas
    job_service_read -->|get job result by Job ID| jobs_db_replicas
    
    subgraph "Jobs DB"
        jobs_consumer --> jobs_db[("Jobs DB")]
        jobs_db --> jobs_db_replicas[("Replicas")]
    end

    subgraph "URL Job Processor Flow"
    jobs_consumer -->|"publish individual url jobs"| url_frontier[["URL Frontier Queue"]]
    url_frontier -->|Deque<br>Job Url|url_job_processor["URL Job Processor"]
    url_job_processor -->|"Process each URL Job"| url_job_processor
    url_job_processor -->|"Publish next level<br>unique url jobs"| url_frontier
    end

    url_job_processor -->|"Create next level jobs"| jobs_db

This is a good starting point. It clearly illustrates how a request flows from the user to the servers, into the database, and back.

  • You will talk about the Load Balancer and how it distributes load among your job services.
  • Each job service is stateless so it can scale horizontally.
  • For performance and reliability you have introduced a Jobs Queue to hold the jobs.
    • This is a good time to discuss which queue you would prefer. I usually go with Kafka because it also acts as storage, with retention for your messages and quite a few features you can call out in your design if needed.
  • You can talk about the type of database you would choose. In this case, I prefer a relational database to store the jobs because it provides transactional capabilities, which the job workers might use when processing individual URLs.
  • You also draw the read replicas; however, this may change based on the problem. Here, you can talk about the eventual consistency between writes and reads on the jobs table.
  • Another important bit here – DO NOT put too much detail in the HLD, for example in the URL Job Processor Flow. Here I just listed the high-level components and flow and didn’t go deeper into the individual components, and that is for a reason. It helps you first establish the end-to-end understanding and get a buy-in on the high-level component design.

As you can see, we have broadly touched upon all the features that we listed at the beginning.

Now, immediately after you have got the buy-in, start expanding on the important components. Here the most important component that we have not expanded is the URL Job Processor Flow, so let’s expand on that in the deep-dive.

Dive Deep into Individual Component Design

URL Job Processor Flow

This is quite an important piece of the puzzle; this is where you complete the entire story:

  • How are you going to process each URL and expand through the levels?
  • What algorithm will you use for traversing the web pages, and why?
  • How will you perform the de-duplication logic on URLs and content?
  • What data will be stored in the database for it to continue processing the next set of URLs? etc.
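The traversal question is usually answered with breadth-first search, as mentioned in the introduction. A minimal single-process sketch of that loop (Python; `fetch_links` is a hypothetical stand-in for real HTTP fetching and HTML parsing):

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_depth):
    """Breadth-first crawl from the seed URLs, up to max_depth levels deep.

    fetch_links(url) -> list of URLs found on that page (a stand-in for a
    real HTTP fetch plus HTML parsing).
    """
    seen = set(seed_urls)                     # URL-level de-duplication
    queue = deque((url, 0) for url in seed_urls)
    visited_order = []
    while queue:
        url, depth = queue.popleft()
        visited_order.append(url)
        if depth >= max_depth:                # stop expanding past max depth
            continue
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited_order

# Toy link graph standing in for the web
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(crawl(["a"], lambda u: graph.get(u, []), max_depth=2))
# -> ['a', 'b', 'c', 'd']
```

BFS fits here because it explores level by level, which matches the "return images on 2/3/nth level pages" requirement and gives each URL a natural depth for the max-depth cutoff.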

So, let’s dive in.

---
title: Job Processor Workflow
---
flowchart TD
    job_worker["Job Worker"] -->|"publish job url"| url_frontier["URL Frontier"]
    url_frontier -->|"poll job url"| job_processor["Job URL Worker"]
    job_processor -->|"check url"| check_duplicate_url{"has seen?"}
    subgraph "Internal Workflow"
    direction LR
        check_duplicate_url -->|"No"| download_html[["Download HTML"]]
        check_duplicate_url -->|"Yes"| mark_job_complete
        download_html -->|"check content"| is_content_duplicate{"has seen?"}
        is_content_duplicate --> |"No"| enqueue_url[["Enqueue URL"]]
        is_content_duplicate --> |"Yes"| mark_job_complete[["Mark Job Complete"]]
    end

    enqueue_url -->|publish url| url_frontier
    mark_job_complete --> |"update job status"| job_url_db[("Job URL DB")]
  • Job Worker publishes the Job Url onto the URL Frontier which is a Kafka Queue
  • Job URL Worker dequeues/polls the queue to fetch the latest job entries and starts processing each URL.
    • This is where the concept of Kafka Consumer Group can help you expand on the parallelisation of the work with multiple consumers.
    • There could be N consumers, each processing a different URL.
    • The N is decided based on the throughput of the system.
  • URL Worker first checks if this URL has been seen or not.
    • This is where you discuss the data structure or mini-architecture for de-duplicating the URLs for the job.
    • I would suggest going with a cache design because it requires fast insertion and checks, and its lifecycle lasts only while the job is IN_PROGRESS.
  • The Download HTML step downloads the page at the URL fetched from the queue and passes the content on for de-duplication.
    • This is where you discuss the hashing strategy for identifying duplicates, because you cannot compare content character by character across your entire database.
  • Enqueue the next level of URL jobs onto the queue so that the process continues until it has reached the max depth N or has run out of URLs to process.
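The two "has seen?" checks in this flow can be sketched as a single worker step. This is a deliberately simplified, in-process sketch (Python): in the real design `seen_urls` and `seen_hashes` would live in a shared cache rather than local sets, `download` would be a real HTTP fetch, and link extraction would use a proper HTML parser instead of a regex:

```python
import hashlib
import re

seen_urls, seen_hashes = set(), set()

def process(url, download, enqueue, mark_complete):
    """One frontier poll: URL dedup -> download -> content dedup -> enqueue links."""
    if url in seen_urls:                       # "has seen?" check on the URL
        mark_complete(url)
        return
    seen_urls.add(url)
    html = download(url)
    digest = hashlib.sha256(html.encode()).hexdigest()
    if digest in seen_hashes:                  # "has seen?" check on the content
        mark_complete(url)                     # duplicate page reached via another URL
        return
    seen_hashes.add(digest)
    for link in re.findall(r'href="([^"]+)"', html):
        enqueue(link)                          # next-level unique URL jobs
    mark_complete(url)
```

Processing the same URL twice, or two URLs serving identical HTML, enqueues the child links only once; every poll still ends by marking its job URL complete so the status counts stay accurate.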

If you can discuss the above flow with your interviewer, I hope they will appreciate the detail and the effort you took to paint it out.

The next missing piece of the puzzle is the Jobs DB schema. We have been talking about Jobs and individual URL Jobs for a while now, so this is the best time to expand on the database schema for your tables.

Jobs DB Schema Design

Think carefully about the table design in order to serve the requirement: the user must see how many URLs have been completed and how many are pending. This hints at keeping a status per URL submitted by the user.

This means we need at least two tables to keep track of all the URLs discovered by traversing the initial seed URLs. We also need to maintain a max depth for the recursion; otherwise the job could run on for a very long time. All the initial request details go into the Job table, and all the dynamic, individual URL job details are stored in the JobUrl table.

(please bear with my noob naming skills)

Let’s start designing our table schema:

erDiagram
    Job {
        job_id string PK
        status enum "IN_PROGRESS | COMPLETED | FAILED"
        max_depth integer
        created_at timestamp
        updated_at timestamp
    }
    
    JobUrl {
        url string PK
        url_type enum "URL | IMAGE"
        job_id string FK
        status enum "PENDING | IN_PROGRESS | COMPLETED | FAILED"
        depth integer
        parent_url_id integer FK
        created_at timestamp
        updated_at timestamp
    }
    
    Job ||--|{JobUrl: contains

A naive approach would be as follows:

  • The Jobs Service (read) can query the database every time to fetch the counts of completed and pending URLs for a given Job ID:
SELECT status, count(1) FROM job_urls
WHERE job_id = '{job_id}'
GROUP BY status

The above query will give you the required status that’s needed.

Status       Count
PENDING      0
FAILED       0
IN_PROGRESS  5
COMPLETED    5

However, this query-every-time solution is not scalable at all. It is still good to point out, though: even if not scalable, it gives us the confidence that the current table design can serve the required data.

Now, when your system encounters thousands of requests per second, hitting the main database for these small status queries doesn’t make sense. It can very quickly overload your database, make it slow, and increase your costs. Instead, think about introducing a caching layer that can return the result much faster and more cheaply.

At this point, I just realized that we did not take caching into consideration in our high-level design. That is completely normal; in fact, it is quite good, because now you can demonstrate your ability to identify bottlenecks and suggest improvements. It is always better if this comes from your side instead of the interviewer’s: it shows you are constantly thinking about the problem.

---
title: Update job status count in Cache
---
flowchart TD
    job_worker["Job Worker"] -->|"publish job url"| url_frontier["URL Frontier"]
    url_frontier -->|"poll job url"| job_processor["Job URL Worker"]
    job_processor -->|update job status count<br>IN_PROGRESS: 1<br>COMPLETED: 0| cache[("Redis Cache")]
    job_processor -.update job status.-> job_url_db
    job_processor -->|"check url"| check_duplicate_url{"has seen?"}
    
    subgraph "Internal Workflow"
    direction LR
        check_duplicate_url -->|"No"| download_html[["Download HTML"]]
        check_duplicate_url -->|"Yes"| mark_job_complete
        download_html -->|"check content"| is_content_duplicate{"has seen?"}
        is_content_duplicate --> |"No"| enqueue_url[["Enqueue URL"]]
        is_content_duplicate --> |"Yes"| mark_job_complete[["Mark Job Complete"]]
    end

    enqueue_url -->|publish url| url_frontier
    mark_job_complete --> |"update job status"| job_url_db[("Job URL DB")]
    mark_job_complete --> |update job status count<br>IN_PROGRESS: 0<br>COMPLETED: 1| cache


    linkStyle 2 stroke:green,stroke-width:5px;
    linkStyle 12 stroke:green,stroke-width:5px;

With this in place, the Job Service can directly query the cache with the Job ID to fetch the result:

---
title: Get Job Status
---
flowchart TD
    user -->|"GET /jobs/{job_id}/status"| job_service
    job_service["Job Service(read)"] -->cache
    cache --> job_service
    job_service -->|<br>IN_PROGRESS: 0<br>COMPLETED: 1| user["User"]

It would be good to quickly describe your key and value structure so there is no doubt about storage and retrieval.

The cache key-value scheme here can be quite straightforward. I would just suffix the job ID with the status. Example:

Key: {job_id}::{status}
Value: Integer

Example

{
    "job001::IN_PROGRESS": 10,
    "job001::COMPLETED": 5,
    "job001::CURR_DEPTH": 2
}
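With this scheme the workers only ever need atomic counter updates, which Redis handles natively (INCR/DECR). A sketch of the update and read paths (Python; a plain dict stands in for the Redis client so the logic stays visible – with a real client, e.g. `redis.Redis()`, the `+= 1` lines would become `r.incr(key)`):

```python
from collections import defaultdict

r = defaultdict(int)  # stand-in for a Redis client; INCR/DECR become += / -=

def start_url(job_id):
    """Worker picked up a URL job: bump the in-progress counter."""
    r[f"{job_id}::IN_PROGRESS"] += 1

def complete_url(job_id):
    """Worker finished a URL job: move one count from in-progress to completed."""
    r[f"{job_id}::IN_PROGRESS"] -= 1
    r[f"{job_id}::COMPLETED"] += 1

def job_status(job_id):
    """What the Job Service (read) returns for GET /jobs/{job_id}/status."""
    return {"completed": r[f"{job_id}::COMPLETED"],
            "in_progress": r[f"{job_id}::IN_PROGRESS"]}

for _ in range(3):
    start_url("job001")
complete_url("job001")
print(job_status("job001"))  # {'completed': 1, 'in_progress': 2}
```

Because each update is a single atomic increment, many workers can update the same job concurrently without transactions, which is exactly why this belongs in the cache rather than the main database.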

Conclusion

At this point we have covered almost everything in detail. Building a distributed, fault-tolerant web crawler is not an easy job, and it’s not feasible to cover or think of everything in a 60-minute interview. The best you can do is cover the high-level components and illustrate how they solve the problem, highlighting all the trade-offs as best you can. Remember, there’s no right or wrong answer in a system design interview. It’s all about the trade-offs.

Focus on how well you can express and communicate your point. If you have to write down pseudo-code for that, it’s fine too, as long as it helps you explain your point and express your thought process.

Before ending this article, I want to bring everything that we developed together in a single design so that the scale of the project is evident.

There are many other technical subjects I have not covered that you might want to discuss with your interviewer; these improvements would strengthen the technical rigour:

  • Content deduplication:
    I suggested hashing for content de-duplication but didn’t discuss:
    • which hash (MD5, SHA-256, SimHash?)
    • cost vs collision
    • trade-offs for HTML normalization (e.g., script/style stripping).
      This is a missed opportunity to show depth.
  • Crawl politeness: We didn’t discuss `robots.txt`, rate limiting, or per-domain queues – all relevant in real crawlers. Even if simplified for interviews, mentioning why you’re ignoring them strengthens credibility.
  • Failure semantics:
    • The design diagram includes a FAILED state, but we never explained what causes failure or how retries and poison messages are handled.
  • Job lifecycle:
    • What defines job completion?

Addressing these would significantly elevate the technical accuracy.

I hope this article gives you insights into tackling system design problems, especially ones related to web crawlers.

Until next time.
