Building a Polite & Fast Web Crawler

In that order.

Dennis Schubert, an engineer at Mozilla and a noteworthy contributor to diaspora, the distributed, open-source social network, recently observed that 70% of the load on diaspora's servers was coming from poorly behaved bots feeding the LLMs of a few big outfits. The worst offenders, OpenAI and Amazon, accounted for about 40% of total traffic combined.

While there are zillions of articles on general crawling etiquette, there are scant few on how to actually abide by the rules while also crawling quickly and efficiently. I've been working on a hobby crawler recently, so the details are top of mind; let's take a look at a few that often get glossed over.

Technical Context

My queue is a table in Postgres. My crawler is written in Python and uses gevent for concurrency and asynchronous IO. (Having fallen for implicit async IO before Python 3, I never did like async/await.)
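
Not much of that shows up in the snippets below, but the gevent setup is worth a sketch: monkey-patch early so blocking socket calls become cooperative, and give psycopg2 a gevent-friendly wait callback (psycogreen is one way to do that; the exact plumbing here is illustrative rather than a copy of my code).

# Patch the standard library first so socket IO yields to the gevent hub.
from gevent import monkey
monkey.patch_all()

# psycopg2 is a C extension, so it needs its own cooperative wait callback.
from psycogreen.gevent import patch_psycopg
patch_psycopg()

import gevent
import requests  # now cooperative, since it sits on the patched socket module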

Rate Limiting

The bare minimum of polite crawling is rate limiting. If a site doesn't specify a crawl-delay in robots.txt, I default to one request every five seconds. If I get 429s, I slow down. It's not complicated, in principle.
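
In code, the per-domain delay is just a lookup with a fallback. A sketch using the standard library's parser (the user agent string is a placeholder, and fetching the robots.txt text happens elsewhere):

import urllib.robotparser

DEFAULT_DELAY = 5.0  # seconds between requests to a single domain
UA = 'my-hobby-crawler'  # placeholder user agent

def crawl_delay(robots_txt):
  # robots_txt: the raw text of a domain's robots.txt
  rp = urllib.robotparser.RobotFileParser()
  rp.parse(robots_txt.splitlines())
  delay = rp.crawl_delay(UA)
  return float(delay) if delay else DEFAULT_DELAY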

In practice, it's also not complicated. If you limit yourself to one fetch context (worker, thread, or coroutine) per domain, then it's as simple as using a local variable to track how long it's been since your last request.

Of course, this means you'll crawl large, robust sites quite slowly, but I'd rather start with a design that makes politeness the default behavior. So, I run one coroutine per domain, and I don't need a complicated queue that keeps track of how many items it's dispensing per domain per unit of time. I don't yet have more than one fetch worker (a single process/machine can handle ~10k domains in parallel), but when I do, I'll distribute domains to workers via a hashing scheme, sketched below.
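
The hashing scheme doesn't exist yet, but the idea is just a stable hash of the domain modulo the worker count, something like:

import hashlib

def worker_for(domain, n_workers):
  # Stable across processes and restarts, unlike the builtin hash().
  digest = hashlib.sha1(domain.encode()).digest()
  return int.from_bytes(digest[:4], 'big') % n_workers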

I keep my worker logic simple by pushing the rate-limiting logic down to just above the network layer, in a wrapper around my call to requests.get, which keeps track of the last request time in a coroutine-local value. Here's the heart of the fetch worker:

def fetch(url):
  log.info(lib.url.get_path(url))

  # Skip the network entirely if our stored copy is still fresh.
  if (doc := Doc.get(url)) and not Doc.should_fetch(doc):
    log.info('  still fresh, skipping')

  # Otherwise fetch (rate-limited further down, near the network layer) and store.
  elif doc := Doc.fetch(url):
    log.info('  fetched')
    Doc.upsert(doc)

  # Hand off to the process queue if we have a doc that needs (re)processing.
  if doc and Doc.should_process(doc):
    Q.process.nq(doc.url)
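
The wrapper itself isn't shown above, but it's short. A rough sketch of the idea (names and the back-off policy are illustrative), using a greenlet-local timestamp since each greenlet owns exactly one domain:

import time
import gevent
import gevent.local
import requests

_local = gevent.local.local()  # one namespace per greenlet, i.e. per domain

def polite_get(url, delay=5.0, **kw):
  # Sleep off whatever remains of this domain's delay before touching the network.
  last = getattr(_local, 'last_request', 0)
  wait = delay - (time.monotonic() - last)
  if wait > 0:
    gevent.sleep(wait)

  resp = requests.get(url, **kw)
  _local.last_request = time.monotonic()

  # A 429 means we're still too eager; back off extra for this domain.
  if resp.status_code == 429:
    gevent.sleep(delay * 2)

  return resp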

When a unique domain (partition) is added to the queue, Postgres fires off a NOTIFY event; the worker, which is listening for it, creates a new coroutine dedicated to that partition.
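
The listening side is ordinary psycopg2 LISTEN/NOTIFY plus a gevent spawn. Roughly (the channel name and the spawn_fetcher callback are placeholders, and the NOTIFY itself would come from a trigger or a pg_notify() call in the enqueue path):

import select
import gevent
import psycopg2
import psycopg2.extensions

def listen_for_domains(dsn, spawn_fetcher):
  conn = psycopg2.connect(dsn)
  conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
  cur = conn.cursor()
  cur.execute('LISTEN new_domain;')

  while True:
    # select() is cooperative under gevent's monkey-patching.
    if select.select([conn], [], [], 5) == ([], [], []):
      continue  # timeout, nothing to read
    conn.poll()
    while conn.notifies:
      note = conn.notifies.pop(0)
      # The payload is the new domain; give it its own coroutine.
      gevent.spawn(spawn_fetcher, note.payload)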

Respectful Enqueuing, Efficiently

Since robots.txt's Disallow rules are consulted before any URL is added to the queue, and enqueuing happens in batches that may span many different domains, I fetch all the relevant robots.txt files in parallel inside the queue's enqueue function. This way, each enqueue batch waits only about as long as its slowest single robots.txt GET before I'm ready to filter out disallowed URLs.
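
A sketch of that filtering step, assuming one robots.txt fetch per distinct domain in the batch (the user agent and error handling are simplified; a stricter crawler would treat 401/403 on robots.txt as disallow-all):

import urllib.parse
import urllib.robotparser
import gevent
import requests

UA = 'my-hobby-crawler'  # placeholder user agent

def fetch_robots(domain):
  rp = urllib.robotparser.RobotFileParser()
  try:
    resp = requests.get(f'https://{domain}/robots.txt', timeout=10)
    rp.parse(resp.text.splitlines() if resp.ok else [])
  except requests.RequestException:
    rp.parse([])  # unreachable robots.txt: treated here as allow-all
  return domain, rp

def filter_allowed(urls):
  domains = {urllib.parse.urlsplit(u).netloc for u in urls}
  # One GET per distinct domain, all in flight at once.
  jobs = [gevent.spawn(fetch_robots, d) for d in domains]
  gevent.joinall(jobs)
  robots = dict(job.value for job in jobs)
  return [u for u in urls
          if robots[urllib.parse.urlsplit(u).netloc].can_fetch(UA, u)]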

Minimize Refetching...

While also simple in principle, this is slightly more annoying in practice because multiple sources of data need to be consulted. Here's a sketch of Doc.should_fetch(), with a code sketch after the list:

  1. No if the last response contains an expires header value in the future.
  2. No if a conditional HEAD request (sent with the stored etag or an if-modified-since header) returns a 304.
  3. No if the HEAD response headers have a last-modified value that is older than the last visit (unlikely, but may happen if the server doesn't handle HEAD correctly).
  4. No if the last visit was reasonably recent by our own standards.
  5. Otherwise, yes!
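
In code it might look something like the following. The Doc fields (headers, etag, last_visit) and the RECENT window are illustrative, not my actual schema:

import datetime
import email.utils
import requests

RECENT = datetime.timedelta(days=7)  # "reasonably recent", by my own standards

def should_fetch(doc):
  now = datetime.datetime.now(datetime.timezone.utc)

  # 1. An explicit Expires in the future means the stored copy is still good.
  expires = doc.headers.get('expires')
  if expires:
    try:
      if email.utils.parsedate_to_datetime(expires) > now:
        return False
    except (TypeError, ValueError):
      pass  # unparseable Expires header: ignore it

  # 2 & 3. Conditional HEAD request using whatever validators we stored.
  # doc.last_visit is assumed to be a timezone-aware UTC datetime.
  cond = {}
  if doc.etag:
    cond['If-None-Match'] = doc.etag
  if doc.last_visit:
    cond['If-Modified-Since'] = email.utils.format_datetime(doc.last_visit, usegmt=True)
  if cond:
    head = requests.head(doc.url, headers=cond, allow_redirects=True, timeout=10)
    if head.status_code == 304:
      return False
    last_modified = head.headers.get('last-modified')
    if last_modified and doc.last_visit:
      try:
        if email.utils.parsedate_to_datetime(last_modified) < doc.last_visit:
          return False
      except (TypeError, ValueError):
        pass

  # 4. Our own freshness window.
  if doc.last_visit and now - doc.last_visit < RECENT:
    return False

  # 5. Otherwise, yes!
  return True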

...For Efficient Recrawling

While I was briefly tempted to call Doc.should_fetch() (or, perhaps, Doc.should_enqueue()) in the process worker to avoid adding empty work to the fetch queue, doing that check in the fetch worker has the benefit of keeping network IO (the potential HEAD request to check whether a document has been modified) out of the process worker, which lets the process worker saturate the CPU more efficiently.

It also keeps the relationship between the workers and queues simple. If the process worker wanted to skip the fetch queue for fresh docs, it would have to enqueue them in the process queue, which creates another entrypoint into the process queue, and the potential for duplicate logic.

And How

The result is a clear division of logic and resource utilization that crawls with respect for rate limits, freshness, and necessity.

This is far from comprehensive, but it seems head and shoulders above what some code monkeys making half a million $/year are doing. I'll report back when I have some impressive numbers to justify the click-baity title.