Will AI hamper our ability to crawl the web for useful data?

There was a flurry of coverage in August about OpenAI’s web crawler, GPTBot. Lots of articles were published with advice on how to block the crawler using robots.txt.

Some of these examples included instructions for blocking not just GPTBot but also CCBot, the crawler used by Common Crawl. And you can see that some websites, like the New York Times, are now blocking both of these bots, amongst others.
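For reference, the robots.txt rules those articles recommend look something like the sketch below. GPTBot and CCBot are the user-agent tokens the two crawlers publish; the rest is just a minimal illustration, and a real site would tailor the Disallow rules to its own needs.

```
# Block OpenAI's crawler
User-agent: GPTBot
Disallow: /

# Block Common Crawl's crawler
User-agent: CCBot
Disallow: /
```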

It’s perfectly reasonable for website owners to want to block crawlers. But I’m also sympathetic to the need for archiving and transparency, which in some situations should take precedence over commercial and political preferences. There is an inevitable tussle between archives and website owners.

What’s interested me about sites starting to block Common Crawl is that I’m pretty sure that, for some time now, their dataset has been the best option for anyone needing a decent-sized web crawl for uses that don’t involve AI. For example, as a means to discover and analyse structured data from the web for other purposes.

The Common Crawl website has some examples of how the crawl has been used, including research on phishing website detection and internet censorship.

It’s just not feasible for many small startups, not-for-profits or researchers to crawl sizeable chunks of the web. Common Crawl offered one of the few options available, perhaps the only one.
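To make that concrete, here’s a minimal sketch of the kind of lookup a small team can do against the existing crawl instead of fetching pages themselves. It uses the public CDX index API at index.commoncrawl.org; the crawl label (CC-MAIN-2023-40) and the example.com query are just placeholders.

```python
# A rough sketch of using Common Crawl without running your own crawler:
# query the public CDX index for captures of a domain, then fetch one
# capture from the relevant WARC file with a byte-range request.
import gzip
import io
import json

import requests

INDEX = "https://index.commoncrawl.org/CC-MAIN-2023-40-index"

# Ask the index for captures of a site (JSON, one record per line).
resp = requests.get(INDEX, params={"url": "example.com/*", "output": "json"})
records = [json.loads(line) for line in resp.text.splitlines() if line]

# Each record says which WARC file holds the capture, and where in it.
record = records[0]
start = int(record["offset"])
end = start + int(record["length"]) - 1

warc_url = "https://data.commoncrawl.org/" + record["filename"]
chunk = requests.get(warc_url, headers={"Range": f"bytes={start}-{end}"})

# The slice is a gzipped WARC record: WARC headers, HTTP headers, then HTML.
with gzip.open(io.BytesIO(chunk.content)) as fh:
    print(fh.read(2000).decode("utf-8", errors="replace"))
```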

But the Common Crawl data is now being extensively used to train Large Language Models. The UK Government report on Foundational AI models mentions a couple of the training datasets derived from it, and I’m sure there are many others.

The report highlights that access to web crawl data is increasingly important in AI development and notes that “[existing] search engine providers may have an advantage in obtaining higher quality web crawl data because …their crawlers are less likely to be rate-limited or blocked by website owners that want to be discovered and appear on search results pages” (PDF, page 29).

But with websites starting to block Common Crawl, other uses of its data are being blocked too. This further advantages existing search engine providers.

Common Crawl themselves aren’t really helping here. Look at how their website changed between August 28th and August 29th 2023. There’s a real change in messaging.

For example, their original “What We Do” page said that “Everyone should have the opportunity to indulge their curiosities, analyze the world and pursue brilliant ideas. Small startups or even individuals can now access high quality crawl data that was previously only available to large search engine corporations”. Now they are leaning into their pivotal involvement in LLM development.

From an impact-reporting point of view, that makes sense. But it reads a bit tone deaf in the face of the broader concerns around LLMs.

Setting aside the legality of what can or can’t be done with web crawl data, who gets to define what is acceptable, and whether a corpus like this can be “open”, I think this is a useful example of two things: how concerns about one (mis)use of data (training LLMs) may end up harming our ability to do other beneficial things (combat censorship), and how we design and govern these kinds of data institutions so that they serve the needs of all of their users.

There are no easy answers here. But like licensing, robots.txt is a blunt instrument.