What is Common Crawl?

Answer

Common Crawl is the largest publicly available web corpus, with petabytes of crawled web data dating back to 2008. It is a documented training data source for major LLMs such as OpenAI's GPT-3 and Meta's LLaMA, and datasets derived from it are used widely across the field. Sites that allow Common Crawl's bot (CCBot) in robots.txt improve their chances of being known to future AI models.

Why Common Crawl matters

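When OpenAI trained GPT-3, filtered Common Crawl was the largest single component of the training mix. Meta's original LLaMA likewise drew most of its training tokens from Common Crawl. Other vendors, including Anthropic and Google, publish less detail about their training data, but Common Crawl and datasets derived from it (such as C4) are used widely. A site that blocks CCBot is absent from new Common Crawl snapshots, and so from any training set that relies on them.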

What to do

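Ensure CCBot is allowed in robots.txt: a "User-agent: CCBot" line followed by "Allow: /" on the next line. Note that a blanket "User-agent: *" group with "Disallow: /" also blocks CCBot unless a CCBot-specific group overrides it. The Readiness Check verifies this. Beyond that, structure your content so it parses cleanly in Common Crawl's WARC archives (clean HTML, semantic structure, server-side rendering).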
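
The robots.txt rule can be checked locally before deploying. The sketch below uses Python's standard-library urllib.robotparser; the helper name ccbot_allowed, the sample rule sets and the example.com URL are illustrative assumptions, not part of the Readiness Check.

```python
from urllib.robotparser import RobotFileParser


def ccbot_allowed(robots_txt: str, url: str = "https://example.com/") -> bool:
    """Return True if the given robots.txt text lets CCBot fetch `url`.

    Hypothetical helper for illustration; example.com is a placeholder.
    """
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", url)


# Explicitly allows CCBot everywhere.
ALLOWING = """\
User-agent: CCBot
Allow: /
"""

# Blocks every crawler, CCBot included, because no CCBot group overrides it.
BLOCKING_ALL = """\
User-agent: *
Disallow: /
"""

# Blocks everyone by default but carves out an exception for CCBot.
ALLOWING_ONLY_CCBOT = """\
User-agent: *
Disallow: /

User-agent: CCBot
Allow: /
"""

if __name__ == "__main__":
    print(ccbot_allowed(ALLOWING))             # True
    print(ccbot_allowed(BLOCKING_ALL))         # False
    print(ccbot_allowed(ALLOWING_ONLY_CCBOT))  # True
```

To test a live site, paste the contents of its /robots.txt into the function. This checks only the robots.txt rules; it does not confirm that Common Crawl has actually captured the pages.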

Cost of blocking

Hard to overstate. Each new generation of LLMs trained on Common Crawl is another wave of AI models and agents that may not know about your brand. The marginal cost of allowing CCBot is essentially zero. The downstream cost of blocking is years of accumulated invisibility.
