
THE COMMON CRAWL
PIPELINE.

Common Crawl runs monthly bulk crawls, archives billions of pages from the public web, and feeds most major AI training datasets. The lag between your content going live and showing up in a model trained on Common Crawl is typically 6-18 months. Here is the pipeline, end to end.

  • Crawl Cadence: Monthly
  • Snapshot Size: Petabytes
  • Training Lag: 6-18 months
  • Robots.txt: Respected

WHAT COMMON CRAWL IS.

Common Crawl is a non-profit that has been running bulk crawls of the public web since 2008, on a roughly monthly cadence in recent years. Each crawl is published as a snapshot of billions of pages in WARC (Web ARChive) format. The data is free, public, and downloadable by anyone.
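A WARC record is a block of plain-text headers followed by the captured payload. As an illustrative sketch only (not a full WARC parser; real snapshots are gzipped, multi-record files, and the sample record below is invented), here is how one record's headers can be pulled apart:

```python
# Minimal sketch of reading one WARC record's headers.
# SAMPLE_RECORD is a made-up example, not real crawl data.
SAMPLE_RECORD = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: https://example.com/\r\n"
    "WARC-Date: 2024-01-15T12:00:00Z\r\n"
    "Content-Length: 24\r\n"
    "\r\n"
    "<html>hello crawl</html>\r\n"
)

def parse_warc_headers(record: str) -> dict:
    """Split one WARC record into its header fields (everything before the blank line)."""
    head, _, _body = record.partition("\r\n\r\n")
    lines = head.split("\r\n")
    fields = dict(line.split(": ", 1) for line in lines[1:])
    fields["version"] = lines[0]  # e.g. "WARC/1.0"
    return fields

headers = parse_warc_headers(SAMPLE_RECORD)
print(headers["WARC-Target-URI"])  # https://example.com/
```

Real tooling (e.g. the `warcio` library) handles chunked, compressed archives; the point here is just that every crawled page carries its URL, timestamp, and raw HTML in one record.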

Most major foundation models are trained partly on Common Crawl. C4 (used by Google's T5 and many derivatives), RefinedWeb (used by Falcon), and FineWeb (used by HuggingFace's open models) all derive from Common Crawl snapshots, and OpenAI has documented using filtered Common Crawl data in GPT-3's training mix.

If you are wondering whether "opting out of GPTBot" is enough to keep your content out of OpenAI's training data, the answer is: probably not. They also use Common Crawl, which uses CCBot. You have to opt out of CCBot too.

Finding 01.
The pipeline

CRAWL → DATASET → TRAINING.

The full pipeline has four stages:

  • Stage 1 - Crawl: CCBot crawls the public web monthly. Respects robots.txt. Stores raw HTML in WARC files.
  • Stage 2 - Filter: organisations like AllenAI, HuggingFace, and individual labs filter the WARC archive into curated datasets (C4 strips low-quality content; RefinedWeb deduplicates aggressively; FineWeb adds quality scoring).
  • Stage 3 - Train: AI labs use the filtered datasets as one input among many for foundation-model training. Training cycles run weeks to months.
  • Stage 4 - Deploy: trained models ship in products (ChatGPT, Claude, Gemini) months after the training cutoff date.

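The simplest piece of Stage 2 is exact deduplication: drop pages whose normalized text hashes identically. As a minimal sketch (real pipelines like RefinedWeb add fuzzy dedup such as MinHash on top, plus quality filters):

```python
import hashlib

def normalize(text: str) -> str:
    """Crude normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

def dedupe(pages: list[str]) -> list[str]:
    """Keep the first copy of each exact-duplicate page (after normalization).
    Illustrative only; production dedup is fuzzy and runs at petabyte scale."""
    seen: set[str] = set()
    unique = []
    for page in pages:
        digest = hashlib.sha256(normalize(page).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique

pages = ["Hello  World", "hello world", "Something else"]
print(len(dedupe(pages)))  # 2
```
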
Finding 02.
Training lag

WHEN DOES TODAY'S CONTENT SHOW UP?

Realistic timeline from "published" to "in a deployed model":

  • Day 0: content published.
  • Day 0-30: next monthly Common Crawl snapshot picks it up (assuming CCBot is allowed).
  • Day 30-90: snapshot processed into derivative datasets (C4, RefinedWeb, FineWeb).
  • Day 90-365: AI lab incorporates dataset into next training cycle.
  • Day 180-540: trained model ships in production. Users start seeing your content cited.
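The per-stage windows above compose into the headline 6-18 month figure. A small sketch (stage durations are assumptions read off the timeline, not published numbers):

```python
# Per-stage lag in days (lo, hi) - assumed from the timeline above.
STAGES = [
    ("crawl snapshot", 0, 30),    # next monthly snapshot picks it up
    ("dataset build",  30, 60),   # snapshot -> C4/RefinedWeb/FineWeb
    ("training cycle", 60, 275),  # incorporated into a training run
    ("deployment",     90, 175),  # trained model ships in products
]

def total_window(stages):
    """Sum the best- and worst-case lags across all stages."""
    earliest = sum(lo for _, lo, _ in stages)
    latest = sum(hi for _, _, hi in stages)
    return earliest, latest

earliest, latest = total_window(STAGES)
print(f"{earliest}-{latest} days (~{earliest / 30:.0f}-{latest / 30:.0f} months)")
# 180-540 days (~6-18 months)
```
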

THIS IS WHY FRESH CONTENT MATTERS DIFFERENTLY.

The Common Crawl path applies to training-data inclusion. It does not apply to live web search. ChatGPT, Claude, and Perplexity all run real-time web searches that bypass training data entirely - that is the role of OAI-SearchBot, Claude-SearchBot, and PerplexityBot.
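You can see which path is hitting your site by classifying user agents in your access logs. The bot names below are the published ones; the sample log lines are invented for illustration:

```python
# Classify crawler hits by user-agent substring.
# Search bots are checked first so a search bot is never
# mistaken for its vendor's training bot.
TRAINING_BOTS = ("CCBot", "GPTBot", "ClaudeBot")
SEARCH_BOTS = ("OAI-SearchBot", "Claude-SearchBot", "PerplexityBot")

def classify(user_agent: str) -> str:
    if any(bot in user_agent for bot in SEARCH_BOTS):
        return "search"
    if any(bot in user_agent for bot in TRAINING_BOTS):
        return "training"
    return "other"

hits = [  # invented sample log entries
    "Mozilla/5.0 (compatible; CCBot/2.0; +https://commoncrawl.org/faq/)",
    "Mozilla/5.0; OAI-SearchBot/1.0; +https://openai.com/searchbot",
    "Mozilla/5.0 (Windows NT 10.0) ordinary browser",
]
print([classify(ua) for ua in hits])  # ['training', 'search', 'other']
```
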

So you have two parallel paths to AI visibility:

  • Training-data inclusion: CCBot → Common Crawl → derivative datasets → model weights. Slow (6-18 months), but it shapes how models reason about your domain.
  • Real-time search citation: OAI-SearchBot, Claude-SearchBot, and PerplexityBot fetch your pages at query time. Immediate, and governed by separate robots.txt rules.

OPTING OUT.

If you want to keep your content out of training data, you need to block at multiple layers:

  • Layer 1 - CCBot: blocking it keeps you out of future Common Crawl snapshots, and with them every dataset derived from those snapshots.
  • Layer 2 - lab-specific crawlers: GPTBot (OpenAI), ClaudeBot (Anthropic), and Google-Extended (Google) collect training data directly, independent of Common Crawl.
  • Layer 3 - existing snapshots: robots.txt is not retroactive. Content already captured in published snapshots and derivative datasets generally stays there.
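A minimal sketch of a robots.txt that opts out of training crawlers while leaving a real-time search crawler allowed, verified with Python's stdlib robotparser (the bot names are the published ones; adapt the policy to your own site):

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: block training crawlers, allow search crawlers.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for bot in ("CCBot", "GPTBot", "OAI-SearchBot"):
    print(bot, parser.can_fetch(bot, "https://example.com/post"))
# CCBot False
# GPTBot False
# OAI-SearchBot True
```
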

THE BOTTOM LINE.

Common Crawl is the substrate underneath most AI training. CCBot is the ingestion point. The pipeline takes 6-18 months from crawl to deployed model. If your content is open to crawling, it will eventually shape how AI models reason about your domain. If it is blocked, you opt out of that surface, while real-time web-search citation remains a separate channel with its own policy.

Two paths, two policies. Decide on each independently.

Stop Guessing What AI Sees

MEASURE THE LEVERS
THAT ACTUALLY EXIST.

If you want this methodology applied to your specific site - your real logs, your real citation data, your real fix list - the audit is the productized way to do it.