
THE COMMON CRAWL
PIPELINE.

Common Crawl runs monthly bulk crawls, archives billions of pages from the public web, and feeds most major AI training datasets. The lag between your content going live and showing up in a model trained on Common Crawl is typically 6-18 months. Here is the pipeline, end to end.

  • Crawl Cadence: Monthly
  • Snapshot Size: Petabytes
  • Training Lag: 6-18 months
  • Robots.txt: Respected

WHAT COMMON CRAWL IS.

Common Crawl is a non-profit that has been running bulk crawls of the public web since 2008, on a roughly monthly cadence in recent years. Each crawl is published as a snapshot of billions of pages in WARC (Web ARChive) format. The data is free, public, and downloadable by anyone.
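A WARC record is a block of plain-text headers followed by the captured payload. As an illustrative sketch only (not a full WARC parser; real snapshots are gzipped, multi-record files, and the sample record below is invented), here is how one record's headers can be pulled apart:

```python
# Minimal sketch of reading one WARC record's headers.
# SAMPLE_RECORD is a made-up example, not real crawl data.
SAMPLE_RECORD = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: https://example.com/\r\n"
    "WARC-Date: 2024-01-15T12:00:00Z\r\n"
    "Content-Length: 24\r\n"
    "\r\n"
    "<html>hello crawl</html>\r\n"
)

def parse_warc_headers(record: str) -> dict:
    """Split one WARC record into its header fields (everything before the blank line)."""
    head, _, _body = record.partition("\r\n\r\n")
    lines = head.split("\r\n")
    fields = dict(line.split(": ", 1) for line in lines[1:])
    fields["version"] = lines[0]  # e.g. "WARC/1.0"
    return fields

headers = parse_warc_headers(SAMPLE_RECORD)
print(headers["WARC-Target-URI"])  # https://example.com/
```

Real tooling (e.g. the `warcio` library) handles chunked, compressed archives; the point here is just that every crawled page carries its URL, timestamp, and raw HTML in one record.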

Most major foundation models are trained partly on Common Crawl. C4 (used by Google's T5 and many derivatives), RefinedWeb (used by Falcon), and FineWeb (used by HuggingFace's open models) all derive from Common Crawl snapshots, and OpenAI has documented using filtered Common Crawl data in GPT-3's training mix.

If you are wondering whether "opting out of GPTBot" is enough to keep your content out of OpenAI's training data, the answer is: probably not. They also use Common Crawl, which uses CCBot. You have to opt out of CCBot too.

Finding 01.
The pipeline

CRAWL → DATASET → TRAINING.

The full pipeline has four stages:

  • Stage 1 - Crawl: CCBot crawls the public web monthly. Respects robots.txt. Stores raw HTML in WARC files.
  • Stage 2 - Filter: organisations like AllenAI, HuggingFace, and individual labs filter the WARC archive into curated datasets (C4 strips low-quality content; RefinedWeb deduplicates aggressively; FineWeb adds quality scoring).
  • Stage 3 - Train: AI labs use the filtered datasets as one input among many for foundation-model training. Training cycles run weeks to months.
  • Stage 4 - Deploy: trained models ship in products (ChatGPT, Claude, Gemini) months after the training cutoff date.

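The simplest piece of Stage 2 is exact deduplication: drop pages whose normalized text hashes identically. As a minimal sketch (real pipelines like RefinedWeb add fuzzy dedup such as MinHash on top, plus quality filters):

```python
import hashlib

def normalize(text: str) -> str:
    """Crude normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

def dedupe(pages: list[str]) -> list[str]:
    """Keep the first copy of each exact-duplicate page (after normalization).
    Illustrative only; production dedup is fuzzy and runs at petabyte scale."""
    seen: set[str] = set()
    unique = []
    for page in pages:
        digest = hashlib.sha256(normalize(page).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique

pages = ["Hello  World", "hello world", "Something else"]
print(len(dedupe(pages)))  # 2
```
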
Finding 02.
Training lag

WHEN DOES TODAY'S CONTENT SHOW UP?

Realistic timeline from "published" to "in a deployed model":

  • Day 0: content published.
  • Day 0-30: next monthly Common Crawl snapshot picks it up (assuming CCBot is allowed).
  • Day 30-90: snapshot processed into derivative datasets (C4, RefinedWeb, FineWeb).
  • Day 90-365: AI lab incorporates dataset into next training cycle.
  • Day 180-540: trained model ships in production. Users start seeing your content cited.
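The per-stage windows above compose into the headline 6-18 month figure. A small sketch (stage durations are assumptions read off the timeline, not published numbers):

```python
# Per-stage lag in days (lo, hi) - assumed from the timeline above.
STAGES = [
    ("crawl snapshot", 0, 30),    # next monthly snapshot picks it up
    ("dataset build",  30, 60),   # snapshot -> C4/RefinedWeb/FineWeb
    ("training cycle", 60, 275),  # incorporated into a training run
    ("deployment",     90, 175),  # trained model ships in products
]

def total_window(stages):
    """Sum the best- and worst-case lags across all stages."""
    earliest = sum(lo for _, lo, _ in stages)
    latest = sum(hi for _, _, hi in stages)
    return earliest, latest

earliest, latest = total_window(STAGES)
print(f"{earliest}-{latest} days (~{earliest / 30:.0f}-{latest / 30:.0f} months)")
# 180-540 days (~6-18 months)
```
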

THIS IS WHY FRESH CONTENT MATTERS DIFFERENTLY.

The Common Crawl path applies to training-data inclusion. It does not apply to live web search. ChatGPT, Claude, and Perplexity all run real-time web searches that bypass training data entirely - that is the role of OAI-SearchBot, Claude-SearchBot, and PerplexityBot.
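You can see which path is hitting your site by classifying user agents in your access logs. The bot names below are the published ones; the sample log lines are invented for illustration:

```python
# Classify crawler hits by user-agent substring.
# Search bots are checked first so a search bot is never
# mistaken for its vendor's training bot.
TRAINING_BOTS = ("CCBot", "GPTBot", "ClaudeBot")
SEARCH_BOTS = ("OAI-SearchBot", "Claude-SearchBot", "PerplexityBot")

def classify(user_agent: str) -> str:
    if any(bot in user_agent for bot in SEARCH_BOTS):
        return "search"
    if any(bot in user_agent for bot in TRAINING_BOTS):
        return "training"
    return "other"

hits = [  # invented sample log entries
    "Mozilla/5.0 (compatible; CCBot/2.0; +https://commoncrawl.org/faq/)",
    "Mozilla/5.0; OAI-SearchBot/1.0; +https://openai.com/searchbot",
    "Mozilla/5.0 (Windows NT 10.0) ordinary browser",
]
print([classify(ua) for ua in hits])  # ['training', 'search', 'other']
```
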

So you have two parallel paths to AI visibility:

  • Training-data inclusion: CCBot → Common Crawl → derivative datasets → model weights. Slow (6-18 months), but it shapes how models reason about your domain.
  • Real-time search citation: OAI-SearchBot, Claude-SearchBot, and PerplexityBot fetch your pages at query time. Immediate, and governed by separate robots.txt rules.

OPTING OUT.

If you want to keep your content out of training data, you need to block at multiple layers:

  • Layer 1 - CCBot: blocking it keeps you out of future Common Crawl snapshots, and with them every dataset derived from those snapshots.
  • Layer 2 - lab-specific crawlers: GPTBot (OpenAI), ClaudeBot (Anthropic), and Google-Extended (Google) collect training data directly, independent of Common Crawl.
  • Layer 3 - existing snapshots: robots.txt is not retroactive. Content already captured in published snapshots and derivative datasets generally stays there.
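A minimal sketch of a robots.txt that opts out of training crawlers while leaving a real-time search crawler allowed, verified with Python's stdlib robotparser (the bot names are the published ones; adapt the policy to your own site):

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: block training crawlers, allow search crawlers.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for bot in ("CCBot", "GPTBot", "OAI-SearchBot"):
    print(bot, parser.can_fetch(bot, "https://example.com/post"))
# CCBot False
# GPTBot False
# OAI-SearchBot True
```
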

THE BOTTOM LINE.

Common Crawl is the substrate underneath most AI training. CCBot is the ingestion point. The pipeline takes 6-18 months from crawl to deployed model. If your content is open to crawling, it will eventually shape how AI models reason about your domain. If it is blocked, you opt out of that surface, while real-time web-search citation remains a separate channel with its own policy.

Two paths, two policies. Decide on each independently.

Stop Guessing What AI Sees

MEASURE THE LEVERS
THAT ACTUALLY EXIST.

If you want this methodology applied to your specific site - your real logs, your real citation data, your real fix list - the audit is the productized way to do it.