
THE BOT LOG AUDIT.

The exact methodology I use in every audit I ship. Seven steps, three log sources, stdlib Python only. Reproducible end to end so you can audit your own site or verify mine.

  • Steps: 7
  • Log Sources: 3
  • Tools: stdlib
  • Time Budget: 15-20 hours

WHY BOT LOGS ARE THE SOURCE OF TRUTH.

Everything else - schema validators, rich-result tests, third-party AI visibility tools - is a proxy for what AI bots actually do on your site. Logs are the actual record of what happened. If your audit is not grounded in logs, you are guessing about behaviour you could measure.

The methodology below is what I run on every audit engagement. It is open, reproducible, and uses only Python stdlib. Anyone with log access can run it. The reason audits still cost $2,500 is that pulling 90 days of logs across 3 layers, parsing the AI user-agent set correctly against current and deprecated names, mapping the consumption hierarchy, and writing the action list takes 15-20 hours of focused work, plus the calibration against the broader 47-site benchmark dataset.

Step 01

GET THE LOGS.

Three layers in priority order:

  • CDN logs (Cloudflare Logpush, Fastly, Akamai). Most complete, includes blocked requests. Required for detecting silent blocks.
  • Server logs (nginx, Caddy, Apache). Captures everything that reached the origin. Useful when CDN logs are unavailable.
  • Application logs (your framework's request log). Only useful when CDN/server logs are unavailable. Often missing the user-agent or shaping the response in ways that obscure bot behaviour.
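A minimal parsing sketch for the server-log layer, assuming the nginx/Apache "combined" format; the regex and field names are assumptions to adjust for your own schema, and CDN logs (Cloudflare Logpush, Fastly) usually arrive as JSON lines, which you would load with json.loads instead.

import re

# Minimal parser for the nginx/Apache "combined" access-log format.
# Swap this layer for json.loads if your logs are JSON lines from a CDN.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_log(path):
    """Yield one dict per parseable request line; skip lines that do not match."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            match = LOG_PATTERN.match(line)
            if match:
                yield match.groupdict()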
Step 02

IDENTIFY AI USER-AGENTS.

Filter the logs against the current AI bot list (see article N. 10 for the active 2026 list). Critically: include both the named identifiers (GPTBot, ClaudeBot, etc.) AND the deprecated ones still in the wild (Claude-Web, anthropic-ai). The deprecated traffic is real and needs to be counted.

Verify a sample. Spoofed user-agents inflate the count. Run reverse-DNS or IP-range checks (article N. 14) on a 1% sample to estimate the spoof rate, then subtract the estimated spoofed share from the totals.
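A sketch of the filtering and spoof-rate estimate. The token list is indicative only (the maintained 2026 list lives in article N. 10), and the reverse-DNS suffixes are illustrative assumptions; a proper verification also does the forward-confirm lookup and the published IP-range check from article N. 14.

import random
import socket

# Indicative token list: current identifiers plus deprecated ones still seen
# in the wild. Pull the maintained list from article N. 10 before relying on it.
AI_BOT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",        # OpenAI
    "ClaudeBot", "Claude-User", "Claude-SearchBot",   # Anthropic, current
    "Claude-Web", "anthropic-ai",                     # Anthropic, deprecated
    "PerplexityBot", "Perplexity-User",               # Perplexity
    "Google-Extended", "Bytespider", "Applebot-Extended",
    "CCBot", "meta-externalagent",
]

def match_ai_bot(user_agent):
    """Return the matching bot token, or None for non-AI traffic."""
    ua = user_agent.lower()
    for token in AI_BOT_TOKENS:
        if token.lower() in ua:
            return token
    return None

def estimate_spoof_rate(records, sample_frac=0.01):
    """Reverse-DNS a small sample of AI-flagged hits to estimate the spoof rate.

    The suffix list below is an illustrative assumption; a real check also does
    the forward-confirm lookup and the IP-range comparison (article N. 14).
    """
    verified_suffixes = (".openai.com", ".anthropic.com", ".googlebot.com",
                         ".google.com", ".perplexity.ai", ".applebot.apple.com")
    ai_hits = [r for r in records if match_ai_bot(r["user_agent"])]
    if not ai_hits:
        return 0.0
    sample = random.sample(ai_hits, max(1, int(len(ai_hits) * sample_frac)))
    spoofed = 0
    for rec in sample:
        try:
            host = socket.gethostbyaddr(rec["ip"])[0]
        except OSError:
            spoofed += 1
            continue
        if not host.endswith(verified_suffixes):
            spoofed += 1
    return spoofed / len(sample)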

Step 03

GROUP BY BOT FAMILY.

Group requests by operator: OpenAI (3 bots), Anthropic (3 bots), Perplexity (2 bots), Google (Googlebot + Google-Extended), ByteDance, Apple, Microsoft, Common Crawl, Meta, etc.

This grouping matters because policy decisions are usually per-operator, not per-bot. "What does OpenAI's traffic on my site look like?" is the question that maps to a robots.txt change.
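A sketch of the per-operator grouping, reusing match_ai_bot from the Step 02 sketch; the operator map is an indicative subset to extend against whatever bot list you actually filtered on.

from collections import Counter, defaultdict

# Bot-to-operator map; an indicative subset, extend it to your full Step 02 list.
OPERATORS = {
    "GPTBot": "OpenAI", "OAI-SearchBot": "OpenAI", "ChatGPT-User": "OpenAI",
    "ClaudeBot": "Anthropic", "Claude-User": "Anthropic",
    "Claude-SearchBot": "Anthropic", "Claude-Web": "Anthropic",
    "anthropic-ai": "Anthropic",
    "PerplexityBot": "Perplexity", "Perplexity-User": "Perplexity",
    "Googlebot": "Google", "Google-Extended": "Google",
    "Bytespider": "ByteDance", "Applebot-Extended": "Apple",
    "bingbot": "Microsoft", "CCBot": "Common Crawl",
    "meta-externalagent": "Meta",
}

def group_by_operator(records):
    """Count AI requests per operator, keeping per-bot detail for the report."""
    per_operator = Counter()
    per_bot = defaultdict(Counter)
    for rec in records:
        bot = match_ai_bot(rec["user_agent"])   # from the Step 02 sketch
        if bot is None:
            continue
        operator = OPERATORS.get(bot, "Other")
        per_operator[operator] += 1
        per_bot[operator][bot] += 1
    return per_operator, per_bot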

Step 04

MAP TO URL PATTERNS.

For each bot family, count requests by URL pattern. Standard patterns:

  • /feed*, /rss*, /atom* - feed endpoints (article N. 01 found ~40% of AI traffic goes here)
  • /sitemap*.xml - sitemap endpoints (~14%)
  • /*.html or no extension - HTML pages (~25%)
  • /robots.txt, /llms.txt - discovery files (~2%, see article N. 02 for why llms.txt is dead)
  • /*.json - structured-data / JSON-LD endpoints, if you serve them
  • Static assets (CSS/JS/images) - track separately; rarely AI-relevant
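A sketch of the classifier using stdlib fnmatch; the bucket names, patterns, and first-match-wins ordering are assumptions mirroring the list above, so adjust them to your own URL scheme. Feed the resulting buckets into a Counter per bot family for Step 05.

import fnmatch

# Ordered pattern buckets; first match wins. Patterns mirror the list above.
URL_BUCKETS = [
    ("feed",      ["/feed*", "/rss*", "/atom*"]),
    ("sitemap",   ["/sitemap*.xml"]),
    ("discovery", ["/robots.txt", "/llms.txt"]),
    ("json",      ["*.json"]),
    ("static",    ["*.css", "*.js", "*.png", "*.jpg", "*.svg", "*.woff*"]),
]

def classify_path(path):
    """Map a request path to a consumption bucket; anything left is 'html'."""
    path = path.split("?", 1)[0]                 # drop the query string
    for bucket, patterns in URL_BUCKETS:
        if any(fnmatch.fnmatch(path, pattern) for pattern in patterns):
            return bucket
    return "html"                                # .html pages and no-extension pages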
Step 05

CONSUMPTION HIERARCHY.

Calculate the percentage breakdown across URL pattern types. Compare to the benchmark from the 47-site network (~40% RSS, ~25% HTML, ~14% sitemap, ~9% schema, ~6% images, ~4% PDF, ~2% robots/llms).

Significant deviation from benchmark = a finding. "Your RSS share is 5% instead of 40%" usually means your feed is broken or stripped, costing you AI fetcher engagement.
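A sketch of the benchmark comparison; the shares are the figures quoted above, the bucket keys have to line up with whatever classifier you used in Step 04, and the 10-point deviation threshold is an arbitrary assumption to tune.

# Benchmark shares from the 47-site network, as quoted above (percent).
BENCHMARK = {"feed": 40.0, "html": 25.0, "sitemap": 14.0, "schema": 9.0,
             "images": 6.0, "pdf": 4.0, "discovery": 2.0}

def consumption_report(bucket_counts, threshold=10.0):
    """Print each bucket's share next to the benchmark and flag large deviations.

    bucket_counts is a dict/Counter of {bucket: request count} built in Step 04.
    """
    total = sum(bucket_counts.values()) or 1
    for bucket, benchmark_share in BENCHMARK.items():
        share = 100.0 * bucket_counts.get(bucket, 0) / total
        flag = "  <-- investigate" if abs(share - benchmark_share) > threshold else ""
        print(f"{bucket:<10} {share:5.1f}%  (benchmark {benchmark_share:.0f}%){flag}")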

Step 06

FLAG SILENT BLOCKS.

For each bot family, check the response code distribution. Healthy bot traffic looks like:

  • 200 (OK) - the majority
  • 304 (Not Modified) - common for repeat visits
  • 4xx - below 1%
  • 5xx (server errors) - near zero
  • 429 (rate-limited) - near zero

Deviations map to specific failures:

  • 403 spike - Cloudflare or WAF block. The most common silent failure (see article N. 09).
  • 429 spike - aggressive rate-limiting. Tune Cloudflare or origin rate-limit thresholds.
  • 404 wave - the bot is following stale links. Check sitemap and internal-link health.
  • 5xx wave - origin instability under bot load. Investigate capacity.
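A sketch of the response-code check, reusing the Step 02 and Step 03 helpers; the 1% threshold and the code-to-diagnosis labels simply mirror the breakdown above.

from collections import Counter, defaultdict

def status_distribution(records):
    """Response-code counts per operator, using the Step 02/03 helpers."""
    dist = defaultdict(Counter)
    for rec in records:
        bot = match_ai_bot(rec["user_agent"])
        if bot is None:
            continue
        dist[OPERATORS.get(bot, "Other")][rec["status"]] += 1
    return dist

def flag_silent_blocks(dist, threshold=0.01):
    """Flag operators whose 403/429/404/5xx share exceeds `threshold` (default 1%)."""
    diagnoses = [(("403",), "WAF/CDN block"),
                 (("429",), "rate limiting"),
                 (("404",), "stale links"),
                 (("5",),   "origin errors")]
    for operator, codes in sorted(dist.items()):
        total = sum(codes.values())
        for prefixes, label in diagnoses:
            hits = sum(n for code, n in codes.items() if code.startswith(prefixes))
            if total and hits / total > threshold:
                print(f"{operator}: {hits}/{total} requests -> {label}")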
Step 07

OUTPUT THE ACTION LIST.

Translate findings into a prioritised fix list:

  • Tier 1 (30 days): silent blocks, broken feeds, missing Bing setup. Highest impact, lowest effort.
  • Tier 2 (90 days): schema gaps, RSS enrichment, robots.txt rationalisation, sitemap accuracy. Medium impact, medium effort.
  • Tier 3 (ongoing): content cadence, internal linking, citation tracking. Lower per-task impact but compounds over time.
SEVEN STEPS. EVERY AUDIT.
NO EXCEPTIONS.

WHY THE METHODOLOGY IS PUBLIC.

Anyone capable of running this on their own site can. The audit's price is not the methodology - that is here, free. The price is the calibration against the broader 47-site benchmark, the experience to interpret edge cases, the prioritisation across the seven dimensions of the score, and the deliverable format that hands cleanly to a developer.

If you run this and want a second opinion against the benchmark, that is what the audit is for. If you do not need the benchmark, the methodology is enough. Either path is honest.

Stop Guessing What AI Sees

MEASURE THE LEVERS
THAT ACTUALLY EXIST.

If you want this methodology applied to your specific site - your real logs, your real citation data, your real fix list - the audit is the productised way to do it.