
BOT TRACKING
INFRASTRUCTURE.

The actual stack I run on the 47-site research network. Cloudflare Logpush + DuckDB + a small Python parser. Free tier viable, full setup in an afternoon, ready to power citation tracking and bot-log audits indefinitely.

Stack Layers: 4
Setup Time: Hours
Cost: Free Tier
Maintenance: Minimal

WHY YOU NEED THIS BEFORE YOU AUDIT.

You cannot audit AI bot behaviour without bot logs. Most sites have no centralised log pipeline; they rely on Cloudflare's dashboard summaries (which truncate, aggregate, and miss the AI-specific user-agent breakdown) or scattered server logs (which require ssh, grep, and tedious aggregation).

The minimal stack below takes one afternoon to set up and one evening of patience while data backfills. Once it is running, every audit, every citation-tracking project, every robots.txt change becomes verifiable in seconds rather than guesswork.

Layer 01

CDN LOG SOURCE.

Cloudflare Logpush is the easiest source. Available on the free plan with limitations; full export on the paid plans starting at $20/month. Logs ship to S3, R2, GCS, or any HTTP endpoint of your choice.
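Creating the job is one API call. A minimal sketch with stdlib urllib, assuming the v4 zone-level Logpush jobs endpoint; the zone ID, token, and bucket are placeholders, the destination ownership-challenge step is omitted, and the option and field names should be checked against Cloudflare's current Logpush docs.

```python
import json
import urllib.request

ZONE_ID = "your-zone-id"      # placeholder
API_TOKEN = "your-api-token"  # placeholder; needs Logpush edit permission

job = {
    "name": "bot-tracking-http-requests",
    "dataset": "http_requests",
    # S3-style destination; R2 and GCS use their own URI schemes (see the docs).
    "destination_conf": "s3://your-bucket/bot-logs?region=us-east-1",
    # Only the fields the parser needs; verify names against the current dataset docs.
    "logpull_options": (
        "fields=EdgeStartTimestamp,ClientRequestHost,ClientRequestURI,"
        "ClientRequestUserAgent,EdgeResponseStatus,EdgeResponseBytes,ClientIP"
        "&timestamps=rfc3339"
    ),
    "enabled": True,
}

req = urllib.request.Request(
    f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/logpush/jobs",
    data=json.dumps(job).encode(),
    headers={"Authorization": f"Bearer {API_TOKEN}",
             "Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))
```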

Alternative sources: Fastly, Akamai, AWS CloudFront. All have similar log-export offerings. The pattern is the same: emit JSON-line logs with one entry per request, including user-agent, URL, status, timestamp, response size, and source IP.

If you do not run a CDN, the next best is the origin server log (nginx, Caddy, Apache). Slightly less complete because anything blocked at the CDN never reaches the origin, but enough for most audits.
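If you are starting from origin logs, the default nginx/Apache "combined" format already carries the minimal field set. A stdlib sketch, assuming the stock combined format (adjust the regex if you run a custom log_format):

```python
import re

# Stock "combined" access-log line: ip - user [time] "request" status bytes "ref" "ua"
COMBINED = re.compile(
    r'(?P<source_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"[^"]*" "(?P<user_agent>[^"]*)"'
)

def parse_combined(line: str):
    """Return the minimal field dict for one access-log line, or None if it does not match."""
    m = COMBINED.match(line)
    return m.groupdict() if m else None

# Rows from here can feed the same classifier / CSV step sketched under Layer 02 below.
```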

Layer 02

PARSER (STDLIB PYTHON).

The parser does three things: read JSON-line logs, classify user-agents into bot families (real bot / spoofed / human), output a normalised CSV or Parquet file. Stdlib only - no Pandas required for the input volumes most sites see.

The user-agent classifier is the only piece worth real effort. It needs to handle the current AI bot list (article no. 10), the deprecated identifiers (so old logs remain classifiable), and a fast path for the verified-IP-range check (article no. 14), run on a sample basis to estimate the spoof rate.
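A minimal sketch of that classifier, stdlib only. The token map is illustrative, not the full list from article no. 10, and the IP ranges are placeholders; load the vendors' published ranges for the real verified-IP check.

```python
import ipaddress
import random

# Illustrative UA-token -> family map; extend with the full current and
# deprecated identifiers so old logs stay classifiable.
BOT_TOKENS = {
    "gptbot": "openai",
    "oai-searchbot": "openai",
    "claudebot": "anthropic",
    "perplexitybot": "perplexity",
    "ccbot": "commoncrawl",
    "bytespider": "bytedance",
    "amazonbot": "amazon",
}

# Placeholder ranges (TEST-NET); replace with each vendor's published IP list.
VERIFIED_RANGES = {
    "openai": [ipaddress.ip_network("192.0.2.0/24")],
}

def classify(user_agent: str, source_ip: str, sample_rate: float = 0.05):
    """Return (bot_family, verified). 'human' means no bot token matched;
    a matched token with verified=False is unchecked or a candidate spoof."""
    ua = (user_agent or "").lower()
    for token, family in BOT_TOKENS.items():
        if token in ua:
            verified = False
            # Verified-IP fast path on a sample of bot hits to estimate the spoof rate.
            if source_ip and random.random() < sample_rate:
                ip = ipaddress.ip_address(source_ip)
                verified = any(ip in net for net in VERIFIED_RANGES.get(family, ()))
            return family, verified
    return "human", False
```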

Output format: one row per request, columns for timestamp, bot_family, user_agent, url, status, bytes, source_ip, verified (boolean from IP check).
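And the read/normalise/write loop around it, reusing classify() from the sketch above. Gzipped JSON-line input and the Logpush field names are assumptions; Parquet output needs an extra library, so the sketch writes CSV.

```python
import csv
import glob
import gzip
import json

COLUMNS = ["timestamp", "bot_family", "user_agent", "url",
           "status", "bytes", "source_ip", "verified"]

def normalise(rec: dict) -> dict:
    """Map one Logpush record to the output columns (url = host + path)."""
    row = {
        "timestamp": rec.get("EdgeStartTimestamp", ""),
        "user_agent": rec.get("ClientRequestUserAgent", ""),
        "url": rec.get("ClientRequestHost", "") + rec.get("ClientRequestURI", ""),
        "status": rec.get("EdgeResponseStatus", ""),
        "bytes": rec.get("EdgeResponseBytes", ""),
        "source_ip": rec.get("ClientIP", ""),
    }
    family, verified = classify(row["user_agent"], row["source_ip"])
    row["bot_family"] = family
    row["verified"] = str(verified).lower()  # "true"/"false" reads as a boolean downstream
    return row

def parse_logs(pattern: str, out_path: str) -> None:
    """Read gzipped JSON-line log files and write one normalised CSV."""
    with open(out_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=COLUMNS)
        writer.writeheader()
        for path in sorted(glob.glob(pattern)):
            with gzip.open(path, "rt") as fh:
                for line in fh:
                    if line.strip():
                        writer.writerow(normalise(json.loads(line)))

# parse_logs("logs/*.log.gz", "requests.csv")
```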

Layer 03

STORE (DUCKDB).

DuckDB is the right tool for this scale. SQLite is fine for under 10M rows; DuckDB is comfortable to 1B+. Importantly, DuckDB queries Parquet files directly, so you can store the parser output as Parquet and query it without a load step.
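A sketch of that, assuming the duckdb Python package (pip install duckdb) and the parser output from above; file paths are placeholders.

```python
import os
import duckdb

os.makedirs("parquet", exist_ok=True)
con = duckdb.connect("bots.duckdb")

# One-off: convert the parser's CSV into Parquet - DuckDB does this natively.
con.execute("""
    COPY (SELECT * FROM read_csv_auto('requests.csv'))
    TO 'parquet/requests-2025-01.parquet' (FORMAT PARQUET)
""")

# Query the Parquet files directly; no load step.
print(con.execute("""
    SELECT bot_family, count(*) AS hits
    FROM 'parquet/*.parquet'
    GROUP BY bot_family
    ORDER BY hits DESC
""").fetchall())
```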

Typical schema: a single requests table with the columns from the parser output, partitioned by date. A second table, sites, maps each domain to the client/owner. Joins are cheap.
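A sketch of that schema on top of the Parquet files from the previous block; the sites row is a placeholder, and a hive-style directory layout (one folder per date) is one easy way to get the date partitioning.

```python
import duckdb

con = duckdb.connect("bots.duckdb")

# Expose the Parquet output as the requests table; with a date=YYYY-MM-DD folder
# layout you could add hive_partitioning=true and glob 'parquet/*/*.parquet'.
con.execute("""
    CREATE OR REPLACE VIEW requests AS
    SELECT * FROM read_parquet('parquet/*.parquet')
""")

# Map each domain to its client/owner so per-site rollups are a cheap join.
con.execute("CREATE TABLE IF NOT EXISTS sites (domain VARCHAR, client VARCHAR)")
con.execute("INSERT INTO sites VALUES ('example.com', 'Client A')")  # placeholder
```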

Common queries: AI bot traffic share by site, by week, by user-agent. Silent-block detection (pages returning 403 to verified bot IPs). Consumption hierarchy (RSS / sitemap / HTML / schema percentages).
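Two of those as sketches, using the requests view and sites table from above. The join assumes the url column starts with the host (as the parser sketch writes it); the consumption-hierarchy query would follow the same shape with a CASE over the url.

```python
import duckdb

con = duckdb.connect("bots.duckdb")

# AI bot traffic share by site and week.
share = con.execute("""
    SELECT s.client,
           date_trunc('week', CAST(r."timestamp" AS TIMESTAMP)) AS week,
           100.0 * count(*) FILTER (WHERE r.bot_family <> 'human') / count(*)
               AS ai_bot_share_pct
    FROM requests r
    JOIN sites s ON split_part(r.url, '/', 1) = s.domain
    GROUP BY 1, 2
    ORDER BY 1, 2
""").fetchall()

# Silent-block detection: pages answering 403 to verified bot hits.
blocked = con.execute("""
    SELECT r.url, r.bot_family, count(*) AS blocked_hits
    FROM requests r
    WHERE r.verified AND r.status = 403
    GROUP BY 1, 2
    ORDER BY blocked_hits DESC
""").fetchall()
```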

Layer 04

VISUALISE (STATIC DASHBOARDS).

For an audit deliverable, a static HTML dashboard generated from DuckDB query output is sufficient. Run the queries, render to HTML with a templating step, ship as part of the audit PDF or a private URL.
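A sketch of that templating step with stdlib string.Template, assuming the DuckDB setup above; swap in whatever queries the deliverable needs.

```python
import duckdb
from string import Template

PAGE = Template("""<!doctype html>
<html><head><meta charset="utf-8"><title>AI bot traffic</title></head>
<body><h1>AI bot hits by family and week</h1>
<table><tr><th>Week</th><th>Bot family</th><th>Hits</th></tr>
$rows
</table></body></html>""")

con = duckdb.connect("bots.duckdb")
result = con.execute("""
    SELECT date_trunc('week', CAST("timestamp" AS TIMESTAMP)) AS week,
           bot_family, count(*) AS hits
    FROM requests
    WHERE bot_family <> 'human'
    GROUP BY 1, 2
    ORDER BY 1, hits DESC
""").fetchall()

rows = "\n".join(
    f"<tr><td>{week}</td><td>{family}</td><td>{hits}</td></tr>"
    for week, family, hits in result
)
with open("report.html", "w") as fh:
    fh.write(PAGE.substitute(rows=rows))
```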

For ongoing monitoring across the 47-site network, Grafana with the DuckDB source plugin works. Less polish than commercial dashboards but adequate. The point of the stack is to enable analysis, not to replace Datadog.

If you are running this for a single client, skip Grafana entirely. Static HTML reports per quarter are enough.

MINIMAL VIABLE SETUP.

If you want to spin this up tomorrow, the smallest configuration that delivers value: one Logpush job (or your origin server log) shipping JSON-line logs, the stdlib parser producing normalised CSV or Parquet output, DuckDB querying those files directly, and a static HTML report per quarter. Grafana and anything fancier can wait.

THE BOTTOM LINE.

Bot tracking infrastructure is not optional if you want to operate audits seriously. The stack above is what runs the 47-site network and powers every audit deliverable. It is open, replicable, and cheap. Set it up once, maintain it monthly, and you have the substrate for every measurement question that follows.

Stop Guessing What AI Sees

MEASURE THE LEVERS
THAT ACTUALLY EXIST.

If you want this methodology applied to your specific site - your real logs, your real citation data, your real fix list - the audit is the productised way to do it.