47 SITES. 9 INDUSTRIES.
YEAR ONE.

The flagship piece. 47 sites. 9 industries. 365 days of bot logs. The full network design, instrumentation, and seven findings - open methodology, reproducible, with the limitations and caveats stated honestly. This is what powers every audit I ship.

Sites
47
Industries
9
Days Logged
365
Findings
7

WHY PUBLISH THIS.

AI visibility research in 2026 is dominated by vendor-published claims with thin methodology and tool-published statistics with vendor incentives. There is very little independent measurement at scale. This network exists to fill that gap.

The network has run for 12 months. The findings have informed every audit I ship. The methodology is documented here in enough detail for anyone to replicate it on their own infrastructure. The data caveats are stated honestly - this is research, not marketing.

THE NETWORK DESIGN.

47 production websites across 9 industries, deliberately chosen for variation.

INSTRUMENTATION.

Logging is at the CDN layer for full coverage including blocked requests. Cloudflare Logpush on most sites; Fastly on a few. Logs ship to R2 / S3 hourly, get parsed by a stdlib Python pipeline (article N. 17), and land in DuckDB for analysis.
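The parse step is small enough to sketch. A minimal stdlib version, assuming NDJSON Logpush records with typical Cloudflare HTTP-request field names (ClientRequestUserAgent, ClientRequestPath, EdgeResponseStatus, EdgeStartTimestamp in Unix nanoseconds); the field names and the timestamp unit are assumptions here, since Logpush field sets are configurable:

```python
import json
from datetime import datetime, timezone

# Substrings that mark an AI fetcher. An illustrative subset, not the full list.
AI_UA_MARKERS = ("GPTBot", "ClaudeBot", "PerplexityBot",
                 "Google-Extended", "Bytespider", "CCBot")

def parse_logpush_line(line: str):
    """Turn one NDJSON Logpush record into a flat row, or None for non-AI traffic.

    Field names and the nanosecond timestamp unit are assumptions about a
    typical Cloudflare Logpush HTTP-requests dataset.
    """
    rec = json.loads(line)
    ua = rec.get("ClientRequestUserAgent", "")
    if not any(marker in ua for marker in AI_UA_MARKERS):
        return None
    ts = datetime.fromtimestamp(rec["EdgeStartTimestamp"] / 1e9, tz=timezone.utc)
    return {
        "ts": ts.isoformat(),
        "ua": ua,
        "path": rec.get("ClientRequestPath", ""),
        "status": rec.get("EdgeResponseStatus", 0),
    }
```

Rows in this shape append cleanly to a DuckDB table, or can be written back out as NDJSON for a single bulk load.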

Every AI request is classified along three axes: bot family (OpenAI, Anthropic, Perplexity, Google, ByteDance, Meta, Apple, Microsoft, Common Crawl), verification status (verified vs spoofed, via a reverse-DNS round-trip on a 1% sample; see article N. 14), and request type (RSS, sitemap, HTML, schema endpoint, robots.txt, llms.txt, image, PDF).
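The family and request-type classifiers can be sketched as lookup tables; the user-agent tokens and path rules below are illustrative stand-ins, not the production lists, and reverse-DNS verification is a separate step:

```python
# User-agent substrings mapped to bot families. Token spellings are
# illustrative; real UA strings should be checked against vendor docs.
BOT_FAMILIES = {
    "GPTBot": "OpenAI", "ChatGPT-User": "OpenAI", "OAI-SearchBot": "OpenAI",
    "ClaudeBot": "Anthropic",
    "PerplexityBot": "Perplexity",
    "Google-Extended": "Google", "Googlebot": "Google",
    "Bytespider": "ByteDance",
    "meta-externalagent": "Meta",
    "Applebot": "Apple",
    "bingbot": "Microsoft",
    "CCBot": "Common Crawl",
}

def bot_family(ua: str) -> str:
    """First matching family token wins; everything else is 'unknown'."""
    for token, family in BOT_FAMILIES.items():
        if token.lower() in ua.lower():
            return family
    return "unknown"

def request_type(path: str) -> str:
    """Bucket a request path into the endpoint classes used in the findings."""
    p = path.lower().split("?")[0]
    if p == "/robots.txt":
        return "robots.txt"
    if p == "/llms.txt":
        return "llms.txt"
    if "sitemap" in p and p.endswith(".xml"):
        return "sitemap"
    if p.endswith((".rss", ".atom")) or any(t in p for t in ("/feed", "/rss", "/atom")):
        return "rss"
    if p.endswith((".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg")):
        return "image"
    if p.endswith(".pdf"):
        return "pdf"
    if p.endswith((".json", ".jsonld")) or "/schema" in p:
        return "schema"
    return "html"
```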

Citation tracking runs separately. 100 queries per site per quarter, run across ChatGPT, Claude, Perplexity, Google AI Overviews. Manual screenshot capture (per article N. 31). Citations extracted to a tracking spreadsheet, joined back to the bot-log data for input-output correlation analysis.


FINDING 01: LLMS.TXT IS NEVER REQUESTED.

Across all 47 sites, across 365 days, across every major AI user-agent: zero requests for /llms.txt. Not low - zero. The most replicable finding in the project; verifiable by anyone with server logs in 10 minutes. Detailed in article N. 02.
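Verifying it on your own logs is a one-function job; a sketch against combined-format access log lines:

```python
import re

# Matches "GET /llms.txt" (or HEAD) inside common/combined-format log lines.
LLMS_TXT = re.compile(r'"(?:GET|HEAD) /llms\.txt[ ?]')

def count_llms_txt_requests(log_lines):
    """Count requests for /llms.txt; Finding 01 predicts this returns 0."""
    return sum(1 for line in log_lines if LLMS_TXT.search(line))
```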


FINDING 02: RSS IS THE TOP ENDPOINT.

Approximately 40% of AI fetcher requests went to RSS / Atom endpoints. Detailed in article N. 16.


FINDING 03: THE CONSUMPTION HIERARCHY.

Bot consumption ranked: RSS ~40%, HTML ~25%, sitemap.xml ~14%, schema endpoints ~9%, images ~6%, PDF ~4%, robots.txt ~2% (llms.txt: zero, per Finding 01). The discovery layer (feeds, sitemaps, robots.txt) is roughly half of all AI bot traffic.
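Given labelled request types, the hierarchy is just a frequency table. A sketch:

```python
from collections import Counter

def consumption_shares(request_types):
    """Percentage share per request type, largest first."""
    counts = Counter(request_types)
    total = sum(counts.values())
    return {rtype: round(100 * n / total, 1)
            for rtype, n in counts.most_common()}
```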


FINDING 04: THE BOTS BEHAVE DIFFERENTLY.

ChatGPT (across its 3 bots), Claude (across its 3 bots), Perplexity, Bytespider, and Google all have distinct crawl personalities. Detail in articles N. 06, N. 07, N. 13.


FINDING 05: FRESHNESS COMPOUNDS.

Sites publishing weekly get AI bot traffic at roughly 3x the rate of sites publishing less often. Sites silent for 60+ days drop to baseline within two weeks.


FINDING 06: SILENT BLOCKS ARE COMMON.

Approximately 30% of sites in the broader audit sample (beyond the instrumented 47) had at least one major AI bot silently blocked without the owner realising. Cloudflare bot-fight mode, WAF rules, copy-pasted robots.txt are the usual culprits. Detail in article N. 09.
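The robots.txt culprit, at least, is checkable offline with the stdlib robotparser. A sketch (the agent list is an illustrative subset; WAF and bot-fight blocks will not show up here, only in the logs):

```python
from urllib import robotparser

# Illustrative subset of AI user-agent tokens to test against the rules.
AI_AGENTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot",
             "Google-Extended", "CCBot"]

def blocked_ai_agents(robots_txt: str, path: str = "/") -> list[str]:
    """Return the AI agents that the given robots.txt disallows for `path`."""
    rp = robotparser.RobotFileParser()
    rp.modified()  # record a fetch time so can_fetch() consults the parsed rules
    rp.parse(robots_txt.splitlines())
    return [ua for ua in AI_AGENTS if not rp.can_fetch(ua, path)]
```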


FINDING 07: AI VISIBILITY IS MOSTLY TECHNICAL.

Roughly 70% of what determines AI visibility is technical infrastructure (the six input dimensions); roughly 30% is content quality. This inverts the standard SEO content-vs-technical ratio.

LIMITATIONS.

Honesty requires stating what this research does NOT prove.

WHAT'S NEXT (YEAR TWO).

Expansions of the network are planned for the next 12 months.

OPEN METHODOLOGY.

The instrumentation stack (article N. 17) is open. The audit methodology (article N. 15) is open. The citation tracking methodology (article N. 31) is open. The 7-Dimension AI Visibility Score (article N. 03) is open.

What is not open: the specific 47 sites in the network (most are client sites; client privacy is non-negotiable). The pooled findings are public; individual site identities are not.

If you want to apply the methodology to your own site or a competitor set, the playbook is here. If you want the work done with the calibration of the broader network applied, that is what the audit is for.

THE BOTTOM LINE.

One year of measurement, seven findings, methodology open. This is the foundation everything else on this site is built on. The research will continue in year two; expect findings to refine, pivot, and occasionally reverse as the AI bot landscape moves. That is what research looks like: a living methodology, not a marketing claim frozen in time.

Stop Guessing What AI Sees

MEASURE THE LEVERS
THAT ACTUALLY EXIST.

If you want this methodology applied to your specific site - your real logs, your real citation data, your real fix list - the audit is the productized way to do it.