
WHAT AI BOTS
ACTUALLY READ.

I instrumented 47 websites across 9 industries and watched what AI fetchers actually consume. Seven findings that contradict most of what the industry recommends - including the one about llms.txt that nobody wants to hear.

Sites: 47
Industries: 9
Findings: 7
Timeframe: 90 days

WHY THIS REPORT EXISTS.

Every SEO blog in 2026 has an opinion about "optimizing for AI search". Most of those opinions are recycled best guesses with no underlying data. I wanted to know what AI fetchers actually do - not what vendors claim they do - so I instrumented my own research network of 47 websites across 9 industries and logged every AI bot visit for 90 days.

The network spans law, medical, finance, SaaS, e-commerce, news, documentation, media, and personal sites - with deliberate variation in CMS, page count, schema implementation, and content freshness. Every AI user-agent gets logged at the CDN level, with full path, response code, and timing data.

A note on numbers: everything below is expressed in percentages and relative rates, not raw counts. My sample is finite, and absolute numbers would misrepresent the real question: which signals and formats matter proportionally, not which specific sites got the most traffic.

Finding 01.
The one nobody wants to hear

LLMS.TXT HAS NEVER BEEN REQUESTED.

~0%

Across all 47 sites, across 90 days, across every major AI user-agent logged - the `/llms.txt` endpoint was requested zero times. Not low, not rare - zero. This matches what anyone with server logs can verify in 10 minutes.
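
That 10-minute check is trivial if you have raw access logs. A minimal sketch in Python, assuming combined-format logs where the request line ("GET /path HTTP/1.1") is the first quoted field - adjust the parsing for your CDN's log format:

```python
# Minimal sketch: count requests to /llms.txt in a combined-format access log.
# Assumes the request line is the first double-quoted field; adjust for your CDN's format.
import sys

hits = 0
with open(sys.argv[1] if len(sys.argv) > 1 else "access.log") as log:
    for line in log:
        quoted = line.split('"')
        if len(quoted) > 1 and " /llms.txt" in quoted[1]:
            hits += 1

print(f"/llms.txt requests: {hits}")
```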

The llms.txt "standard" was proposed in 2024 as a convention for publishers to signal AI-readable content. Every major AI company was invited to adopt it. None have. Not OpenAI, not Anthropic, not Google, not Perplexity, not Mistral. The fetchers in the wild don't look for the file, don't read it when it's there, and don't change their behavior based on its contents.

If someone is charging you to implement llms.txt as "AI visibility optimization", they are either unaware of this fact or actively deceiving you. It costs nothing to add the file - just don't expect it to do anything, and don't build strategy around it.

Finding 02.
What's actually most consumed

RSS IS THE MOST-CONSUMED ENDPOINT.

~40%

Roughly 40% of all AI fetcher requests across the network went to RSS and feed endpoints - `/feed`, `/rss`, `/feed.xml`, `/atom.xml`. This beat HTML pages, beat sitemaps, beat schema-heavy URLs, beat everything.

Why RSS? Because it's structured, predictable, chronological, and cheap to parse. When an AI system wants to know "what's new on this site?", a feed gives it the answer in a fraction of the bandwidth of crawling HTML. Sites without feeds show up less often in AI crawl patterns. Sites with malformed feeds (broken XML, missing dates) show up even less.

The actionable insight: make sure you have a valid feed, make sure it's in your robots.txt allowed paths, and make sure the dates in it are actually accurate. Most CMS defaults break this in subtle ways.
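
If you want to sanity-check your own feed, here's a short sketch using the third-party feedparser library - the feed URL is a placeholder, and this only catches parse errors and missing dates, not dates that are present but wrong:

```python
# Sketch: parse a feed and flag malformed XML or entries with no date.
# Requires the third-party "feedparser" package; the feed URL is a placeholder.
import feedparser

feed = feedparser.parse("https://example.com/feed.xml")

if feed.bozo:  # feedparser sets bozo when the XML didn't parse cleanly
    print(f"Feed is malformed: {feed.bozo_exception}")

for entry in feed.entries:
    if not (entry.get("published") or entry.get("updated")):
        print(f"Entry missing a date: {entry.get('link', '(no link)')}")
```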

Finding 03.
The consumption hierarchy

A RANKED ORDER OF WHAT BOTS CONSUME.

RSS / Feed: ~40%
HTML Articles: ~25%
Sitemap XML: ~14%
Schema / JSON-LD: ~9%
Images / Media: ~6%
PDF / Docs: ~4%
robots.txt / llms.txt: ~2%

The single biggest surprise in this data is how much AI bots care about structured feeds and discovery endpoints (RSS, sitemaps, robots.txt) relative to actual content pages. They're probing the shape of your site first, then deciding what to read. If your discovery layer is broken, the content layer barely matters.

Finding 04.
Bot personalities differ

CHATGPT, CLAUDE, AND PERPLEXITY BEHAVE DIFFERENTLY.

CHATGPT-USER
On-demand. Visits pages specifically when users ask about them. Strong preference for fresh content. Hits HTML directly more than feeds.
CLAUDE-WEB
Thoughtful, slower. Reads fewer pages but spends longer on each. Higher retry rate on 404s. Respects rate limits unusually well.
PERPLEXITYBOT
Aggressive, broad. Highest total volume. Crawls feeds and sitemaps first, then hits everything. Most likely to trigger rate limits on smaller sites.

Treating "AI bots" as a single category is a mistake. Each fetcher has distinct behavior, distinct priorities, and distinct failure modes. Optimization that helps one can actively hurt another - for example, aggressive rate-limiting to control Perplexity can accidentally block ChatGPT's infrequent-but-user-driven visits.

Finding 05.
Freshness beats everything

FRESH CONTENT COMPOUNDS. STALE SITES DISAPPEAR.

~3x

Sites that publish new content at least weekly get AI bot visits at roughly 3x the rate of sites that publish less often. This isn't controversial - it mirrors traditional Googlebot behavior. What's notable is how quickly the effect decays.

A site that goes silent for 60+ days sees its AI bot traffic drop to near-baseline levels within roughly two weeks. The bots don't stop checking entirely, but they check far less often and they pull far fewer pages when they do.

The implication: AI visibility is not a one-time optimization. It's a freshness signal that needs feeding. If you can't commit to regular publishing, other signals (link quality, schema completeness, authoritative backlinks) can only compensate so much.

Finding 06.
The silent failure

~30% OF SITES ARE BLOCKING AI BOTS BY ACCIDENT.

~30%

When I added a fresh set of comparable sites to the network (beyond the instrumented 47), about 3 in 10 were silently blocking at least one major AI bot without the owner realizing. The usual culprits:

robots.txt with overzealous rules - often copy-pasted from an older template that predates AI bots.
Cloudflare bot fight mode toggled on - blocks everything that isn't a verified search engine.
WAF rules blocking user-agents with "bot" in the string.
403s from aggressive rate-limiting - fine for human traffic, catastrophic for bots that expect retries.

The scariest part: none of these show up in Google Search Console or typical monitoring tools. You have to check the logs directly. Most site owners discover the problem months after it started, if at all.
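
One quick self-test, with a caveat: it only catches user-agent-based blocks. Cloudflare and WAF rules often key on IP reputation or behavior instead, so a clean result here doesn't prove you're unblocked. The sketch below requests your own site with a few AI user-agent strings (plus a plain browser string as a control) and compares status codes - the site URL is a placeholder:

```python
# Sketch: probe your own site with AI user-agent strings and compare response codes.
# Only detects user-agent-based blocks; IP-reputation or behavioral blocks won't show up.
import urllib.error
import urllib.request

SITE = "https://example.com/"  # placeholder - point at your own site
USER_AGENTS = [
    "Mozilla/5.0",      # baseline browser-style request for comparison
    "ChatGPT-User",
    "Claude-Web",
    "PerplexityBot",
]

for ua in USER_AGENTS:
    req = urllib.request.Request(SITE, headers={"User-Agent": ua})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            print(f"{ua:15} -> {resp.status}")
    except urllib.error.HTTPError as err:
        print(f"{ua:15} -> {err.code}  (blocked?)")
```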

Finding 07.
The split

AI VISIBILITY IS MOSTLY TECHNICAL, NOT CONTENT.

~70%

Looking across everything above - the feeds, the sitemaps, the accidental blocks, the rate limits, the crawl budgets, the structured data - about 70% of what determines your AI visibility is technical infrastructure. Only about 30% is content quality, content freshness, and content relevance.

This inverts the standard SEO industry framing, which spends 80% of its effort on content and 20% on technical. For AI search, that ratio is backwards. If your technical foundation is broken, the best content in your industry won't save you. If your technical foundation is solid, modest content can still earn solid AI visibility.

The good news: technical problems are fixable. Most of them are fixable in a single engagement. Content is an ongoing commitment. Infrastructure is a project.

SEVEN FINDINGS.
MOST CONTRADICT WHAT THE INDUSTRY RECOMMENDS.

WHAT THIS MEANS FOR YOU.

If you've read this far and you're thinking "I have no idea whether any of this applies to my site", that's not a content problem - that's a measurement problem. You can't know without pulling your own logs and running the tests.

You can do that work yourself. The methodology is straightforward: get CDN or server log access, filter for AI user-agents, check which endpoints they're hitting, verify your feeds and sitemaps parse correctly, and audit your robots.txt against the current list of AI bots. It takes a skilled engineer roughly 40-60 hours to do thoroughly the first time.
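
For the log pass, here's a starting point in the same vein as the earlier sketch - again assuming combined-format access logs, where the request line is the first quoted field and the user-agent is the third. The token list is a partial set; extend it from whatever actually shows up in your logs:

```python
# Sketch: tally which endpoints each AI fetcher hits, from a combined-format access log.
# Assumes the request line is the first quoted field and the user-agent the third; adjust for your CDN.
import sys
from collections import Counter

AI_TOKENS = ["gptbot", "chatgpt-user", "claude-web", "claudebot", "perplexitybot"]
hits = Counter()

with open(sys.argv[1] if len(sys.argv) > 1 else "access.log") as log:
    for line in log:
        fields = line.split('"')
        if len(fields) < 6:
            continue
        request, user_agent = fields[1], fields[5].lower()
        for token in AI_TOKENS:
            if token in user_agent:
                parts = request.split()
                path = parts[1] if len(parts) > 1 else request
                hits[(token, path)] += 1

for (token, path), count in hits.most_common(20):
    print(f"{count:6}  {token:15}  {path}")
```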

Or you can book the audit, where I do all of that for a fixed price and hand back a prioritized fix list. Either way - stop guessing what AI sees.

Two Paths Forward

MEASURE IT. OR HAVE ME MEASURE IT.

Either way, the answer is in the logs. Most sites have never looked.