<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:atom="http://www.w3.org/2005/Atom"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Misha Manko · Research</title>
    <link>https://mishamanko.com/research</link>
    <description>Independent research on how AI models actually read the web. Raw findings from a 47-site bot research network. Published when there is something measured worth saying.</description>
    <language>en</language>
    <copyright>Copyright 2026 Misha Manko</copyright>
    <managingEditor>hello@mishamanko.com (Misha Manko)</managingEditor>
    <webMaster>hello@mishamanko.com (Misha Manko)</webMaster>
    <category>AI</category>
    <category>SEO</category>
    <category>Web Research</category>
    <generator>scripts/build-feed.py</generator>
    <ttl>1440</ttl>
    <atom:link href="https://mishamanko.com/research/feed.xml" rel="self" type="application/rss+xml"/>
    <lastBuildDate>Mon, 27 Apr 2026 09:52:31 +0000</lastBuildDate>
    <pubDate>Mon, 27 Apr 2026 09:30:00 +0000</pubDate>
    <image>
      <url>https://mishamanko.com/icon-512.png</url>
      <title>Misha Manko · Research</title>
      <link>https://mishamanko.com/research</link>
      <width>512</width>
      <height>512</height>
    </image>

    <item>
      <title>Schema Markup That AI Models Actually Use</title>
      <link>https://mishamanko.com/research/schema-markup-that-ai-models-actually-use</link>
      <guid isPermaLink="true">https://mishamanko.com/research/schema-markup-that-ai-models-actually-use</guid>
      <pubDate>Mon, 27 Apr 2026 09:30:00 +0000</pubDate>
      <dc:creator>Misha Manko</dc:creator>
      <author>hello@mishamanko.com (Misha Manko)</author>
      <description>Six schema types do 80% of the work. The other 700+ are theatre. A measurement-led guide to AI-visibility schema based on 90 days of fetcher logs across 47 sites.</description>
      <content:encoded><![CDATA[
<p><em>Schema is the most ridden hobby horse in SEO. Every plugin promises rich results. Every audit flags missing schema. After 90 days of measured AI fetcher traffic across 47 sites, the picture of what actually matters is much narrower than the average optimization checklist.</em></p>

<p><strong>Sample:</strong> 47 sites · 9 industries · 90 days · ~9% of fetcher traffic on schema endpoints</p>

<h2>Why Schema Matters Disproportionately</h2>
<p>Schema is roughly 9% of all AI fetcher traffic across the network - much smaller than RSS (~40%) or HTML (~25%). But the citation share is far higher than that 9% suggests. When an AI model wants to extract a fact about your business, your product, or your author, it goes to the schema first. It is the highest-density-per-byte source of structured truth on most websites.</p>
<p>That is also why getting it wrong is expensive. Bad schema does not just fail to help - it can <strong>actively contradict</strong> the visible page content, and AI models will sometimes prefer the schema over the prose. If your <code>Organization</code> name in JSON-LD is misspelled, that misspelling can show up in citations.</p>

<h2>Finding 01 - JSON-LD wins. Microdata is dead. RDFa is irrelevant.</h2>
<p>Of the schema requests AI fetchers made across the network, the format breakdown was unambiguous:</p>
<ul>
  <li><strong>JSON-LD</strong> in a <code>&lt;script type="application/ld+json"&gt;</code> tag - ~95% of all AI-extracted structured data.</li>
  <li><strong>Microdata</strong> (the <code>itemscope</code>/<code>itemtype</code>/<code>itemprop</code> attribute soup) - ~4%, mostly on legacy WordPress and Shopify themes.</li>
  <li><strong>RDFa</strong> - ~1%, almost entirely on academic and government sites.</li>
</ul>
<p>If your CMS or theme is still serving microdata, that is fine - you don't need to rip it out. But every new schema you add should be JSON-LD, and any time you touch a template, replace microdata with JSON-LD on the way through. The parsing path is cleaner, the validation tooling is better, and the AI fetchers strongly prefer it.</p>
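<p>For reference, the JSON-LD shape the fetchers prefer is trivial to server-render. A minimal Python sketch - the organization name and URLs here are placeholders, not anything from the measured network:</p>

```python
import json

def jsonld_script(data: dict) -> str:
    """Serialize a schema.org dict into a JSON-LD script tag.

    Server-rendered output like this is what AI fetchers see in raw
    HTML; the same data spread across microdata attributes parses worse.
    """
    payload = json.dumps(data, indent=2, ensure_ascii=False)
    return f'<script type="application/ld+json">\n{payload}\n</script>'

# Placeholder entity - swap in your real organization details.
org = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "@id": "https://example.com/#organization",
    "name": "Example Co",
    "url": "https://example.com/",
}

print(jsonld_script(org))
```

<p>Any templating system can do the same thing; the point is that the block is emitted server-side as a single well-formed JSON payload, not assembled from attributes scattered through the markup.</p>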

<h2>Finding 02 - Six schema types do most of the work</h2>
<p>Across 47 sites and 9 industries, the schema types that actually get extracted and cited cluster heavily. Here is the rough rank order of AI fetcher engagement with each type:</p>
<ul>
  <li><strong>Organization + Person</strong> (entity grounding) - ~28%. Used to anchor "who is this site" and "who wrote this" - foundational for citation attribution.</li>
  <li><strong>Article + NewsArticle</strong> (publication metadata) - ~22%. Title, author, datePublished, dateModified, body excerpts. The default for any blog or research post.</li>
  <li><strong>Product + Offer</strong> (commerce) - ~18%. Price, availability, brand, SKU, ratings. Heavy use by retail-aware AI surfaces.</li>
  <li><strong>FAQPage + QAPage</strong> (Q-and-A extraction) - ~14%. Direct lift into AI answers when the question matches user intent.</li>
  <li><strong>BreadcrumbList</strong> (site structure) - ~10%. AI models use these to understand site hierarchy and topical clustering.</li>
  <li><strong>Recipe</strong> (well-supported, narrow) - ~5%. Heavy for cooking sites, irrelevant otherwise. Listed for completeness.</li>
  <li><strong>Other types</strong> - ~3% combined. Long tail of low-engagement schemas.</li>
</ul>
<p>Six types account for ~97% of all AI schema engagement in the network. The other 700+ schema.org types share the remaining 3%. This is the part of the data nobody wants to talk about, because it implies that 90% of the schema configuration in plugins like Yoast and RankMath is doing nothing useful.</p>

<h2>Finding 03 - The schema types that are theatre</h2>
<p>Some schema types get pitched constantly in SEO blogs but produce no measurable AI engagement on the sites where they have been deployed. These are the ones I would not spend any time on:</p>
<ul>
  <li><strong>Speakable</strong> - originally proposed for voice assistants. Effectively unused by current AI surfaces. Adding it does nothing.</li>
  <li><strong>ClaimReview</strong> - only useful if you are a verified fact-checker (Snopes, PolitiFact, etc.). For a regular publisher, it is ignored or - given Google's verification requirements - actively penalized.</li>
  <li><strong>HowTo</strong> - Google deprecated the rich result. AI models read it inconsistently. The rare engagement does not justify the markup overhead.</li>
  <li><strong>VideoObject</strong> on a non-YouTube embed - mostly inert. If your video is on YouTube, the YouTube schema travels with the embed and your local <code>VideoObject</code> is redundant. If your video is self-hosted, AI fetchers rarely engage with it.</li>
  <li><strong>JobPosting</strong> on a non-aggregator site - useful on Indeed or LinkedIn, irrelevant on a single-company careers page that already has a few openings.</li>
  <li><strong>CourseInstance, EducationalOccupationalCredential, MonetaryGrant</strong>, and the long tail of micro-types - effectively no engagement.</li>
</ul>
<p>The "schema everything" trap costs hours and produces nothing. If your audit consultant is recommending you add 15 schema types to every page, ask them to demonstrate the engagement. They cannot.</p>

<h2>Finding 04 - The mistakes that nuke your schema</h2>
<p>Even the high-signal schema types fail when they are deployed badly. The four most common mistakes I see in audits:</p>
<ul>
  <li><strong>Multiple disjoint @graph fragments</strong> - several JSON-LD blocks on the same page that do not reference each other. AI fetchers struggle to merge them, and may pick one and ignore the rest. Always use a single <code>@graph</code> array with internal <code>@id</code> references.</li>
  <li><strong>Missing @id resolution</strong> - declaring an <code>Organization</code> without a stable <code>@id</code> URL means every page on your site presents a fresh, unconnected entity. Use canonical <code>@id</code> values like <code>https://yoursite.com/#organization</code> so AI models can deduplicate across pages.</li>
  <li><strong>Schema that contradicts visible content</strong> - JSON-LD says the price is $99, the page says $79. AI models notice. Some pick the schema, some pick the visible price, and the inconsistency itself can suppress citation. Schema should mirror the page, not invent it.</li>
  <li><strong>Dynamic schema injected after JS render</strong> - schema added by client-side JavaScript shows up in browser DevTools but not in raw HTML. Some AI fetchers do not execute JS. Server-render your schema or you are gambling on whose crawler shows up.</li>
</ul>
<p>Most of these are silent failures. Your schema validates, your rich result test passes, and AI engagement is still zero. The validators check syntax, not semantics. The only real test is what the bots actually do.</p>
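<p>Two of these mistakes - disjoint fragments and JS-injected schema - can be caught by fetching your page the way a non-JS fetcher would and counting the JSON-LD blocks in the raw HTML. A minimal sketch; the regex extraction is a convenience for well-formed markup, not a full HTML parser:</p>

```python
import json
import re

JSONLD_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)

def audit_jsonld(raw_html: str) -> dict:
    """Count JSON-LD blocks in raw (pre-JavaScript) HTML."""
    blocks = []
    for match in JSONLD_RE.findall(raw_html):
        try:
            blocks.append(json.loads(match))
        except json.JSONDecodeError:
            pass  # invalid JSON-LD is as bad as missing JSON-LD
    return {
        "blocks": len(blocks),
        # One block containing an @graph is the shape fetchers merge best.
        "single_graph": len(blocks) == 1 and "@graph" in blocks[0],
    }

html = '<script type="application/ld+json">{"@context":"https://schema.org","@graph":[]}</script>'
print(audit_jsonld(html))  # {'blocks': 1, 'single_graph': True}
```

<p>Run it against the raw HTTP response, not the DOM in DevTools - if the block count differs between the two, your schema is being injected client-side.</p>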

<h2>Finding 05 - The minimum viable AI-visibility schema</h2>
<p>If you implement nothing else, implement these five types correctly and you will cover roughly 80% of the AI schema engagement that any normal site can earn:</p>
<ul>
  <li><strong>Organization</strong> at site root, with stable <code>@id</code>, name, URL, logo, and sameAs links to your social profiles.</li>
  <li><strong>Person</strong> for the site owner or each named author, also with stable <code>@id</code>, name, URL, and sameAs.</li>
  <li><strong>WebSite</strong> with a <code>SearchAction</code> in <code>potentialAction</code> - signals you have an internal search and lets some AI surfaces deep-link queries.</li>
  <li><strong>BreadcrumbList</strong> on every non-root page. Cheap to generate, broadly consumed.</li>
  <li><strong>Article</strong> on every blog or research post, with <code>headline</code>, <code>author</code> (referencing your Person <code>@id</code>), <code>datePublished</code>, <code>dateModified</code>, and <code>mainEntityOfPage</code>.</li>
</ul>
<p>That is it. Five types, deployed correctly, with internal <code>@id</code> resolution, server-rendered, mirroring the visible content. You do not need <code>HowTo</code>, you do not need <code>Speakable</code>, you do not need a separate schema for your favorite social network.</p>
<p>If you are running an e-commerce site, add <code>Product</code> and <code>Offer</code> on product pages. If you are publishing news, add <code>NewsArticle</code> instead of (or alongside) <code>Article</code>. That is the extension. The core stays the same.</p>
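<p>Put together, the minimum viable setup is one <code>@graph</code> per page with internal <code>@id</code> references. A sketch with placeholder values - every name, URL, and date below is illustrative, not prescriptive:</p>

```python
import json

SITE = "https://example.com"  # placeholder domain - substitute your own

# One @graph, every entity carrying a stable @id so AI models can
# deduplicate the same Organization and Person across pages.
graph = {
    "@context": "https://schema.org",
    "@graph": [
        {"@type": "Organization", "@id": f"{SITE}/#organization",
         "name": "Example Co", "url": SITE, "logo": f"{SITE}/logo.png",
         "sameAs": ["https://github.com/example"]},
        {"@type": "Person", "@id": f"{SITE}/#author",
         "name": "Jane Doe", "url": f"{SITE}/about"},
        {"@type": "WebSite", "@id": f"{SITE}/#website", "url": SITE,
         "potentialAction": {
             "@type": "SearchAction",
             "target": f"{SITE}/search?q={{search_term_string}}",
             "query-input": "required name=search_term_string"}},
        {"@type": "BreadcrumbList", "itemListElement": [
            {"@type": "ListItem", "position": 1,
             "name": "Research", "item": f"{SITE}/research"}]},
        {"@type": "Article", "@id": f"{SITE}/post/#article",
         "headline": "Example post",
         "author": {"@id": f"{SITE}/#author"},  # reference, not a copy
         "datePublished": "2026-04-27",
         "dateModified": "2026-04-27",
         "mainEntityOfPage": f"{SITE}/post"}],
}

print(json.dumps(graph, indent=2))
```

<p>Note that the <code>Article</code> author is an <code>@id</code> reference back to the <code>Person</code> entity, not a duplicated inline object - that is what lets a fetcher resolve the same author across every post.</p>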

<h2>What about schema generators and plugins?</h2>
<p>RankMath, Yoast, Schema Pro, and the Shopify schema apps all do roughly the same thing: they emit schema based on your CMS metadata. They tend to over-emit (lots of low-engagement types) but rarely under-emit. The fix is not to throw them out - it is to <strong>audit what they are emitting</strong>, suppress the noise, and tighten the high-engagement types.</p>
<p>Two specific things worth checking: (1) does your plugin emit a single <code>@graph</code> with internal <code>@id</code> references, or does it scatter multiple disjoint blocks across the page? (2) Does it server-render the schema, or inject it via JavaScript? If the answers are bad, that is the place to invest.</p>

<h2>The Bottom Line</h2>
<p>Schema is one of the highest-leverage AI visibility interventions, but only when it is targeted. The default of "add every schema type your CMS supports" produces noise. The smart default is five types, deployed correctly, with stable identity references and consistent content.</p>
<p>If you want a useful test: run your site through <a href="https://search.google.com/test/rich-results">the Google Rich Results Test</a>, look at the structured data block, and ask yourself: <em>which of these would I bet money an AI model actually uses?</em> If the answer is "none of them" or "I have no idea," you have a schema problem. The fix is rarely "more schema." The fix is usually "less, but better."</p>
]]></content:encoded>
    </item>
    <item>
      <title>Why llms.txt Is Dead In The Water</title>
      <link>https://mishamanko.com/research/why-llms-txt-is-dead</link>
      <guid isPermaLink="true">https://mishamanko.com/research/why-llms-txt-is-dead</guid>
      <pubDate>Mon, 27 Apr 2026 09:00:00 +0000</pubDate>
      <dc:creator>Misha Manko</dc:creator>
      <author>hello@mishamanko.com (Misha Manko)</author>
      <description>A deep dive on why llms.txt has zero adopters, zero requests, and zero measurable AI visibility impact - and what to do instead.</description>
      <content:encoded><![CDATA[
<p><em>The llms.txt file was supposed to be the robots.txt for AI. After 90 days of instrumented measurement across 47 sites, the verdict is in: nobody is using it. Not the AI companies, not the bots, not the publishers it was supposed to help. This is the deep dive into why.</em></p>

<p><strong>Sample:</strong> 47 sites · 9 industries · 90 days · 0 llms.txt requests</p>

<h2>What llms.txt Was Supposed To Be</h2>
<p>The proposal landed in 2024. The pitch: a small markdown file at the root of your site (<code>/llms.txt</code>) that tells large language models which content is most worth ingesting. Curated. Hierarchical. Author-friendly. The same elegant convention that <code>robots.txt</code> brought to search engines, but for the AI era.</p>
<p>It was a good idea on paper. Publishers liked it because it gave them control. SEO consultants liked it because it gave them something to sell. The W3C and IETF mailing lists got their version of the discussion. Tooling started showing up - generators, validators, WordPress plugins.</p>
<p>And then the AI companies were asked to honor it. None of them did.</p>

<h2>Finding 01 - Zero requests across 47 sites, 90 days</h2>
<p>Across the entire research network - 47 sites, 9 industries, 90 days of CDN logs filtered for every major AI user-agent - the <code>/llms.txt</code> endpoint was requested <strong>zero times</strong>. Not low. Not occasional. Zero.</p>
<p>Anyone with server logs can verify this in 10 minutes. <code>grep llms.txt access.log</code>. You will find nothing. If you find something, it is almost certainly your own validator hitting the file, or a curious human typing it into a browser, not a production AI fetcher.</p>
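<p>The same check can be scripted if you want to filter out your own validator hits. A minimal sketch over combined-format access log lines - the user-agent substrings are the commonly published AI fetcher names, and the sample lines are invented for illustration:</p>

```python
AI_BOTS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User",
           "ClaudeBot", "PerplexityBot", "Google-Extended")

def count_llms_txt_hits(log_lines) -> int:
    """Count requests to /llms.txt that come from known AI fetchers."""
    hits = 0
    for line in log_lines:
        # Substring matching is crude but matches how these logs are
        # usually grepped; tighten to your log format as needed.
        if "/llms.txt" in line and any(bot in line for bot in AI_BOTS):
            hits += 1
    return hits

# Invented sample: an AI bot fetching the feed, a human probing llms.txt.
sample = [
    '1.2.3.4 - - [27/Apr/2026] "GET /feed.xml HTTP/1.1" 200 "GPTBot/1.0"',
    '5.6.7.8 - - [27/Apr/2026] "GET /llms.txt HTTP/1.1" 404 "Mozilla/5.0"',
]
print(count_llms_txt_hits(sample))  # 0
```
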
<p>This is the single most measurable, most replicable finding in this entire research project. It is also the one that some agencies actively don't want their clients to know.</p>

<h2>Finding 02 - No major AI lab has adopted it</h2>
<p>Here is the public position of every major AI company on llms.txt as of April 2026:</p>
<ul>
  <li><strong>OpenAI</strong> - has not adopted llms.txt. ChatGPT bot fetches via <code>GPTBot</code> and <code>OAI-SearchBot</code>; both follow <code>robots.txt</code> and sitemaps. No reference to llms.txt in their crawler documentation.</li>
  <li><strong>Anthropic</strong> - has not adopted llms.txt. Claude's web fetchers (<code>ClaudeBot</code>, <code>Claude-Web</code>, <code>anthropic-ai</code>) follow <code>robots.txt</code>. No mention of llms.txt anywhere in their public documentation.</li>
  <li><strong>Google</strong> - has not adopted llms.txt. Their AI surfaces (Gemini, Search Generative Experience) crawl through <code>Google-Extended</code> and existing Googlebot infrastructure. No llms.txt support shipped or signalled.</li>
  <li><strong>Perplexity</strong> - has not adopted llms.txt. <code>PerplexityBot</code> hits feeds, sitemaps, and HTML pages directly. They have published crawler docs; llms.txt is not in them.</li>
  <li><strong>Mistral, Meta, Cohere</strong> - same. None have shipped llms.txt support, none have committed to a timeline, none reference the file in their crawler documentation.</li>
</ul>
<p>This is not a small detail. <strong>A standard with zero adopters is not a standard.</strong> It is a proposal. The publishers who built llms.txt files are sending signals into a void.</p>

<h2>Finding 03 - It duplicates what robots.txt and sitemap.xml already do</h2>
<p>The actual mechanism by which AI models discover and ingest content has three real layers, and llms.txt is not in any of them:</p>
<ul>
  <li><strong>Common Crawl</strong> and similar bulk web archives - the base training datasets for most foundation models. Common Crawl follows <code>robots.txt</code>. It does not look at llms.txt.</li>
  <li><strong>First-party crawlers</strong> per AI company - <code>GPTBot</code>, <code>ClaudeBot</code>, <code>PerplexityBot</code>. These follow <code>robots.txt</code>, then traverse via sitemap, RSS feeds, and internal links. They do not look at llms.txt.</li>
  <li><strong>On-demand fetchers</strong> for retrieval-augmented generation - <code>ChatGPT-User</code>, Claude's user-triggered web tool, Perplexity's question routing. These hit URLs that users specifically ask about. They do not look at llms.txt either.</li>
</ul>
<p>So what does work? <code>robots.txt</code> with explicit allow rules for AI user-agents. A valid <code>sitemap.xml</code> covering everything indexable. A real <code>RSS</code> or <code>Atom</code> feed. Schema markup on individual pages. These are what the bots in the wild are actually consuming - measured, logged, verifiable.</p>

<h2>Finding 04 - It has become a vector for bad advice</h2>
<p>The economic reality of llms.txt is that it created a new line item for SEO and "AI optimization" agencies to bill against. The typical pitch:</p>
<p><em>"We will implement an AI-readable site signal so your content gets prioritized by ChatGPT and Claude. This includes setting up your llms.txt file with curated content references and ongoing maintenance."</em></p>
<p>I have seen quotes ranging from <strong>$500 to $2,500</strong> for llms.txt setup as a standalone service, and seen it bundled into larger "AI visibility packages" at $5K-$15K where it is one of the headline deliverables.</p>
<p>None of these engagements produce measurable AI bot behavior change. They cannot, because the bots do not read the file. The work might still be billed as "complete" and the client might still pay, but the actual mechanism by which the deliverable is supposed to influence AI visibility does not exist.</p>
<p>If you have hired someone to set up llms.txt and they have presented it as an AI visibility intervention, you have grounds to ask them to demonstrate the mechanism. They cannot. The honest version of the engagement is "I will create a file that may matter someday if a standard takes hold." That is a much harder sell.</p>

<h2>Finding 05 - What you should actually do</h2>
<p>Strip llms.txt out of your AI visibility strategy and put the time into things that move the needle, in roughly this priority order:</p>
<ul>
  <li><strong>RSS / Atom feed</strong> - the single most-consumed endpoint by AI fetchers. Make sure yours exists, validates, includes full content (not just titles), and has accurate timestamps.</li>
  <li><strong>robots.txt with explicit allow rules</strong> - many sites are silently blocking AI bots without realizing. Explicit <code>Allow:</code> lines for <code>GPTBot</code>, <code>ClaudeBot</code>, <code>PerplexityBot</code>, <code>Google-Extended</code>, etc. close that gap.</li>
  <li><strong>sitemap.xml</strong> with accurate <code>lastmod</code> dates and full URL coverage. AI crawlers triangulate freshness from these dates.</li>
  <li><strong>Schema markup on real entities</strong> - <code>Organization</code>, <code>Person</code>, <code>Article</code>, <code>Product</code>. JSON-LD only. Avoid the long tail of vanity types.</li>
  <li><strong>Content freshness</strong> - a stale site goes invisible to AI bots within roughly two weeks of going dark. Regular publishing is the single biggest behavioural lever.</li>
</ul>
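<p>The robots.txt item is the easiest to verify programmatically: Python's standard library robot parser will tell you how a rule set treats each AI user-agent. A sketch against a deliberately bad example file - the blocked verdicts below describe this invented rule set, not any real site:</p>

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended")

# Example robots.txt that silently blocks AI bots: the blanket
# Disallow applies to every agent without its own group.
robots_txt = """\
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

for bot in AI_BOTS:
    allowed = parser.can_fetch(bot, "https://example.com/feed.xml")
    print(f"{bot}: {'allowed' if allowed else 'BLOCKED'}")
```

<p>For a live check, point <code>RobotFileParser.set_url()</code> at your own <code>/robots.txt</code> and call <code>read()</code> instead of <code>parse()</code>.</p>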
<p>None of these are exotic. They are the boring fundamentals that have always worked, with adjustments for which specific user-agents matter now. The reason they work is that they map to what the bots actually do, not to what the industry wishes the bots would do.</p>

<h2>Should you delete your llms.txt file?</h2>
<p>No. It costs nothing to leave it in place. It does not slow your site down, it does not confuse search engines, and there is a non-zero chance one or two AI companies eventually adopt it. The file is harmless.</p>
<p>What you should do is stop attributing AI visibility outcomes to it. Stop paying anyone who claims to "manage" it. Stop letting it crowd out time and budget that could go to RSS, schema, robots.txt, and content cadence.</p>

<h2>What Would Change My Mind</h2>
<p>Two things would reverse this position immediately:</p>
<ul>
  <li><strong>A major AI lab publicly committing</strong> to honoring llms.txt with documented user-agent behavior. Not a vague statement of support - documented crawler behavior, with a versioned spec they have implemented against.</li>
  <li><strong>Non-zero llms.txt requests in my logs</strong> - even a few hundred requests per quarter, from any of the major AI fetchers. So far: zero, on every site, every quarter.</li>
</ul>
<p>Until both of those conditions are met, treating llms.txt as a meaningful AI visibility intervention is a category error. It is theatre. Honest theatre, in some cases - the proposal authors clearly believed in it - but theatre nonetheless.</p>

<h2>The Bottom Line</h2>
<p>The llms.txt file is a thoughtful proposal that addresses a real problem. The trouble is that no AI company has adopted it, and there is no commercial pressure on them to do so. They get the content they need from <code>robots.txt</code>-respecting crawlers and Common Crawl-style bulk datasets. They have no incentive to add a new code path for a file with no meaningful publisher adoption to validate against.</p>
<p>If you want to control what AI sees, control your <code>robots.txt</code>, your <code>sitemap.xml</code>, your feeds, and your schema. Those are the levers that exist. llms.txt is not a lever. It is a wish.</p>
]]></content:encoded>
    </item>
    <item>
      <title>What AI Bots Actually Read On Your Website</title>
      <link>https://mishamanko.com/research/what-ai-bots-actually-read</link>
      <guid isPermaLink="true">https://mishamanko.com/research/what-ai-bots-actually-read</guid>
      <pubDate>Wed, 15 Apr 2026 09:00:00 +0000</pubDate>
      <dc:creator>Misha Manko</dc:creator>
      <author>hello@mishamanko.com (Misha Manko)</author>
      <description>Findings from a 47-site AI bot research network: which fetchers actually consume content, which pages they hit, and why llms.txt is dead in the water.</description>
      <content:encoded><![CDATA[
<p><em>I instrumented <strong>47 websites across 9 industries</strong> and watched what AI fetchers actually consume. Seven findings that contradict most of what the industry recommends - including the one about llms.txt that nobody wants to hear.</em></p>

<p><strong>Sample:</strong> 47 sites · 9 industries · 7 findings · 90 days</p>

<h2>Why This Report Exists</h2>
<p>Every SEO blog in 2026 has an opinion about "optimizing for AI search". Most of those opinions are recycled best-guesses with no underlying data. I wanted to know what AI fetchers actually do - not what vendors claim they do - so I instrumented my own research network of <strong>47 websites across 9 industries</strong> and logged every AI bot visit for 90 days.</p>
<p>The network spans law, medical, finance, SaaS, e-commerce, news, documentation, media, and personal sites - with deliberate variation in CMS, page count, schema implementation, and content freshness. Every AI user-agent gets logged at the CDN level, with full path, response code, and timing data.</p>
<p>A note on numbers: <em>everything below is expressed in percentages and relative rates</em>, not raw counts. My sample is finite and the absolute numbers would misrepresent the real question - which is which signals and formats matter proportionally, not which specific sites got the most traffic.</p>

<h2>Finding 01 - llms.txt has never been requested (~0%)</h2>
<p>Across all 47 sites, across 90 days, across every major AI user-agent logged - the <code>/llms.txt</code> endpoint was requested <strong>zero times</strong>. Not low, not rare - zero. This matches what anyone with server logs can verify in 10 minutes.</p>
<p>The llms.txt "standard" was proposed in 2024 as a convention for publishers to signal AI-readable content. Every major AI company was invited to adopt it. <strong>None have.</strong> Not OpenAI, not Anthropic, not Google, not Perplexity, not Mistral. The fetchers in the wild don't look for the file, don't read it when it's there, and don't change their behavior based on its contents.</p>
<p>If someone is charging you to implement llms.txt as "AI visibility optimization", they are either unaware of this fact or actively deceiving you. It costs nothing to add the file - just don't expect it to do anything, and don't build strategy around it.</p>

<h2>Finding 02 - RSS is the most-consumed endpoint (~40%)</h2>
<p>Roughly 40% of all AI fetcher requests across the network went to <strong>RSS and feed endpoints</strong> - <code>/feed</code>, <code>/rss</code>, <code>/feed.xml</code>, <code>/atom.xml</code>. This beat HTML pages, beat sitemaps, beat schema-heavy URLs, beat everything.</p>
<p>Why RSS? Because it's structured, predictable, chronological, and cheap to parse. When an AI system wants to know "what's new on this site?", feeds give them the answer in a fraction of the bandwidth of crawling HTML. Sites without feeds show up less often in AI crawl patterns. Sites with malformed feeds (broken XML, missing dates) show up even less.</p>
<p><strong>The actionable insight:</strong> make sure you have a valid feed, make sure it's in your robots.txt allowed paths, and make sure the dates in it are actually accurate. Most CMS defaults break this in subtle ways.</p>
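<p>The "valid feed with accurate dates" check can be sketched with the standard library alone - well-formed XML plus parseable RFC 822 pubDates is most of what matters here. The inline feed is a toy example:</p>

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

def check_feed(feed_xml: str) -> dict:
    """Parse an RSS feed and verify every item has a parseable pubDate."""
    root = ET.fromstring(feed_xml)        # raises on broken XML
    items = root.findall("./channel/item")
    dated = 0
    for item in items:
        pub = item.findtext("pubDate")
        if pub:
            parsedate_to_datetime(pub)    # raises on malformed dates
            dated += 1
    return {"items": len(items), "dated": dated}

feed = """<rss version="2.0"><channel><title>t</title>
<item><title>a</title><pubDate>Mon, 27 Apr 2026 09:00:00 +0000</pubDate></item>
</channel></rss>"""
print(check_feed(feed))  # {'items': 1, 'dated': 1}
```

<p>A full validator (like the W3C feed validator) checks much more, but if this sketch throws on your feed, the AI fetchers are seeing the same breakage.</p>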

<h2>Finding 03 - A ranked order of what bots consume</h2>
<p>The consumption hierarchy across all logged AI fetcher traffic:</p>
<ul>
  <li><strong>RSS / Feed</strong> - ~40%</li>
  <li><strong>HTML Articles</strong> - ~25%</li>
  <li><strong>Sitemap XML</strong> - ~14%</li>
  <li><strong>Schema / JSON-LD</strong> - ~9%</li>
  <li><strong>Images / Media</strong> - ~6%</li>
  <li><strong>PDF / Docs</strong> - ~4%</li>
  <li><strong>robots.txt</strong> (llms.txt: zero requests) - ~2%</li>
</ul>
<p>The single biggest surprise in this data is <strong>how much AI bots care about structured feeds and discovery endpoints</strong> (RSS, sitemaps, robots.txt) relative to actual content pages. They're probing the shape of your site first, then deciding what to read. If your discovery layer is broken, the content layer barely matters.</p>

<h2>Finding 04 - ChatGPT, Claude, and Perplexity behave differently</h2>
<p><strong>ChatGPT-User:</strong> On-demand. Visits pages specifically when users ask about them. Strong preference for <strong>fresh content</strong>. Hits HTML directly more than feeds.</p>
<p><strong>Claude-Web:</strong> Thoughtful, slower. Reads fewer pages but <strong>spends longer on each</strong>. Higher retry rate on 404s. Respects rate limits unusually well.</p>
<p><strong>PerplexityBot:</strong> Aggressive, broad. Highest total volume. Crawls feeds and sitemaps first, then hits everything. Most likely to trigger rate limits on smaller sites.</p>
<p>Treating "AI bots" as a single category is a mistake. Each fetcher has distinct behavior, distinct priorities, and distinct failure modes. Optimization that helps one can actively hurt another - for example, aggressive rate-limiting to control Perplexity can accidentally block ChatGPT's infrequent-but-user-driven visits.</p>

<h2>Finding 05 - Fresh content compounds. Stale sites disappear (~3x)</h2>
<p>Sites that publish new content at least <strong>weekly</strong> get AI bot visits at roughly 3x the rate of sites that publish less often. This isn't controversial - it mirrors traditional Googlebot behavior. What's notable is how quickly the effect decays.</p>
<p>A site that goes silent for 60+ days sees its AI bot traffic drop to near-baseline levels within roughly two weeks. The bots don't stop checking entirely, but they check far less often and they pull far fewer pages when they do.</p>
<p>The implication: AI visibility is not a one-time optimization. It's a <strong>freshness signal that needs feeding</strong>. If you can't commit to regular publishing, other signals (link quality, schema completeness, authoritative backlinks) can only compensate so much.</p>

<h2>Finding 06 - ~30% of sites are blocking AI bots by accident</h2>
<p>When I added a fresh set of comparable sites to the network (beyond the instrumented 47), about <strong>3 in 10 were silently blocking at least one major AI bot</strong> without the owner realizing. The usual culprits:</p>
<ul>
  <li><strong>robots.txt with overzealous rules</strong> - often copy-pasted from an older template that predates AI bots.</li>
  <li><strong>Cloudflare bot fight mode</strong> toggled on - blocks everything that isn't a verified search engine.</li>
  <li><strong>WAF rules blocking user-agents</strong> with "bot" in the string.</li>
  <li><strong>403s from aggressive rate-limiting</strong> - fine for human traffic, catastrophic for bots that expect retries.</li>
</ul>
<p>The scariest part: none of these show up in Google Search Console or typical monitoring tools. You have to check the logs directly. Most site owners discover the problem months after it started, if at all.</p>

<h2>Finding 07 - AI visibility is mostly technical, not content (~70%)</h2>
<p>Looking across everything above - the feeds, the sitemaps, the accidental blocks, the rate limits, the crawl budgets, the structured data - about <strong>70% of what determines your AI visibility is technical infrastructure</strong>. Only about 30% is content quality, content freshness, and content relevance.</p>
<p>This inverts the standard SEO industry framing, which spends 80% of its effort on content and 20% on technical. For AI search, that ratio is backwards. If your technical foundation is broken, <em>the best content in your industry won't save you</em>. If your technical foundation is solid, modest content can still earn solid AI visibility.</p>
<p>The good news: technical problems are fixable. Most of them are fixable in a single engagement. Content is an ongoing commitment. Infrastructure is a project.</p>

<h2>What This Means For You</h2>
<p>Seven findings. Most contradict what the industry recommends.</p>
<p>If you've read this far and you're thinking "I have no idea whether any of this applies to my site", that's not a content problem - that's a measurement problem. You can't know without pulling your own logs and running the tests.</p>
<p>You can do that work yourself. The methodology is straightforward: get CDN or server log access, filter for AI user-agents, check which endpoints they're hitting, verify your feeds and sitemaps parse correctly, and audit your robots.txt against the current list of AI bots. It takes a skilled engineer roughly 40-60 hours to do thoroughly the first time.</p>
<p>Or you can <a href="https://mishamanko.com/ai-visibility">book the audit</a>, where I do all of that for a fixed price and hand back a prioritized fix list. Either way - <em>stop guessing what AI sees</em>.</p>

<hr/>
<p><em>Read the original at <a href="https://mishamanko.com/research/what-ai-bots-actually-read">mishamanko.com/research/what-ai-bots-actually-read</a> · <a href="https://mishamanko.com/what-ai-bots-actually-read.pdf">Download as PDF</a></em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
