
THE SITEMAP STRATEGY.

Sitemap.xml accounts for ~14% of AI fetcher traffic (article N. 01). Yet most CMS-generated sitemaps are partially broken in ways that cost AI visibility. Five common defects, five fixes.

AI Bot Traffic: ~14%
Common Errors: 5
Setup Time: Minutes
Maintenance: Auto

WHY SITEMAPS MATTER FOR AI.

Sitemap.xml is the second-most-consumed endpoint by AI fetchers, after RSS. Bots use it to discover URLs and triangulate freshness. lastmod dates in the sitemap tell the bot "this page changed recently, fetch it." Bad lastmod data degrades the signal and makes the bot less efficient at finding fresh content on your site.

Most CMS-generated sitemaps have at least one of the five defects below. Fix them once, configure auto-regeneration, then never look at the sitemap again.

Defect 01.

INACCURATE LASTMOD.

The most common defect. Some CMS sitemaps set <lastmod> to today's date on every regeneration, regardless of whether the content actually changed. Result: every URL looks perpetually fresh, bots learn to ignore the signal, and the freshness benefit disappears.

Fix: <lastmod> should reflect the actual last-content-modification date. Most CMSes have a setting for this; check WordPress/RankMath settings, Shopify sitemap config, Ghost defaults.
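A quick way to spot the defect, assuming a standard single-file sitemap at /sitemap.xml (swap in your own domain): tally how many URLs share each lastmod value. If nearly every URL carries the same recent date, the CMS is stamping lastmod on regeneration.

    # Tally lastmod values; one dominant recent date = stamped on regeneration
    curl -s https://yoursite.com/sitemap.xml \
      | grep -o '<lastmod>[^<]*' \
      | sed 's/<lastmod>//' \
      | sort | uniq -c | sort -rn | head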

Defect 02.

INCOMPLETE URL COVERAGE.

Many sitemap generators miss whole subsets of pages: paginated archives, category/tag pages, custom post types, manually added standalone HTML. The result is partial discovery; AI bots end up relying on internal links to find what the sitemap should have surfaced directly.

Fix: audit the sitemap against the live page count. Count the URLs the sitemap declares:

    curl -s https://yoursite.com/sitemap.xml | grep -c '<loc>'

then compare against the actual number of public pages. A significant gap means the sitemap is missing content. Sub-sitemaps per content type, referenced from a sitemap-index.xml, is the cleanest pattern (index-aware counting is sketched below).
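If the root file is a sitemap index rather than a flat sitemap, count across the children instead. A rough sketch, assuming the index lives at /sitemap-index.xml and lists child sitemaps in <loc> elements:

    # Sum <loc> counts across every child sitemap listed in the index
    for s in $(curl -s https://yoursite.com/sitemap-index.xml \
                 | grep -o '<loc>[^<]*' | sed 's/<loc>//'); do
      curl -s "$s" | grep -c '<loc>'
    done | paste -sd+ - | bc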

Defect 03.

NOINDEX URLS IN SITEMAP.

Sitemaps should only list pages you want indexed. A sitemap entry for a page carrying <meta name="robots" content="noindex"> is a contradiction: the sitemap says "index this" while the page says "don't". Bots eventually penalise the sitemap for the noise.

Fix: exclude noindex pages from sitemap generation. Common offenders: thank-you pages, private proposals, internal admin URLs, search-result pages. Cross-check noindex meta vs sitemap URLs quarterly.
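A rough version of that quarterly cross-check, assuming a flat sitemap at /sitemap.xml (the regex is approximate; meta-tag attribute order varies between CMSes):

    # Flag sitemap URLs whose pages declare a robots noindex meta tag
    curl -s https://yoursite.com/sitemap.xml \
      | grep -o '<loc>[^<]*' | sed 's/<loc>//' \
      | while read -r url; do
          curl -s "$url" | grep -qi 'name="robots"[^>]*noindex' \
            && echo "noindex in sitemap: $url"
        done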

Defect 04.

MISSING SITEMAP INDEX.

Sites with 1,000+ pages should split their sitemap into multiple files with a sitemap-index.xml at the root. A single sitemap over 50,000 URLs or 50MB is invalid under the sitemap protocol. Many CMSes default to a single bloated sitemap.

Fix: split by content type (pages.xml, posts.xml, products.xml). Reference them from a sitemap-index.xml. Submit the index to Bing Webmaster Tools and Google Search Console; AI bots discover the children automatically.
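A minimal sitemap-index.xml skeleton, with illustrative filenames and dates:

    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap>
        <loc>https://yoursite.com/pages.xml</loc>
        <lastmod>2025-01-15</lastmod>
      </sitemap>
      <sitemap>
        <loc>https://yoursite.com/posts.xml</loc>
        <lastmod>2025-01-15</lastmod>
      </sitemap>
    </sitemapindex>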

Defect 05.

ROBOTS.TXT BLOCKING SITEMAP.

Rare but devastating: an overly broad Disallow rule in robots.txt (worst case, a blanket Disallow: /) that catches the sitemap path. The sitemap exists, but bots cannot fetch it.

Fix: an explicit Allow: /sitemap.xml in robots.txt, especially when other Disallow rules are in play, plus a Sitemap: line at the bottom of robots.txt pointing at the canonical sitemap URL.
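What that looks like in practice (the Disallow path is illustrative):

    # robots.txt
    User-agent: *
    Disallow: /admin/
    Allow: /sitemap.xml

    Sitemap: https://yoursite.com/sitemap.xml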

IMAGE AND VIDEO SITEMAPS.

Image and video sub-sitemaps are still useful for traditional search but rarely move the needle for AI bots specifically. Worth implementing if you have a media-heavy site (recipe sites, product catalogues, video-first content). Not worth implementing as an AI visibility intervention on a content site.

SUBMISSION.

Submit to: Bing Webmaster Tools (article N. 21), Google Search Console, and the IndexNow endpoint (which auto-pushes to Bing, Yandex, and others). Submission is one-time per sitemap; subsequent updates are auto-detected when bots revisit.
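The IndexNow ping itself is a single request. YOUR_INDEXNOW_KEY is a placeholder for a key you generate yourself and host as a text file at your site root for verification:

    # Push one changed URL to IndexNow (propagates to participating engines)
    curl -s "https://api.indexnow.org/indexnow?url=https://yoursite.com/changed-page/&key=YOUR_INDEXNOW_KEY"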

THE BOTTOM LINE.

Sitemaps are ~14% of AI bot traffic and almost always underconfigured. Five defects, five fixes, all at the CMS-config level. Audit your sitemap quarterly, and make lastmod accurate first: that single fix is the highest-leverage one.

Stop Guessing What AI Sees

MEASURE THE LEVERS THAT ACTUALLY EXIST.

If you want this methodology applied to your specific site (your real logs, your real citation data, your real fix list), the audit is the productized way to do it.