WHY SEMANTIC HTML MATTERS FOR AI.
AI fetchers are aggressive at extracting content but conservative about ambiguity. When a page has clean semantic structure (proper <article>, <section>, descriptive headings), the bot reliably identifies the body, the navigation, and the metadata. When a page is <div> soup, the bot guesses, sometimes incorrectly, and the citation goes to a competitor whose content is easier to extract with confidence.
The 12-item checklist below covers the items that move citation rate. It is not the full HTML semantics spec; it is the working subset that matters for AI extraction.
THE CHECKLIST.
Twelve items in priority order:
- 1. One <h1> per page. Reflects the page's primary topic. Not a logo. Not a section heading.
- 2. Heading hierarchy is monotonic. H1 -> H2 -> H3. No skipping levels. No styling-driven swaps (e.g., <h2> used for visual size with no logical role).
- 3. Body content in <article> or <main>. Not in <div class="content">. The semantic tag is the hint to the parser.
- 4. Sections in <section>. Each section has its own heading. Sections with no heading are aside content, not sections.
- 5. Navigation in <nav>. So the parser can ignore it for content extraction.
- 6. Header / footer in <header> / <footer>. Same reason.
- 7. Each page has a unique <meta name="description">. Not the site default duplicated everywhere.
- 8. Open Graph completeness. Per article N. 04 plus og:type matching the page (article, website, product).
- 9. Images have alt attributes. Descriptive, not "image123.jpg".
- 10. Links have descriptive anchor text. Not "click here". The anchor is a signal of what the destination is about.
- 11. Lists in <ul> / <ol>. Not <br>-separated paragraphs.
- 12. Code blocks in <pre><code>. Inline code in <code>. Helps AI distinguish prose from technical content.
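Several of the checklist items can be checked mechanically. The following is a minimal sketch, not a complete auditor: it uses Python's standard html.parser to test items 1, 3, 5, 7, and 9 on a single page. The names ChecklistAudit and audit are hypothetical, chosen for illustration, and the rules are deliberately strict (for instance, an image with an empty alt is flagged even if it is decorative). Uniqueness of the description across pages (item 7) needs a cross-page comparison that this sketch does not attempt.

```python
from html.parser import HTMLParser


class ChecklistAudit(HTMLParser):
    """Hypothetical helper: gathers facts for checklist items 1, 3, 5, 7, 9."""

    def __init__(self):
        super().__init__()
        self.h1_count = 0                  # item 1
        self.has_body_container = False    # item 3: <article> or <main>
        self.has_nav = False               # item 5
        self.has_meta_description = False  # item 7 (presence only)
        self.images_missing_alt = 0        # item 9

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "h1":
            self.h1_count += 1
        elif tag in ("article", "main"):
            self.has_body_container = True
        elif tag == "nav":
            self.has_nav = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            if (attrs.get("content") or "").strip():
                self.has_meta_description = True
        elif tag == "img" and not (attrs.get("alt") or "").strip():
            # Simplification: empty alt on decorative images is flagged too.
            self.images_missing_alt += 1


def audit(html):
    """Return a list of human-readable problems found in one page."""
    parser = ChecklistAudit()
    parser.feed(html)
    parser.close()
    problems = []
    if parser.h1_count != 1:
        problems.append(f"expected exactly one <h1>, found {parser.h1_count}")
    if not parser.has_body_container:
        problems.append("no <article> or <main> around the body content")
    if not parser.has_nav:
        problems.append("no <nav> element")
    if not parser.has_meta_description:
        problems.append("no <meta name='description'> with content")
    if parser.images_missing_alt:
        problems.append(f"{parser.images_missing_alt} <img> without alt text")
    return problems
```

Calling audit(page_html) on a saved copy of a page returns an empty list when all five checks pass, which makes it easy to run over a batch of templates and group pages by the problems they share.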
HEADING HIERARCHY.
Most sites get headings wrong in one of two ways: too many H1s (one per visual block, treated as styling) or skipped levels (H1 jumps to H4 because of designer-driven sizes).
AI extraction relies on the heading hierarchy to identify the document's logical structure. One H1 = the page is about this. H2s = main sections. H3s = subsections. Non-monotonic hierarchies confuse the parser; some bots fall back to character-count heuristics, which usually misidentify sidebars or related-content blocks as primary content.
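The two failure modes above (multiple H1s, skipped levels) are easy to detect in a script. This is a sketch under simple assumptions: it reads headings in document order with Python's standard html.parser and ignores sectioning context. HeadingCollector and heading_problems are hypothetical names used only for this example.

```python
import re
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Hypothetical helper: records heading levels in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if re.fullmatch(r"h[1-6]", tag):
            self.levels.append(int(tag[1]))


def heading_problems(html):
    """Return a list of heading-hierarchy problems for one page."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    levels = collector.levels
    problems = []
    if levels.count(1) != 1:
        problems.append(f"expected one h1, found {levels.count(1)}")
    if levels and levels[0] != 1:
        problems.append(f"first heading is h{levels[0]}, not h1")
    # Going deeper by more than one level is a skip; going back up
    # (h3 followed by h2) is fine, since it just closes a subsection.
    for prev, cur in zip(levels, levels[1:]):
        if cur > prev + 1:
            problems.append(f"h{prev} followed by h{cur} skips a level")
    return problems
```

For example, heading_problems("<h1>A</h1><h4>B</h4>") reports the jump from h1 to h4, while a page that goes h1, h2, h3, h2 produces no problems.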
If your design system uses heading sizes inconsistently, decouple visual size from semantic level via CSS: h2 { font-size: 28px; } for one section, h2.bigger { font-size: 36px; } for another. The semantic level stays correct; the visual hierarchy is purely a styling choice.
THE FRAMEWORK DEFAULTS THAT BREAK THIS.
Common framework defaults that produce non-semantic HTML by accident:
- Component libraries that wrap everything in <div> (e.g., default Material-UI, default Bootstrap layouts).
- Page builders (Elementor, Divi) that emit <section class="section"> with no heading - a sign of style-driven section labels.
- Headless CMS templates that flatten Markdown body content into <div> instead of <article>.
- Marketing-page generators (some Webflow templates, some Framer setups) that produce nested <div> trees with no semantic tags at all.
BEFORE / AFTER EXAMPLE.
Before: <div class="main"><div class="hero"><div class="title">Welcome</div><div class="text">Long body...</div></div></div>
After: <main><article><header><h1>Welcome</h1></header><p>Long body...</p></article></main>
Same visual result with appropriate CSS. Massively different parser output. AI fetchers extract the second cleanly; with the first, they sometimes get it right and sometimes attribute the title to a navigation block.
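To make the difference concrete, here is a toy extractor run against both snippets. It is an illustrative assumption, not how any particular AI fetcher works: it takes the title from the first <h1> and the body from <p> text inside <article> or <main>, and returns nothing when those tags are absent. NaiveExtractor and extract are hypothetical names.

```python
from html.parser import HTMLParser


class NaiveExtractor(HTMLParser):
    """Toy extractor (an assumption for illustration, not a real bot)."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, outermost first
        self.title = None
        self.body = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if "h1" in self.stack and self.title is None:
            self.title = text
        elif "p" in self.stack and ("article" in self.stack or "main" in self.stack):
            self.body.append(text)


def extract(html):
    """Return (title, body_text) using only semantic tags as hints."""
    parser = NaiveExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, " ".join(parser.body)


before = ('<div class="main"><div class="hero"><div class="title">Welcome</div>'
          '<div class="text">Long body...</div></div></div>')
after = ('<main><article><header><h1>Welcome</h1></header>'
         '<p>Long body...</p></article></main>')

print(extract(before))  # (None, '')
print(extract(after))   # ('Welcome', 'Long body...')
```

With the <div> version the semantic hints are simply missing, so a real parser has to fall back on class names or heuristics, which is where the guessing described above comes from.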
THE BOTTOM LINE.
Twelve checklist items, four to eight hours of refactoring on a typical site, junior-developer skill level. Pays back in extraction reliability across all four AI surfaces. Run the checklist against your top 20 most-trafficked pages first. Most templates fix in batch once you find the defect pattern.