SEMANTIC HTML FOR AI.

Twelve items, prioritised by AI extraction impact, with before/after examples. The implementation is junior-developer friendly; the effect on citation rate is senior-level. Includes heading hierarchy specifics.

Checklist Items: 12
Quick Wins: 5
Time Estimate: 4-8 hr
Skill Level: Junior

WHY SEMANTIC HTML MATTERS FOR AI.

AI fetchers are aggressive at extracting content but conservative about ambiguity. When a page has clean semantic structure (proper <article>, <section>, descriptive headings), the bot reliably identifies the body, the navigation, and the metadata. When a page is <div> soup, the bot guesses, sometimes incorrectly, and the citation goes to a competitor whose page can be extracted with more confidence.

The 12-item checklist below covers the items that move citation rate. It is not the full HTML semantics spec; it is the working subset that matters for AI extraction.

THE CHECKLIST.

Twelve items in priority order:

Finding 01.
Heading specifics

HEADING HIERARCHY.

Most sites get headings wrong in one of two ways: too many H1s (one per visual block, treated as styling) or skipped levels (H1 jumps to H4 because of designer-driven sizes).

AI extraction relies on the heading hierarchy to identify the document's logical structure. One H1 says what the page is about. H2s mark the main sections. H3s mark subsections. Hierarchies with multiple H1s or skipped levels confuse the parser; some bots fall back to character-count heuristics, which usually misidentify sidebars or related-content blocks as primary content.
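The two defects described above (more than one H1, skipped levels) can be checked mechanically. The following is a minimal sketch using only Python's standard library; the names HeadingCollector and check_headings are hypothetical, and the rules encode this article's guidance rather than the behaviour of any particular AI fetcher.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collects heading levels (1-6) in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so "H2" arrives as "h2".
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def check_headings(html):
    """Return a list of problems; an empty list means the outline is clean."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    levels = collector.levels
    problems = []

    # Rule 1 (assumption from the article): exactly one h1 per page.
    h1_count = levels.count(1)
    if h1_count != 1:
        problems.append(f"expected exactly one h1, found {h1_count}")

    # Rule 2: going deeper by more than one level is a skip (h1 -> h4).
    # Moving back up by any amount (h3 -> h2) is allowed.
    previous = 0
    for level in levels:
        if level > previous + 1:
            if previous == 0:
                problems.append(f"document starts at h{level}")
            else:
                problems.append(f"skipped level: h{previous} -> h{level}")
        previous = level
    return problems
```

For example, check_headings("<h1>A</h1><h4>B</h4><h1>C</h1>") reports both a duplicate H1 and the jump from h1 to h4, while a page whose headings run h1, h2, h3, h2 returns an empty list.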

If your design system uses heading sizes inconsistently, decouple visual size from semantic level via CSS. For example, h2 { font-size: 28px; } can style one section and h2.bigger { font-size: 36px; } another. The semantic level stays correct; the visual hierarchy is purely a styling choice.

THE FRAMEWORK DEFAULTS THAT BREAK THIS.

Common framework defaults that produce non-semantic HTML by accident:

BEFORE / AFTER EXAMPLE.

Before: <div class="main"><div class="hero"><div class="title">Welcome</div><div class="text">Long body...</div></div></div>

After: <main><article><header><h1>Welcome</h1></header><p>Long body...</p></article></main>

Same visual result with appropriate CSS. Massively different parser output. AI fetchers extract the second cleanly; with the first, they sometimes get it right and sometimes attribute the title to a navigation block.
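To make the difference in parser output concrete, here is a toy extractor, assuming a parser that trusts only semantic landmarks: <h1> for the title and <p> inside <article> for the body. LandmarkExtractor and extract are hypothetical names, and real AI fetchers use more elaborate heuristics; the sketch only illustrates why the second snippet leaves nothing to guess.

```python
from html.parser import HTMLParser

# Elements with no closing tag; they must not stay on the open-tag stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class LandmarkExtractor(HTMLParser):
    """Toy extractor that relies on semantic tags only."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tags, outermost first
        self.title = None
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignores stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        current = self.stack[-1]
        if current == "h1" and self.title is None:
            self.title = text
        elif current == "p" and "article" in self.stack:
            self.paragraphs.append(text)


def extract(html):
    """Return (title, body) as found through semantic landmarks."""
    parser = LandmarkExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, " ".join(parser.paragraphs)


BEFORE = ('<div class="main"><div class="hero"><div class="title">Welcome</div>'
          '<div class="text">Long body...</div></div></div>')
AFTER = ('<main><article><header><h1>Welcome</h1></header>'
         '<p>Long body...</p></article></main>')
```

Under these assumptions, extract(AFTER) returns ("Welcome", "Long body...") while extract(BEFORE) returns (None, ""): with the <div> version there is no landmark to anchor on, so a real bot has to fall back to guessing from class names or text length.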

THE BOTTOM LINE.

Twelve checklist items, four to eight hours of refactoring on a typical site, junior-developer skill level. It pays back in extraction reliability across all four AI surfaces. Run the checklist against your top 20 most-trafficked pages first. Most templates can be fixed in batch once you find the defect pattern.
