WHAT THE STUDY DID, AND WHAT IT CLAIMED.
In May 2026, Ahrefs published a study titled along the lines of "adding schema did not boost AI citations." The team (Louise Linehan, reviewed by Ryan Law) pulled around six million URLs cited in AI Overviews, identified 1,885 pages that transitioned from no JSON-LD to having JSON-LD between August 2025 and March 2026, and matched each treated page against three control pages from different domains with similar pre-period citation levels. They then ran a matched difference-in-differences test on citation counts 30 days before vs 30 days after the schema was added.
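To make the design concrete, here is a minimal sketch of a matched difference-in-differences computation of the kind described, assuming a toy table of pre- and post-window citation counts. The column names, match structure, and numbers are mine for illustration, not Ahrefs' actual schema or data.

```python
import pandas as pd

# Toy input: one row per page, with citation counts in the 30-day pre- and
# post-windows, a treated flag, and a match_id linking each treated page to
# its three controls. All names and numbers are invented for illustration.
df = pd.DataFrame({
    "match_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "treated":  [1, 0, 0, 0, 1, 0, 0, 0],
    "pre":      [150, 148, 152, 149, 210, 205, 212, 208],
    "post":     [142, 151, 150, 148, 215, 211, 214, 210],
})

# Per-page change across the treatment date.
df["delta"] = df["post"] - df["pre"]

# Difference-in-differences per match group: the treated page's change minus
# the mean change of its matched controls.
treated_delta = df[df["treated"] == 1].groupby("match_id")["delta"].mean()
control_delta = df[df["treated"] == 0].groupby("match_id")["delta"].mean()
did = treated_delta - control_delta

print(f"Mean DiD estimate: {did.mean():+.2f} citations per page")
```

The design's appeal is that each control group absorbs whatever moved the whole surface over the same window, so only the treated-vs-control gap is attributed to the treatment.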
Headline result: Google AIO citations declined by 4.6% relative to controls (statistically significant, roughly 1-in-2,500 odds of the result being noise, i.e. p ≈ 0.0004). Google AI Mode and ChatGPT showed +2.4% and +2.2% respectively, neither statistically distinguishable from zero. Headline conclusion as carried in the article and the social posts: schema does not lift citations on any AI surface.
This is the most rigorous public study of schema-vs-AI citations I have seen in 2026. The design is honest, the authors flag most of their own limitations, and the result is reported without spin. I want to be clear up front that this is not a hit piece on the study. It is a piece about the gap between what the data can support and what the headline implies, because that gap matters when readers act on the conclusion.
THEY STUDIED PAGES THAT WERE ALREADY WINNING.
From the study itself: "Every page in the dataset had 100+ AI Overview citations in February 2025, before any schema was added." Every page in the treatment and control arms was already being cited at high volume before the study window opened.
Schema's primary mechanism, especially for entity disambiguation and rich result eligibility, is most likely to help pages that are not yet recognised. The 100-plus citation floor screens out the entire population the intervention is theoretically supposed to lift. It is closer to a study of "do business cards help job-seekers who already have a job" than a study of whether business cards help job-seekers in general.
The authors do acknowledge this in their framing: "For pages already getting picked up, our data suggests that adding schema isn't going to push it higher." But the headline as carried into the discourse drops the caveat and reads as "schema does not help citations." Two very different claims. The first is defensible. The second is not what the data can support.
PAGES THAT ADD SCHEMA CHANGE OTHER THINGS TOO.
The authors flag this themselves: "Pages that add JSON-LD often change other things at the same time (e.g. links, content, technical fixes)." The matching strategy controls for pre-period citation level only. It does not control for content quality changes, backlink velocity, on-page rewrites, redesigns, freshness updates, technical fixes shipped in the same release, or anything else that the page's owners may have done concurrently.
A real causal estimate of schema's effect would require propensity-score matching across these dimensions, or, better, a true randomised experiment where some pages get schema and others get an identical treatment-shaped placebo. Neither is what was run. What was run is closer to an observational comparison with one control variable.
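For contrast, here is a minimal sketch of what propensity-score matching over those extra dimensions could look like, using scikit-learn. The covariates and sample data are invented for illustration; this is what a stronger design might do, not what the study did.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Hypothetical covariates per page: pre-period citations, backlink velocity,
# a content-rewrite flag, domain-authority score. None of these were in the
# study's matching; the point is holding them constant across arms.
X = rng.normal(size=(1000, 4))
treated = rng.integers(0, 2, size=1000)   # 1 = page added JSON-LD

# Step 1: estimate each page's probability of treatment from the covariates.
propensity = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Step 2: pair each treated page with the control whose propensity score is
# closest, so the comparison holds the modelled confounds roughly constant.
control_idx = np.where(treated == 0)[0]
nn = NearestNeighbors(n_neighbors=1).fit(propensity[control_idx].reshape(-1, 1))
_, nearest = nn.kneighbors(propensity[treated == 1].reshape(-1, 1))
matched_controls = control_idx[nearest.ravel()]
```

The estimate that falls out of matched pairs like these is only as good as the covariates in X, but it is a meaningfully stronger claim than single-variable matching supports.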
The honest read is that the −4.6% AIO movement is a correlated outcome of "adding schema and probably doing other things at the same time," not an isolated estimate of "schema's effect on AIO citations." The authors are careful enough to say so. The headline is not.
DIFFERENT DOMAINS, SIMILAR CITATIONS, NOT THE SAME THING.
Control matching ran "3 control URLs (from different domains, with similar pre-period citation levels) that had never added JSON-LD." Two pages can both have 150 AIO citations in February 2025 and live in completely different visibility regimes. One could be on wikipedia.org coasting on entity prestige. The other could be on a mid-tier blog with a single viral piece. They cite identically and behave nothing alike.
A stronger control strategy would match on topic cluster, domain authority quartile, content age, and post-period crawl frequency at minimum. Single-variable matching on "citations" without those layers leaves a lot of variance unaccounted for. Whatever moved the treatment arm 4.6% relative to controls may simply reflect that the treated pages and the controls were not as comparable as the matching assumed.
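As a sketch of what that layered matching could look like in practice: build a composite key so that candidate controls must agree on every stratum, not just on citation volume. Every column and bucket choice here is an assumption, not the study's.

```python
import pandas as pd

# Hypothetical page table. The study matched on pre-period citations only;
# this adds the layers argued for above.
pages = pd.DataFrame({
    "url": ["a.com/x", "b.com/y", "c.com/z", "d.com/w"],
    "pre_citations": [150, 148, 152, 310],
    "topic_cluster": ["seo", "seo", "travel", "seo"],
    "da_quartile": [2, 2, 2, 4],
    "content_age": ["1-2y", "1-2y", "<1y", "1-2y"],
})

# Composite key: candidates only match if they agree on every stratum,
# with citations coarsened into 50-citation bands rather than matched raw.
pages["match_key"] = (
    pages["topic_cluster"]
    + "|" + pages["da_quartile"].astype(str)
    + "|" + pages["content_age"]
    + "|" + (pages["pre_citations"] // 50).astype(str)
)
print(pages[["url", "match_key"]])
```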
None of this is unique to this study. Most large-scale observational SEO research has the same problem. But the strength of the causal claim should scale with the strength of the matching, and the matching here is light.
ARTICLE + FAQ + PRODUCT + HOWTO IS NOT ONE TREATMENT.
From the study: "We pooled all schema types together. Article, FAQ, Product, HowTo, Organization. It's possible some types help more than others." These schema types do completely different work. Product schema signals commerce-surface eligibility. FAQ schema signals question-answer pairing. HowTo signals task structure. Organization schema feeds the entity graph.
Pooling them and asking "does schema work" is the structural equivalent of pooling B12, zinc, fish oil, and creatine and asking "do supplements work." The answer at the pooled level is meaningless. Each schema type is a separate hypothesis with a separate likely effect size and a separate population that benefits.
The authors flag this and note it deserves a separate analysis. That separate analysis is the one practitioners actually need before drawing conclusions. The pooled result tells you the average across a basket of interventions that should not be averaged.
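For illustration, the per-type cut would be a one-line stratification if per-page effect estimates and schema types were available. The table below is invented; only the shape of the analysis is the point.

```python
import pandas as pd

# Hypothetical per-page effect estimates with schema type recorded instead
# of pooled. 'did' is each page's matched difference-in-differences result.
results = pd.DataFrame({
    "schema_type": ["Article", "FAQ", "Product", "FAQ", "HowTo", "Article"],
    "did": [-0.05, 0.08, 0.12, 0.03, -0.01, -0.06],
})

# Per-type effect and sample size. A pooled mean would average away exactly
# the between-type differences this table is meant to expose.
print(results.groupby("schema_type")["did"].agg(["mean", "count"]))
```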
ONE SIGNIFICANT NEGATIVE, TWO NULLS. THAT IS A CONFOUND PATTERN.
The three platform results: Google AIO −4.6% (significant), AI Mode +2.4% (null), ChatGPT +2.2% (null). If schema truly had zero causal effect, you would expect roughly symmetric noise across all three. One significant negative beside two slightly positive nulls looks less like schema actively suppressing AIO citations and more like one platform's measurement being more sensitive to the confounds that travel with a schema addition.
Google's AIO citation logic is the most opaque and most actively tweaked of the three. Concurrent page changes (the very confound the authors flag) are more likely to register there. The AI Mode and ChatGPT surfaces have different selection criteria and may simply be insulated from whatever moved AIO over the same window.
The authors are honest about this: "The fact that treated pages declined slightly more suggests schema had a small negative effect - but it could also reflect other factors. We can't tell which one it is from this data alone." The piece's framing is appropriately hedged. The discourse around it is not. The 4.6% gets quoted as if it were a measurement of schema. It is a measurement of "the bundle of things that happen to a page when its owners add schema."
30 DAYS IS NOT ENOUGH FOR AN INDEX EFFECT.
Schema effects propagate through recrawl, parsing, validation, eligibility for new surface types, and entity graph updates. None of those are instantaneous. Recrawl alone can take days to weeks on lower-authority domains, and entity updates can take longer still. A 30-day post-treatment window may not catch the lift even if it exists.
The study did not separately verify that the schema was being parsed correctly by Google or by the other engines under test. A <script type="application/ld+json"> tag on a page is a necessary condition for schema to be consumed, not a sufficient one. Mis-typed properties, missing required fields, validation errors, JavaScript injection patterns the engine doesn't render: any of these mean the schema is present from the crawler's perspective but useless from the consumer's.
The study explicitly excludes one subset (JavaScript-injected schema) but does not screen for validation correctness on the rest. The treatment group is technically "pages where the bytes appeared" rather than "pages where the structured data was consumed." The difference matters at the margin, especially over a 30-day window.
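As a sketch of the missing screening step, here is a minimal checker that verifies the JSON-LD on a page actually parses and carries @context and @type, rather than merely existing as bytes. The regex-based extraction is deliberately crude and only covers the static-HTML case a real crawler would handle more robustly.

```python
import json
import re
import urllib.request

# Crude extraction of <script type="application/ld+json"> blocks. A real
# pipeline would use an HTML parser and render JavaScript-injected schema.
LD_JSON_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)

def check_jsonld(url: str) -> list[dict]:
    """Report whether each JSON-LD block on a page parses and carries the
    minimum fields an engine needs, rather than merely existing as bytes."""
    html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
    reports = []
    for raw in LD_JSON_RE.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            reports.append({"parses": False, "error": str(exc)})
            continue
        nodes = data if isinstance(data, list) else [data]
        for node in nodes:
            if not isinstance(node, dict):
                continue
            reports.append({
                "parses": True,
                "type": node.get("@type"),
                "has_context": "@context" in node,
            })
    return reports
```

Even a filter this shallow would tighten the treatment definition from "bytes appeared" to "bytes appeared and parse"; validation against each type's required properties would tighten it further.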
NOT ALL CITATIONS COME FROM THE SAME QUERIES.
The study tracked total citation counts in 30-day windows. It did not stratify by query type. Schema's biggest documented lift is on query-type matches: FAQ schema disproportionately helps FAQ-style queries, Product schema helps comparison and "best" queries, HowTo helps task and tutorial queries.
An aggregate citation count across a mixed query basket can show net zero movement even when a specific schema type is meaningfully lifting a specific query type, because the lift is being averaged with a basket of queries the schema is irrelevant to. Without stratification, you cannot distinguish "schema does nothing" from "schema helps the queries it's designed for and the rest dilute the signal."
This is the kind of cut that would require knowing what queries each citation came from, which is exactly the kind of data only Google itself has. The study is constrained by what was available. The constraint should bound the conclusion, not the headline.
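Purely to show the shape of the missing cut: if per-citation query types existed (they don't, outside Google), the stratified read would look something like the sketch below. Every value here is invented.

```python
import pandas as pd

# Invented citation log with query types attached; no public dataset has
# this. 'delta' is the page's post-minus-pre citation change on that query.
citations = pd.DataFrame({
    "schema_type": ["FAQ", "FAQ", "FAQ", "Product", "Product", "Product"],
    "query_type":  ["faq", "informational", "navigational",
                    "comparison", "comparison", "faq"],
    "delta":       [0.15, 0.01, -0.02, 0.18, 0.14, 0.00],
})

# Flag citations where the query type matches what the schema is built for.
MATCHED = {"FAQ": "faq", "Product": "comparison"}
citations["on_target"] = (
    citations["query_type"] == citations["schema_type"].map(MATCHED)
)

# The pooled aggregate hides the split; the stratified view shows it.
print("pooled mean delta:", round(citations["delta"].mean(), 3))
print(citations.groupby("on_target")["delta"].mean())
```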
CITATION COUNT IS A VANITY METRIC.
Citation count weights all citations equally. A citation on a high-intent commercial query can be worth a hundred citations on long-tail informational queries. A citation that names the brand as the answer is worth more than a citation buried in a footnote. None of this is captured in "how many times the page was cited."
A real measure of "AI visibility lift from schema" would weight by query intent, position within the cited list, and whether the brand was named in the answer text or just appeared in the source list. Building that is hard. Building it across three platforms with millions of URLs is harder. So the study uses the proxy that was available: raw citation count. Fair choice, but the metric's noise floor is much higher than the 2-4% movements being attributed to it.
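A sketch of what a weighted citation value might look like. The weights are invented for illustration; calibrating them against real outcomes is the hard part the paragraph above describes.

```python
# Invented weights; calibrating these against real outcomes is the hard part.
INTENT_WEIGHT = {"commercial": 5.0, "informational": 1.0, "navigational": 0.5}

def citation_value(intent: str, position: int, brand_in_answer: bool) -> float:
    """Weight one citation by query intent, rank within the cited list, and
    whether the brand is named in the answer text vs only the source list."""
    weight = INTENT_WEIGHT.get(intent, 1.0)
    weight *= 1.0 / position                  # position 1 outweighs position 10
    weight *= 2.0 if brand_in_answer else 1.0
    return weight

# A commercial citation at position 2 with the brand named in the answer...
print(citation_value("commercial", 2, True))      # 5.0 * 0.5 * 2.0 = 5.0
# ...vs an informational citation at position 8, source list only.
print(citation_value("informational", 8, False))  # 1.0 * 0.125 = 0.125
```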
When a study's headline movement is well inside the noise floor of the metric itself, the conclusion should be framed as "no clear signal" rather than "no effect." The Ahrefs piece is mostly careful here, but the social-channel framing flattens the distinction.
THE STUDY IS HOUSE-BUILT. READ ACCORDINGLY.
The byline is Louise Linehan (Content Marketer at Ahrefs), reviewed by Ryan Law (Director of Content Marketing at Ahrefs). Both are Ahrefs staff. The data and the framing come from a single commercial vendor whose toolset emphasises backlinks and content, not structured data. This is not a disqualifying problem. It is a lens.
Ahrefs has a commercial interest in a clear, simple narrative for the AI search era, and "schema does not matter, fundamentals do" is a narrative that flatters the parts of their toolset that are strongest. None of this means the data is dishonest. It does mean the choice of question ("does schema work?"), the choice of pooled-types analysis, and the headline framing all happened inside a commercial context that has an opinion about the answer.
The right move for readers is to treat the data as good-faith but to wait for independent replication before treating the conclusion as established. As of mid-2026, no second large-scale study with a matched DiD design has replicated or contradicted this one. One observational study, however carefully run, is one observation.
WHAT THE STUDY CAN AND CANNOT CLAIM.
Reading the data without the editorial overlay, here is what the Ahrefs study supports and what it does not.
- Can claim: For pages already cited 100+ times in AI Overviews, adding pooled-type JSON-LD within 30 days did not measurably increase citation count and may have correlated with a small AIO dip that is confounded with concurrent page changes.
- Can claim: The "schema is a free citations boost" pitch sold by some GEO consultants is not supported by this data and probably overstates the size of any effect.
- Cannot claim: Schema does not help AI citations in general. The study's sample explicitly excludes the population most likely to benefit (pages with low or zero prior visibility).
- Cannot claim: Any specific schema type (FAQ, Product, HowTo, Organization) does or does not work. The pooled design cannot separate them.
- Cannot claim: Schema is irrelevant to entity recognition or rich-surface eligibility, neither of which were measured.
- Cannot claim: Effects would be the same in a 90-day or 180-day window. The 30-day cut is probably too short for an index-side intervention.
WHAT PRACTITIONERS SHOULD TAKE FROM IT.
The data is genuinely useful, just not in the way the headline frames it. The right reads:
- If your page is already getting AI citations, do not expect adding schema to be the lever that 10x's them. Other interventions (content quality, internal linking, entity reinforcement) are likely higher-leverage at that stage.
- If your page is not getting AI citations yet, this study tells you nothing. The population it studied was not yours. Schema may still be the difference between being eligible for a surface and not.
- Treat schema as eligibility, not boost. Schema's job is to make a page parseable in ways the engine can use. It is necessary infrastructure for some surfaces, not a citation multiplier on already-cited pages. See N. 04 for what schema actually does.
- Match the intervention to the page's stage. Pre-visibility pages need entity signals and rich-surface eligibility. Already-visible pages need content quality and internal-graph structure. A study sampled only on the second group cannot speak to the first.
- Wait for replication before treating any single study as gospel. One observational study, no matter how careful, is one observation. The schema-vs-AI-citations question is not settled by this paper, and was not before it.
THE BOTTOM LINE.
The Ahrefs study is one of the better pieces of public AI search research published in 2026. The design is honest, the authors flag their limits, and the framing inside the article is appropriately hedged. The problem is not the study. The problem is the gap between what the data can support and how the conclusion gets carried in tweets, threads, and downstream content articles that flatten the caveats and ship the headline.
Schema's role in AI search is still open. This study constrains one slice of the question: it tells you adding pooled-type schema to already-cited pages does not appear to lift them further over 30 days. That is a useful but narrow result. Do not read it as a verdict on schema in general, because the data cannot support that read.
And if you are tempted to remove existing schema from your site based on a 4.6% AIO dip in a confounded matched DiD on already-winning pages: don't.