THE PITCH AND THE PROBLEM.
A wave of "AI visibility tracking" tools shipped in 2025-2026. OtterlyAI, Profound, AthenaHQ, and a long tail of newer entrants. The pitch is the same across all of them: a single dashboard number for how your brand performs across ChatGPT, Claude, Perplexity, and Google AI Overviews. Finally, a metric for the new search era.
The dashboard does not measure what the marketing claims. The technology underneath these tools cannot support the metric they sell, for six independent reasons that compound when stacked. This article walks through each one and lands on what to do instead.
Caveat up front: I am not claiming the people building these tools are dishonest. Most of them are technically competent and aware of the limitations privately. The product story has just outrun the measurement reality. Investors, consultants, and customers want a number. The category will not exist without one. So the number gets shipped anyway.
ONE OBSERVATION IS NOISE, NOT DATA.
Large language models are stochastic by design. Ask ChatGPT the same question at 09:00 and again at 09:01 - different answers. Temperature sampling, top-p truncation, server-side routing between model variants, undocumented A/B experiments. All of this is variance the user never sees and the dashboard never accounts for.
To extract a real signal from this noise you would need to sample the same query dozens of times per cycle and average. That costs serious money - LLM inference at scale is not free. Most commercial tools sample thinly - one to three trials per query per cycle - and report the result as a stable score. It is not stable. The dashboard is showing you one roll of a noisy die and labelling it a measurement.
If a tool will not disclose its sample size per query per cycle, the score it produces is closer to a single observation than a measurement. Ask the question. The answer is almost always small.
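The effect of thin sampling is easy to demonstrate. The sketch below assumes a query with a fixed underlying citation probability of 0.4 (a made-up number for illustration) and compares a two-trials-per-cycle tool against a fifty-trials-per-cycle one over eight cycles:

```python
import random

def observed_rate(true_p: float, n_trials: int, rng: random.Random) -> float:
    """Fraction of n_trials stochastic responses that cite the brand."""
    return sum(rng.random() < true_p for _ in range(n_trials)) / n_trials

rng = random.Random(0)
TRUE_P = 0.4  # assumed stable underlying citation probability for one query

# A thin-sampling tool: 2 trials per cycle, 8 weekly cycles.
thin = [observed_rate(TRUE_P, 2, rng) for _ in range(8)]

# The same unchanged signal sampled 50 times per cycle.
dense = [observed_rate(TRUE_P, 50, rng) for _ in range(8)]

print("thin :", thin)   # can only ever read 0.0, 0.5, or 1.0
print("dense:", dense)  # clusters near the true 0.4
```

Nothing about the underlying signal changes between cycles, yet the thin series swings wildly - exactly the kind of movement a dashboard would narrate as a trend.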
REAL USERS CARRY CONTEXT. TOOLS DO NOT.
Your prospects are not querying ChatGPT the way these tools simulate. Real users have memory on, custom instructions configured, prior conversation context from earlier in the session, system prompts from whatever wrapper app they are using, and personalisation layered on by the platform itself.
The tracking tool fires naked queries from a fresh session, no memory, no instructions, no history. Whatever it is measuring, it is not the product your customer is consuming. It is measuring a stripped-down version of the same surface with substantially different output behaviour.
The gap matters. Personalisation alone can flip which sources get cited for the same query, because the engine optimises for what it believes the specific user wants. A clean-session measurement and a real-user experience are not the same data point.
THE TOOL'S MODEL IS NOT YOUR CUSTOMER'S MODEL.
ChatGPT silently routes queries between underlying models based on user plan tier, query complexity, server load, and ongoing experiments. Perplexity has its own model selector that varies per query type. Claude's web product makes its own model decisions per request. None of this is exposed to the user.
The tracking tool, querying through an API or a stripped session, hits one model behaviour. Your prospect, hitting the consumer surface, may hit a different model behaviour for the same query. You are measuring product A; the customer experiences product B; the dashboard presents the score as if it covers both.
There is no public mapping between consumer-tier routing decisions and API-tier model behaviour. None of the major providers publishes this, and the routing changes over time without notice.
MODELS GET SWAPPED. YOU CANNOT TELL.
Foundation models are refreshed, fine-tuned, swapped, and rolled back without external notice. OpenAI rolls a new ChatGPT checkpoint quietly on a Tuesday. Anthropic ships an updated Claude variant overnight. Perplexity changes its underlying model selector silently. None of these are announced to your tracking dashboard.
When your "visibility score" jumps 14% on a Tuesday, you cannot tell whether the cause was your content strategy or a quiet model swap by the provider. The tool will obviously attribute it to your activity, because that is the story that keeps you subscribed. But the attribution is fiction.
This problem compounds with the stochasticity problem above. Even if you trusted the tool's sampling, you cannot disentangle your own changes from the platform's. The dashboard cannot tell you what moved the number, and neither can the vendor.
500 PROMPTS. MILLIONS OF QUERIES. NOT MEASUREMENT.
Tracking tools query a curated prompt set, usually 200 to 1,000 queries chosen by the vendor as "representative" of your category. Real user query distributions are stupidly long-tail. Millions of unique phrasings, each fired a handful of times, with the same underlying intent.
Synthesising 500 prompts and calling it a measurement of how AI sees your brand is like sticking a ruler in one puddle and reporting on sea level. It might correlate with the real signal if the puddle is well-chosen and the long-tail is roughly homogeneous. Most of the time, neither holds.
Worse: the prompt set is the vendor's theory of your customer's questions, not actual data from your customer's questions. Two different vendors will pick different prompt sets for the same brand and produce different scores, and there is no way to know which is closer to truth - because there is no ground truth to compare against.
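A toy model makes the vendor-disagreement point concrete. Assume (all numbers invented for illustration) three intent clusters with different true citation rates, real traffic that skews toward the tail, and two vendors whose curated prompt sets skew differently:

```python
# Toy model, assumed numbers: three intent clusters with different
# true citation rates for the brand.
rates = {"brand": 0.70, "category": 0.25, "comparison": 0.10}

# Real user traffic is tail-heavy: mostly non-branded intents.
traffic = {"brand": 0.10, "category": 0.55, "comparison": 0.35}

def score(mix: dict[str, float]) -> float:
    """Citation rate implied by a given intent mix over the clusters."""
    return sum(mix[c] * rates[c] for c in rates)

true_score = score(traffic)
vendor_a = score({"brand": 0.50, "category": 0.35, "comparison": 0.15})  # head-heavy set
vendor_b = score({"brand": 0.20, "category": 0.30, "comparison": 0.50})  # comparison-heavy set

print(f"traffic-weighted={true_score:.3f}  vendorA={vendor_a:.3f}  vendorB={vendor_b:.3f}")
```

Both vendors are internally consistent, both differ from each other, and both differ from the traffic-weighted truth - and with no access to the real query distribution, nobody can say which is closer.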
NOISE IN, NOISE OUT, NOISE EVERYWHERE.
Once the tool has its noisy LLM responses, it runs them through brand-mention extraction ("did the response cite Acme?") and sentiment classification ("was the mention positive, neutral, negative?"). Both are imperfect NLP processes with their own error rates.
Stochastic LLM output plus imperfect NLP parsing equals compounded noise downstream. The error bars on the final dashboard number are wider than the moves the dashboard claims to detect. A 5% change in your score is well inside the noise floor of the measurement stack itself.
None of this is theoretical. Run the same response through two different extraction pipelines and you will get different mention counts. Run the same mention through two different sentiment classifiers and you will get different polarity scores. The vendor picks one pipeline and reports its output as ground truth.
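The attenuation is simple arithmetic. Assuming an extraction stage with 90% recall and a 5% false-positive rate (illustrative figures, not measured from any real pipeline), a five-point true move shrinks before it reaches the dashboard, while thin sampling adds noise many times larger:

```python
import math

# Assumed, illustrative error rates for one extraction pipeline:
EXTRACT_RECALL = 0.90  # P(mention detected | brand actually cited)
EXTRACT_FPR    = 0.05  # P(mention "detected" | brand not cited)

def observed(true_cite_rate: float) -> float:
    """Expected dashboard rate after the imperfect extraction stage."""
    return true_cite_rate * EXTRACT_RECALL + (1 - true_cite_rate) * EXTRACT_FPR

before, after = observed(0.30), observed(0.35)
signal = after - before                       # a 5-point true move, attenuated

n = 3                                         # trials per query per cycle
noise = math.sqrt(before * (1 - before) / n)  # per-query sampling std dev

print(f"signal={signal:.4f}  sampling noise={noise:.4f}")
```

Under these assumptions the per-query sampling noise is several times the size of the move the dashboard is supposed to detect, before sentiment-classification error is even added on top.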
SIX SOURCES OF NOISE STACKED IS DASHBOARD THEATRE.
Sampling variance, missing user context, hidden model routing, silent model swaps, vendor-curated prompt sets, and lossy NLP extraction: each alone widens the error bars, and the dashboard stacks all six into a single unqualified score.
WHAT YOU SHOULD ACTUALLY DO.
The honest playbook for tracking AI citations until the measurement stack catches up:
- Treat any commercial tool's score as a directional vibe, not a measurement. Useful for noticing big movements, useless for fine attribution.
- Track manually. Pick 30-50 high-intent queries your audience would actually ask. Run them across ChatGPT, Claude, Perplexity, and Google AI weekly or monthly. Screenshot the citations. See N. 31 for the open methodology.
- Optimise input-side dimensions you can control. The 7-Dimension AI Visibility Score (N. 03) measures things that are stable and within your power to move: schema, RSS, robots.txt, freshness. Outputs are downstream of inputs.
- Treat the audit as a quarterly snapshot, not a daily dashboard. A real measurement cadence for noisy systems is months, not minutes.
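The manual cadence above needs nothing fancier than an append-only log. A minimal sketch - the queries, engine names, and file name here are hypothetical placeholders for your own:

```python
import csv
import datetime

ENGINES = ["chatgpt", "claude", "perplexity", "google_ai"]
QUERIES = [                    # hypothetical examples; use your own 30-50
    "best crm for small teams",
    "acme crm vs competitors",
]

def log_cycle(path: str, results: dict[tuple[str, str], bool]) -> None:
    """Append one audit cycle: (query, engine) -> was the brand cited?"""
    stamp = datetime.date.today().isoformat()
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        for (query, engine), cited in sorted(results.items()):
            writer.writerow([stamp, query, engine, int(cited)])

# After manually running each query on each surface, record what you saw:
cycle = {(q, e): False for q in QUERIES for e in ENGINES}
cycle[("best crm for small teams", "perplexity")] = True
log_cycle("ai_citation_audit.csv", cycle)
```

Paired with screenshots, a file like this gives you a quarterly trail you control, with a known sample size and a known prompt set - the two things the commercial tools will not disclose.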
THREE QUESTIONS THAT SURFACE THE GAP.
If you are evaluating a commercial AI visibility tracking vendor, ask these three. The answers tell you whether the product claims line up with the technology reality.
1. What is your sample size per query per cycle? If it is under ten, the score is one observation away from being noise. If they will not disclose it, that is the answer.
2. How do you account for model routing in ChatGPT and Perplexity? If the answer is "we use the API," the tool is not measuring what consumers experience. If they cannot describe their model assumption, they have not thought about it.
3. How is your prompt set generated and refreshed? If it is a static vendor-curated list, the tool is measuring its own theory of your customer, not your actual customer. If it is generated from real query data, ask how they accessed real query data - that is a real question.
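The "under ten" threshold in question one is not arbitrary. A rough normal-approximation confidence interval (crude below roughly thirty trials, but directionally right) shows how wide the error bars are at the sample sizes these tools actually run:

```python
import math

def margin_95(p_hat: float, n: int) -> float:
    """Rough 95% half-width for a citation-rate estimate from n trials
    (normal approximation; crude at small n, but directionally right)."""
    return 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)

# Margin of error around an observed citation rate of 0.4:
for n in (1, 3, 10, 50, 200):
    print(f"n={n:>3}  score +/- {margin_95(0.4, n):.2f}")
```

At three trials the interval spans more than half the possible range; detecting a five-point move with any confidence takes hundreds of trials per query per cycle, which is exactly the cost no thin-sampling vendor is paying.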
THE BOTTOM LINE.
Commercial AI citation tracking exists because investors and consultants need it to exist. The category got funded before the technology could support its claims, and the gap will not close until LLM providers expose sample-size, routing, and model-version data the tools currently fake their way around.
Until that changes, the honest move is to track manually, optimise input-side dimensions, and treat any dashboard number as a vibe rather than a measurement. Do not pay for noise dressed as a metric.