If you've been following the advice about AI visibility, especially if you're a practitioner whose expertise should speak for itself, you've probably noticed something: the advice keeps contradicting itself.
- One study analyzes 129,000 websites and finds that traditional authority signals (backlinks, traffic, domain trust) are what drive AI citations. Schema markup doesn't even make the list.
- Another study finds that schema improves extraction accuracy by 30% in controlled conditions.
- A third, analyzing 55,000+ queries, finds that 37% of domains cited by AI search engines don't appear in traditional search results at all, suggesting AI isn't just rewarding the same old authority signals.
This isn't bad advice from bad people. This is what the research actually looks like right now.
And the fact that so few people are telling you that is part of the problem.
What Are the Studies Actually Saying?
The major studies reach different conclusions because they measured different platforms, at different moments, using different methodologies. The contradictions aren't bad science. They're the predictable result of studying systems that keep changing.
Here's what happens when you line up the major research side by side instead of cherry-picking the one that supports what you're selling.
SE Ranking (December 2025) analyzed 129,000+ websites to identify what correlates with a domain being cited by ChatGPT (vendor research; full methodology not disclosed). They found 20 factors. The top-ranked: referring domains, total traffic, domain trust. Traditional authority signals, top to bottom. Schema markup? Completely absent from all 20 factors. Not weak. Not ranked low. Not there.
A large-scale study (arXiv, 2025) analyzing 55,936 queries across six AI search engines and two traditional search engines found something different. LLM-powered search engines cite with greater source diversity than Google: 37% of the domains they cited were unique to AI platforms and didn't appear in traditional search results. If traditional authority were the whole story, that number would be close to zero.
On schema specifically, the practitioner data paints a consistent picture at population scale, but with important nuance in controlled experiments. SearchAtlas found no measurable advantage. SALT.agency analyzed 107,352 websites in Google's AI Mode and called schema "a hygiene factor, not a differentiator." But Mark Williams-Cook ran an experiment in February 2026 where he put fake company data in invalid JSON-LD (not visible on the page) and both ChatGPT and Perplexity found and returned it. They read schema as plain text. And the AISO experiment found schema improved extraction accuracy by 30%, but only when reinforcing content already visible on the page. Schema-only content consistently failed extraction.
So does schema matter? At population scale, the data says no. In controlled experiments, the data says it depends. Both are measuring real things.
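To make that distinction concrete, here's a minimal sketch of the "reinforce, don't replace" pattern, in Python with hypothetical page copy. The point is that the JSON-LD restates an answer a reader can already see in the rendered page, rather than carrying schema-only facts, which is the pattern that consistently failed extraction in the experiments above.

```python
import json

# Hypothetical page copy: in practice, pull this from text that is
# already visible in your rendered HTML. Facts that exist only in the
# schema are the pattern that failed extraction in the cited tests.
visible_question = "Does schema markup help with AI citations?"
visible_answer = (
    "At population scale, no measurable advantage; in controlled tests, "
    "schema helped only when it mirrored text already on the page."
)

# Build JSON-LD that restates, rather than replaces, the visible answer.
schema = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": visible_question,
            "acceptedAnswer": {"@type": "Answer", "text": visible_answer},
        }
    ],
}

# Emit the <script> tag you would embed in the page's <head>.
print('<script type="application/ld+json">')
print(json.dumps(schema, indent=2))
print("</script>")
```

If the Williams-Cook experiment is right that these systems read schema as plain text, duplication is the feature, not a bug: the markup gives retrieval a second, cleanly structured copy of what the page already says.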
The uncomfortable truth: these studies aren't disagreeing because someone got it wrong. They're disagreeing because they measured different platforms, at different moments, using different methodologies. And the systems they're studying change constantly.
Why Does the Research Keep Contradicting Itself?
Different AI platforms use entirely different retrieval architectures, don't always search at all, produce non-repeatable outputs, and are biased toward copying retrieved text over using their own knowledge, even when that text is wrong. Contradictions between studies aren't a flaw. They're inevitable.
The contradictions aren't a sign of bad science. They're a predictable consequence of how these systems actually work.
There is no single system to optimize for. ChatGPT retrieves results through Bing. Claude uses Brave Search, a completely different index with different ranking signals. Gemini uses Google Search. Perplexity runs its own retrieval-first architecture. DeepSeek has historically relied on parametric knowledge with limited external citation capability. A study measuring ChatGPT citation behavior is studying a fundamentally different system than one measuring Perplexity.
The systems don't always search. A study analyzing approximately 14,000 real AI conversations found that 24% of GPT-4o responses and 34% of Gemini responses use no web search at all. The model answers from memory, with no attribution possible. Gemini provides no clickable citation in 92% of its answers (Strauss, arXiv:2508.00838, 2025). When AI doesn't look anything up, there's nothing to cite. Your content could be perfectly optimized and it wouldn't matter for that query.
When they do cite, it's not stable. LLM outputs are probabilistic. Same question, different time, different user, different answer. SparkToro research (nearly 3,000 prompt runs across ChatGPT, Claude, and Google AI) found less than a 1-in-100 chance that the same AI returns the same brand recommendation list twice (Fishkin & O'Donnell, January 2026). Notably, while rankings were chaotic, visibility was more stable: some brands appeared in 60-90% of responses for a given intent, even though their position varied wildly. That's not a measurement problem you can solve with a better tool. That's the nature of the system.
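If you want to see that instability in your own niche, the measurement itself is simple to sketch. The snippet below is an illustration, not SparkToro's methodology: `ask_model` is a hypothetical stand-in for whatever provider API you call, and the script separates the two things their data distinguishes, how often a brand appears at all versus how often it holds the top rank.

```python
from collections import Counter

def ask_model(prompt: str) -> list[str]:
    """Hypothetical stand-in for an LLM API call that returns an
    ordered list of recommended brands for the prompt."""
    raise NotImplementedError("wire this to your provider of choice")

def visibility_vs_rank(prompt: str, runs: int = 100) -> None:
    appearances = Counter()  # how often each brand shows up at all
    top_spot = Counter()     # how often each brand is ranked first
    for _ in range(runs):
        brands = ask_model(prompt)
        appearances.update(set(brands))  # presence only, order ignored
        if brands:
            top_spot[brands[0]] += 1
    for brand, seen in appearances.most_common():
        print(f"{brand}: visible in {seen / runs:.0%} of runs, "
              f"ranked #1 in {top_spot[brand] / runs:.0%}")
```

Run enough repetitions and the research pattern tends to reappear: appearance rates can be fairly steady while the #1 slot churns.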
The systems are architecturally biased toward copying. A Chalmers University study published at EMNLP 2024 found that RAG systems show higher confidence when copying from retrieved context (0.95-0.98) than when relying on their own correct parametric knowledge (0.51-0.84), even when the retrieved context is factually wrong (Farahani & Johansson, arXiv:2410.05162). This suggests citation behavior can reflect retrieval mechanics as much as content quality. This is a peer-reviewed finding that should give pause to anyone building a strategy around "getting cited."
Most of what AI "knows" about you is invisible. There are roughly three tiers of how AI uses information: explicitly cited (you can see it), anonymously mentioned (you can't), and parametrically absorbed into the model's training data (permanently unmeasurable). Every measurement tool tracks only the first tier, the smallest one.
And here's the part nobody likes: nearly 20% of GPT-4o's citations are entirely fabricated. The system confidently cites sources that don't exist, according to a Deakin University study published in JMIR Mental Health (Linardon et al., November 2025). So even the visible, measurable tier isn't fully reliable.
The punchline: When the systems themselves work this differently (different search backends, different retrieval architectures, different citation behaviors, constant updates), studies measuring different platforms at different moments should reach different conclusions. The contradictions aren't a bug. They're what honest measurement looks like in a fragmented, non-deterministic landscape.
What Actually Holds Up Across the Contradictions?
Four principles show up in every study regardless of methodology: substantive content with clear structure, presence beyond your own website, entity clarity and consistency, and freshness. These aren't new rules. They're the same authority fundamentals. With higher stakes.
Stop looking at which factor ranks first. Start looking at what's consistent across the contradictions.
When you lay the studies next to each other instead of against each other, patterns emerge. Not specific factor rankings. Those shift. But principles that keep showing up regardless of which study you read.
Substantive content with clear structure. Every study, every practitioner analysis, every reverse-engineering effort reaches the same place: content that directly answers questions, uses clear headings, and organizes information in self-contained sections tends to get cited more. Search Engine Land's analysis found that 72.4% of blog posts cited by ChatGPT contained an "answer capsule": a concise, self-contained answer placed directly after a heading, with no links inside (Gnuse, Search Engine Land). The specifics of how much more vary by study. The direction doesn't. A rough way to spot the capsule pattern in your own pages is sketched after these four principles.
Being present beyond your own website. Whether it's called "referring domains" (SE Ranking's #1 factor), "brand mentions," or "community signals" (Perplexity's apparent preference for Reddit and review platforms), the principle is the same. AI systems cross-reference. The SE Ranking study found that sites present on review platforms like Trustpilot, G2, or Capterra averaged 4.6-6.3 citations, compared to 1.8 for those absent. If the only place you exist is your own site, you're harder to verify and easier to skip.
Entity clarity and consistency. AI needs to accurately identify who you are before it can cite you. When your About page, your LinkedIn, your directory listings, and your author bios tell a consistent story, that builds the kind of signal these systems use for verification. This isn't about one magic schema property. It's about visible consistency across touchpoints.
Freshness. The SE Ranking study found that content updated within three months averaged approximately 6 citations, compared to 3.6 for content that hadn't been updated recently, though freshness may also be a proxy for other signals, since actively maintained sites tend to have more backlinks and traffic. Still, the pattern shows up across multiple analyses: content that's actively maintained performs better than content that's left to age.
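As promised above, here's a rough heuristic for spotting the answer-capsule pattern in your own pages. It's a sketch, not the Search Engine Land methodology: it assumes BeautifulSoup is installed, only checks direct siblings, and the 80-word ceiling is my assumption, not a number from the study.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def find_answer_capsules(html: str, max_words: int = 80) -> list[str]:
    """Heuristic for the 'answer capsule' pattern: a heading followed
    immediately by one concise, link-free paragraph."""
    soup = BeautifulSoup(html, "html.parser")
    capsules = []
    for heading in soup.find_all(["h2", "h3"]):
        para = heading.find_next_sibling("p")
        if para is None or para.find("a"):
            continue  # no directly adjacent paragraph, or it has links
        text = para.get_text(" ", strip=True)
        if len(text.split()) <= max_words:
            capsules.append(f"{heading.get_text(strip=True)} -> {text}")
    return capsules
```

Pages where the headings that matter most return nothing here are the obvious first candidates for restructuring.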
These are the fundamentals that have historically built authority. Substantive content. Presence beyond your own site. Consistent identity. Keeping things current. The AI layer didn't invent new rules. It raised the stakes on the old ones.
What Does This Mean If You're an Expert Who Feels Invisible?
Stop chasing specific tactics and build the fundamentals instead. You can't optimize for a target that shifts every quarter, but you can close the gap between what you've actually built and what someone searching can actually find.
If you're someone with deep expertise who keeps watching less qualified people get the visibility, this is actually clarifying news.
You can't optimize for a target that moves every quarter. If you're waiting for the definitive study that tells you exactly which factor to prioritize, you'll be waiting forever. The studies will keep contradicting each other because the systems keep changing.
You can build the fundamentals that hold up regardless. The practitioners who'll be visible in two years aren't the ones who picked the right tactic in February 2026. They're the ones who built genuine authority across the board: substantive content, real presence beyond their own site, clear identity, consistent updates.
Most specialists have gaps they can't see. Not because they lack expertise, but because expertise and visibility are different skills. You might have decades of knowledge and a digital presence that captures only a fraction of what you've actually built. Or a website that accurately represents what you do but exists in isolation, with no external signals pointing back to it.
The gap between what you've actually built and what someone searching can actually find? That's where the work is.
The Honest Ending
I don't know which study will be right next quarter. I don't know whether schema will matter more or less by summer. I don't know which platform will change its retrieval architecture next.
I do know what's been consistent across every study I've read: the same types of signals that have historically built authority keep showing up. Substance. Presence. Clarity. Freshness.
If you have real expertise and real substance, the question isn't whether you should be visible to AI. It's whether your digital presence actually reflects what you bring.
That's a gap worth closing, and it doesn't require waiting for the research to settle.
If this resonated and you want to see the gap between what your work demonstrates and what someone actually finds, that's what the Brand Authority Diagnostic is for: an honest map of where your expertise is strong, where it's inaccessible, and what to do about it, in what order.
Does schema markup help with AI citations?
At population scale, the data says no. Multiple large-sample analyses found no measurable advantage from schema markup. But in controlled experiments, schema improved extraction accuracy by 30% when reinforcing content already visible on the page. The key finding: AI systems read schema as plain text, not as the semantic structure it was designed to provide. Schema-only content consistently fails extraction. Treat schema as baseline infrastructure that may help AI notice what's already on your page, not as a visibility driver on its own.
Which AI search engine should I optimize for?
None of them specifically, and that's the point. ChatGPT uses Bing, Claude uses Brave Search, Gemini uses Google Search, and Perplexity runs its own retrieval system. A strategy optimized for one platform's retrieval mechanics may not transfer to another. The practitioners who'll be visible across platforms are the ones building genuine authority fundamentals: substantive content, external presence, entity consistency, and freshness.
Can I measure my AI visibility?
Only partially. Every measurement tool tracks explicit citations, the smallest tier of how AI uses information. AI also mentions sources anonymously and absorbs information into training data permanently, neither of which is measurable. On top of that, nearly 20% of GPT-4o's explicit citations are fabricated, and the same prompt produces different results almost every time. You can track directional trends, but don't mistake any single measurement for ground truth.
How often should I update content for AI visibility?
The SE Ranking study found that content updated within three months averaged approximately 6 citations, compared to 3.6 for older content, though freshness may also be a proxy for other signals like backlinks and traffic. The principle is consistent across studies: actively maintained content outperforms content left to age. Focus on keeping your highest-value pages current rather than updating everything on a rigid schedule.
Sources
- SE Ranking analysis of 129,000+ websites (December 2025), as reported by GEOReport.ai
- Strauss, I. (2025). "The Attribution Crisis in LLM Search Results." arXiv:2508.00838. ~14,000 real-world conversations.
- "Source Coverage and Citation Bias in LLM-based vs. Traditional Search Engines" (arXiv, 2025). 55,936 queries across six LLM search engines and two traditional search engines.
- Farahani, M. & Johansson, R. (2024). "Deciphering the Interplay of Parametric and Non-parametric Memory in Retrieval-augmented Language Models." EMNLP 2024. arXiv:2410.05162.
- Linardon, J. et al. (2025). "Influence of Topic Familiarity and Prompt Specificity on Citation Fabrication." JMIR Mental Health, 12, e80371. DOI: 10.2196/80371.
- Fishkin, R. & O'Donnell, P. (2026). "AIs are highly inconsistent when recommending brands or products." SparkToro. ~3,000 prompt runs.
- SearchAtlas: schema markup and AI citation analysis (December 2025).
- SALT.agency: analysis of 107,352 websites in Google AI Mode citations (September 2025).
- Williams-Cook, M. (2026). Invalid JSON-LD schema extraction experiment. Search Engine Roundtable, February 2026.
- Yang, A. (2025). "How Schema Markup Might Actually Work in AI Search." Medium, December 2025. References SearchVIU (0/5 extraction) and AISO (+30% with visible content) experiments.
- Gnuse, A. "How to get cited by ChatGPT: the content traits LLMs quote most." Search Engine Land. Answer capsule analysis.
Every data point in this article reflects what one study found at a specific point in time. The field is moving fast. Treat everything here, including the contradictions, as a snapshot, not a verdict.