Ask five leading AIs whether a claim is true, and chances are they’ll disagree, often loudly. That’s the headline from a new study by Kosta Jordanov at Lenz Research, which tested five frontier models on 1,000 real-world fact-check claims submitted by users and found widespread and sometimes dramatic disagreement.

The experiment asked GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro with Search, and Sonar Pro to label each claim as true, mostly true, misleading, or false. On 672 of the 1,000 claims, at least one model dissented from the majority. In roughly one-third of cases (34%), disagreement was extreme: one model labeled a claim true while another labeled it false.

Crucially, these were not sanitized benchmark questions with clear answer keys. The claims came from actual users of Lenz’s fact-checking platform: “jagged, ambiguous” real-world material unlikely to have a canonical gold label in any training corpus. That design makes it hard to attribute the results to leaked test sets or memorized benchmarks.

The statistical picture is telling: Krippendorff’s alpha, a standard measure of inter-rater agreement, was 0.639 (where 1.0 is perfect agreement and 0 is the level of agreement expected by chance). The study calls this “nontrivial but limited agreement”; by common standards, anything under 0.8 is considered weak.

When all five models did agree, which happened on only 328 claims, unanimity clustered at the extremes. The “nuance” buckets nearly vanished: just four claims were unanimously judged “misleading,” and none were unanimously “mostly true.”

Concrete examples show how consequential the splits can be. The claim “The World Bank’s active portfolio in Nigeria stands over $16.4 billion as of 2025” drew “mostly true” from GPT-5.4, “false” from Gemini 3 Pro, and “misleading” from Gemini 3 Pro + Search.
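For readers who want to see how an agreement score like Krippendorff’s alpha is computed, here is a minimal Python sketch. It assumes the nominal variant (labels treated as unordered categories) and no missing ratings; the article does not say which variant the study used, and the function name and toy data below are illustrative, not taken from the study.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha for a list of per-claim label lists.

    Hypothetical helper for illustration: each inner list holds the labels
    that the raters (here, models) gave to one claim.
    """
    # Coincidence counts: every ordered pair of ratings within a claim,
    # weighted by 1 / (number of ratings on that claim - 1).
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single rating carries no agreement information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    # Marginal totals per label, and the overall number of pairable ratings.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n == 0:
        raise ValueError("need at least one claim with two or more ratings")

    # Observed vs. chance-expected disagreement (mismatched label pairs).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one label was ever used, so no disagreement is possible
    return 1 - observed / expected


# Made-up toy data: four claims, five ratings each (NOT the study's data).
toy_units = [
    ["true", "true", "true", "true", "true"],
    ["false", "false", "false", "false", "false"],
    ["true", "mostly true", "misleading", "false", "true"],
    ["misleading", "misleading", "false", "false", "misleading"],
]
alpha_toy = krippendorff_alpha_nominal(toy_units)
print(f"alpha on toy data: {alpha_toy:.3f}")
```

The score compares the disagreement actually observed with the disagreement that would be expected if labels were assigned by chance, which is why unanimous data scores 1.0. For real analysis, an established statistics package is a safer choice than hand-rolled code.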
On the politically charged claim “Donald Trump said that an attack on Iran was postponed at the request of Gulf allies,” GPT-5.4 said false, Claude Opus 4.7 said mostly true, Gemini 3 Pro said false, and Gemini 3 Pro + Search said true.

The study’s core takeaway: these models aren’t just hallucinating wild facts, which is a known problem. They also fail to converge on basic factual judgments about the same material. “A majority of frontier models is not ground truth,” the researchers warn: the majority can be wrong, and a lone dissenting model can sometimes be right. But without a built-in tie-breaker or consistent arbitration, there is no way to know which verdict to trust, and any disagreement means at least one model’s verdict does not match the correct label under the four-label rubric.

Why this matters for crypto audiences: crypto communities frequently lean on LLMs for quick due diligence, on-chain analysis, research synthesis, and rumor-checking. If leading models give conflicting verdicts on factual claims, relying on a single LLM for investment or policy decisions introduces real risk. The disappearance of “mostly true” consensus also signals that AIs struggle with nuance, precisely the gray areas that often determine market-moving interpretations.

Practical takeaways for crypto readers:

- Don’t trust a single model: cross-check claims across multiple models and with primary sources.
- Prioritize on-chain and primary data (block explorers, smart-contract reads) over AI summaries.
- Treat AI verdicts as signals, not seals of truth; use human review for high-stakes decisions.
- Demand transparent, auditable fact-checking processes from AI vendors and services.

The Lenz study is a reminder that while AI is getting more capable, it’s not yet a reliable, unified arbiter of truth, especially on the messy, ambiguous claims that matter in crypto markets. Use these tools, but keep a skeptical, source-first workflow.
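The cross-checking advice above can be sketched in a few lines of Python. The function below takes one claim’s verdicts from several models and flags unanimity, a strict-majority label if one exists, and a true-versus-false split. The function name, the output fields, and the strict-majority rule are assumptions made for illustration; the study’s exact definitions may differ. The sample data uses only the verdicts quoted in this article.

```python
from collections import Counter


def triage(verdicts):
    """Summarize how much a set of model verdicts on one claim disagree.

    Hypothetical helper: `verdicts` maps a model name to one of the four
    labels ("true", "mostly true", "misleading", "false").
    """
    counts = Counter(verdicts.values())
    top_label, top_count = counts.most_common(1)[0]
    # Assumption: "majority" means strictly more than half of the models.
    has_majority = top_count > len(verdicts) / 2
    return {
        "unanimous": len(counts) == 1,
        "majority_label": top_label if has_majority else None,
        "dissenters": sorted(
            model
            for model, label in verdicts.items()
            if has_majority and label != top_label
        ),
        # Extreme split: at least one "true" and at least one "false".
        "extreme_split": "true" in counts and "false" in counts,
        # Any disagreement at all is routed to a human reviewer.
        "needs_human_review": len(counts) > 1,
    }


# Verdicts as quoted in the article (only the models it names per claim).
world_bank = {
    "GPT-5.4": "mostly true",
    "Gemini 3 Pro": "false",
    "Gemini 3 Pro + Search": "misleading",
}
iran_claim = {
    "GPT-5.4": "false",
    "Claude Opus 4.7": "mostly true",
    "Gemini 3 Pro": "false",
    "Gemini 3 Pro + Search": "true",
}

for name, verdicts in [("World Bank", world_bank), ("Iran", iran_claim)]:
    print(name, triage(verdicts))
```

A result with `needs_human_review` set is a cue to go to primary sources, not a verdict in itself; as the study notes, even a clear majority label can be wrong.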