A new study by researcher Kosta Jordanov at Lenz Research found that leading AI models often gave different verdicts when reviewing the same factual claims. The study tested GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro with Search, and Sonar Pro on 1,000 claims submitted by users to a fact-checking platform.
Each model had to label each claim as true, mostly true, misleading, or false. The models disagreed on 672 of the 1,000 claims. In 34% of the claims, the disagreement was severe, with one model labeling a claim true while another labeled it false.
Why It Matters: Organizations are adding AI to workflows where accuracy and trust matter. The study shows that even leading models can produce different answers when asked to assess the same claim, reinforcing the importance of source tracking and governance controls.
- The Models Disagreed on Most Claims Tested: The study found disagreement on 672 of the 1,000 claims, meaning the five models failed to reach a unanimous verdict in roughly two-thirds of the test set. For organizations using AI in compliance review or internal knowledge search, that level of inconsistency creates a clear reason for caution. A single model response may appear confident, while another leading system would classify the same claim differently. In business settings, that can affect how teams assess policy language and risk. The finding also makes it harder to treat model confidence as a substitute for evidence.
- Some Disagreements Produced Opposite Answers: In 34% of claims, one model labeled a claim True while another labeled it False. Such a split can create confusion in any workflow where AI is used to verify information before a person acts on it. In cases reviewed by the study, the models were evaluating the same material, yet they assigned different truth labels, which shows how fragile AI-assisted verification can be when context matters.
- The Test Used User-Submitted Claims Instead of a Standard Benchmark: The study relied on claims submitted to a fact-checking platform, making the results relevant to how AI is often used within organizations. Employees rarely ask AI tools only clean questions with settled answer keys. They bring in claims pulled from various sources, which may depend on jurisdiction or missing context. A model that performs well on public tests may still struggle when information is fragmented across records or when the answer depends on careful interpretation. That makes governance more important for teams deciding where AI can assist and where expert review remains necessary.
- Models Struggle With Nuanced Verdicts: The five models agreed on only 328 of the 1,000 claims, and their full agreement appeared mostly when the answer landed at True or False. Only four claims received a unanimous Misleading label, and none received a unanimous Mostly True label. That pattern matters for enterprise use because many work questions sit in the gray area. The study suggests that AI agreement becomes weaker when the answer requires qualification rather than a simple verdict.
- Stronger AI Governance Before Deployment: The findings do not argue that AI tools have no value in verification work. They show that organizations need clear controls for how factual answers are reviewed and used. Teams should require citations and define when a subject matter expert must review the answer. AI can still help gather information and surface useful context, but its verdicts should be treated as inputs within a governed workflow.
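For teams that want to run a similar consistency check on their own tools, the counts reported above (unanimous agreement, any disagreement, and True-versus-False splits) can be tallied with a few lines of Python. This is a minimal sketch and not the study's actual method: the function names, the lowercase label strings, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def classify_claim(verdicts):
    """Classify one claim's set of model verdicts.

    Returns 'unanimous' if every model gave the same label,
    'severe' if at least one model said 'true' and another said 'false',
    and 'mixed' for any other disagreement.
    Label strings are an assumption, mirroring the four labels in the article.
    """
    distinct = set(verdicts)
    if len(distinct) == 1:
        return "unanimous"
    if "true" in distinct and "false" in distinct:
        return "severe"
    return "mixed"


def summarize(claims):
    """Tally agreement statistics over a list of per-claim verdict lists."""
    counts = Counter(classify_claim(v) for v in claims)
    total = len(claims)
    return {
        "unanimous": counts["unanimous"],
        "disagreed": total - counts["unanimous"],
        "severe_share": counts["severe"] / total if total else 0.0,
    }


# Toy data invented for illustration; each inner list holds five model verdicts.
sample = [
    ["true", "true", "true", "true", "true"],
    ["true", "mostly true", "true", "true", "true"],
    ["true", "false", "misleading", "true", "false"],
    ["false", "false", "false", "false", "false"],
]

print(summarize(sample))
```

Applied to the figures in the study, the same bookkeeping gives 1,000 minus 328 unanimous claims, or 672 claims with some disagreement.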
Go Deeper -> AI Models Can’t Agree on Basic Facts Most of the Time, Study Shows – Decrypt


