BridgeMind AI claimed that Anthropic's Claude Opus 4.6 had been secretly degraded, based on a new hallucination measurement. The viral post has since drawn heavy criticism for flawed methodology.
The claim sparked a major debate about whether AI companies are quietly downgrading paid models to cut costs.
BridgeMind claims a 98% increase in hallucinations
BridgeMind, the team behind the BridgeBench code benchmark, reported that Claude Opus 4.6 had fallen from second to tenth place on their hallucination leaderboard, with accuracy reportedly dropping from 83.3% to 68.3%.
'CLAUDE OPUS 4.6 HAS BEEN NERFED. BridgeBench has just proven it. Last week, Claude Opus 4.6 was ranked second on the hallucination benchmark with an accuracy of 83.3%. Today, Claude Opus 4.6 was retested and fell to 10th place on the leaderboard with an accuracy of only 68.3%,' they wrote.
The post presented this as 'evidence' of reduced reasoning ability. However, a closer examination of the underlying data tells a different story.
Critics say the comparison is fundamentally flawed
Computer scientist Paul Calcraft called the claim 'incredibly poor science' and pointed out critical flaws in the methodology.
'Incredibly poor science. You tested Opus on 30 tasks today; the previous score was based on only *6* tasks. The result for the 6 common tasks: 85.4% today vs. 87.6% last time. The change comes down to a *single* additional hallucination, with no repeated runs – plain statistical noise,' Calcraft wrote.
The original high score came from only six tasks. The new retest expanded the benchmark to 30 tasks.
On the six overlapping tasks, performance was nearly identical: a drop from just 87.6% to 85.4%.
That small change traces back to one additional hallucination on a single task. Without repeated runs, this falls well within normal statistical variation for AI models.
Large language models are not deterministic, and one poor output on a small sample can significantly shift the results.
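To see why six tasks is far too small a sample, here is a quick sketch using a standard Wilson score interval. The task counts (5/6 and 26/30 correct) are illustrative round numbers consistent with the reported percentages, not BridgeBench's actual raw data:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

# A single flipped answer shifts accuracy far more on a small benchmark.
print(f"Per-task swing on  6 tasks: {100 / 6:.1f} points")   # ~16.7
print(f"Per-task swing on 30 tasks: {100 / 30:.1f} points")  # ~3.3

# Illustrative counts: 5/6 correct (~83%) vs. 26/30 correct (~87%).
for correct, n in [(5, 6), (26, 30)]:
    lo, hi = wilson_interval(correct, n)
    print(f"{correct}/{n} correct -> 95% CI roughly {lo:.0%} to {hi:.0%}")
```

With only six tasks, the 95% interval spans more than fifty percentage points, so both the "old" and "new" scores sit comfortably inside each other's uncertainty bands.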
Greater frustration fuels the narrative
Nonetheless, the post struck a nerve. Since its launch in February 2026, Claude Opus 4.6 has drawn persistent complaints about perceived quality decline.
Developers report shorter responses, weaker instruction adherence, and reduced depth in reasoning, especially during high-pressure periods.
Some of this is due to deliberate product changes. Anthropic introduced adaptive thinking controls that let the model adjust how much reasoning effort it applies. The default effort level was later set to medium, prioritizing efficiency over maximum depth.
An independent analysis of over 6800 Claude Code sessions showed that reasoning depth fell by about 67% during February.
The average number of files the model read before editing code dropped from 6.6 to 2.0, suggesting it attempted to fix code it had barely read.
What this means for AI users
This reflects a growing tension in the AI industry. Companies are optimizing their models for cost and scalability after launch, while heavy users expect consistent top performance. The gap between these priorities undermines trust.
Based on the available evidence, the BridgeBench data does not show a deliberate downgrade. The benchmark comparison was not among like samples, and the overlapping results were almost identical.
The frustration behind the claims is, however, not entirely unfounded. Adaptive computational controls and adjustments to the service have changed how Claude Opus 4.6 actually behaves. For developers relying on consistent output, it matters.
As of April 13, Anthropic has not made a public statement regarding the specific claims from BridgeBench.
