AI Benchmarks Are Broken and Nobody Wants to Admit It
SWE-bench scores keep climbing, marketing decks keep citing them, and production AI keeps disappointing. After 14 years of watching the software industry worship vanity metrics, I am calling it: AI benchmarks measure benchmark performance, not real-world capability.
Every few weeks, a new model drops with a press release that reads like a report card from a proud parent. "State-of-the-art on SWE-bench." "New record on HumanEval." "Surpasses GPT-4 on MMLU." The industry claps. Twitter threads go viral. Enterprise buyers update their spreadsheets.
And then you deploy the thing.
I have been shipping software for 14 years. I have watched this exact pattern play out with every metric the industry has ever obsessed over: lines of code, test coverage percentages, story points, velocity charts. The moment a metric becomes a target, it ceases to be a good metric. Goodhart’s Law is not a suggestion. It is an inevitability.
What Is Actually Wrong With AI Benchmarks?
Let me be specific. The problem is not that benchmarks exist. The problem is threefold: they are gameable, they are narrow, and they have become marketing collateral instead of engineering tools.
SWE-bench, the benchmark everyone cites for AI coding agents, asks models to resolve real GitHub issues from popular Python repositories. Sounds rigorous. Except researchers demonstrated in April 2026 that the most prominent agent benchmarks can be systematically exploited. The attack surface is not subtle: test-output memorization, where agents learn to pattern-match expected outputs rather than actually understanding the codebase. Scaffold gaming, where the harness around the model does most of the heavy lifting. And dataset contamination, where the training data includes the exact repositories and issues being tested.
HumanEval, the other darling of AI coding benchmarks, consists of 164 programming problems. One hundred and sixty-four. If you built a hiring pipeline that evaluated candidates on 164 fixed questions that never changed, you would not have a hiring pipeline. You would have a leaked exam. That is exactly what HumanEval has become. Models are not solving novel problems. They are retrieving solutions to known problems from their training data. The benchmark was useful when it was introduced in 2021. It has been a leaderboard vanity metric since at least 2024.
MMLU tests knowledge across 57 subjects with multiple-choice questions. Multiple choice. In 2026. We are evaluating models that can generate entire applications, and we are grading them like a high school standardized test. The format itself caps what you can measure. You cannot assess reasoning depth, creative problem-solving, or the ability to handle ambiguity with four predetermined options.
How Did We Get Here?
The incentives are perfectly aligned for this to happen. Model providers need differentiation. Enterprise buyers need comparison criteria. Journalists need headlines. Benchmarks provide all three with minimal effort.
Here is the cycle: A lab publishes a new model. The press release leads with benchmark scores because they are easy to compare. Enterprise procurement teams, who mostly do not have ML engineers evaluating tools, use these scores as shorthand for quality. The next lab sees that benchmark scores drive coverage and sales, so they optimize harder for benchmarks. Repeat until the benchmarks measure nothing but benchmark performance.
“When a measure becomes a target, it ceases to be a good measure. We have known this since 1975. The AI industry decided the rule does not apply to them.”
The SWE-bench leaderboard is particularly instructive. Look at the scores over the past twelve months. They have climbed from roughly 30% to over 70%. Did AI coding agents get 2.3x better at real software engineering in a year? No. The scaffolding got better at the benchmark. The prompting strategies got more benchmark-specific. The agent architectures evolved to handle the specific patterns that SWE-bench tests reward.
I know this because we run AI coding tools in production at Fordel. We have tested every major model and agent against our actual client codebases. The correlation between SWE-bench score and usefulness on a real Next.js monorepo with custom abstractions, legacy patterns, and domain-specific logic is weak. Not zero — the generally more capable models do tend to perform better in practice. But the marginal differences that leaderboards obsess over (Model A at 52% vs Model B at 49%) vanish completely when you point both at a real codebase.
Is the Counterargument Valid?
Let me steel-man the other side. Benchmark defenders will tell you three things:
First, benchmarks provide a standardized comparison. True. Standardization has value. The alternative — everyone running their own private evals — makes cross-model comparison harder. I concede this.
Second, benchmark scores do correlate with general capability. Also true, at the macro level. A model that scores 20% on SWE-bench is almost certainly worse than one scoring 60%. The problem is not the existence of correlation. The problem is the precision people ascribe to it. The difference between 60% and 65% is noise. The difference between 60% and 65% in a marketing deck is a competitive moat.
Third, what is the alternative? You cannot ship a model without some evaluation. Fair. I am not arguing for no evaluation. I am arguing that the current evaluations have been captured by the marketing function and no longer serve the engineering function.
Now let me dismantle each.
Standardized comparison is only valuable when the standard measures something meaningful. A standardized test of how fast someone can memorize flash cards does not tell you who is the better doctor. SWE-bench increasingly measures how well an agent can be optimized for SWE-bench. When independent researchers can exploit the benchmark systematically — not through clever tricks but through fundamental weaknesses in the evaluation methodology — the standard is broken.
Macro-level correlation is table stakes. Nobody is making purchasing decisions between a 20% model and a 60% model. The decisions that matter are between 52% and 56%, and at that resolution, benchmark scores are noise with a marketing budget.
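To see why single-digit gaps are noise, put error bars on the scores. A quick sketch, assuming a benchmark of roughly 500 tasks (the size of SWE-bench Verified) and a simple normal approximation:

```python
import math

def score_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% normal-approximation confidence interval for a pass rate."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> float:
    """z-statistic for the difference between two observed pass rates."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Model A: 52% on 500 tasks. Model B: 56% on 500 tasks.
lo_a, hi_a = score_ci(0.52, 500)
lo_b, hi_b = score_ci(0.56, 500)
print(f"A: [{lo_a:.3f}, {hi_a:.3f}]  B: [{lo_b:.3f}, {hi_b:.3f}]")
print(f"z = {two_proportion_z(0.52, 0.56, 500, 500):.2f}")  # |z| well under 1.96
```

The confidence intervals overlap, and the two-proportion test comes back at roughly |z| = 1.3, short of conventional significance. On smaller benchmarks like HumanEval's 164 problems, the gap means even less.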
And the "what is the alternative" argument is a false binary. The alternative is not "no evaluation." The alternative is evaluation that matches how the tools are actually used.
What Does Production AI Evaluation Actually Look Like?
Here is what we actually measure when evaluating AI coding tools at Fordel. None of this shows up on a leaderboard.
- Time-to-useful-output: How long from prompt to a diff I would actually merge? Not "did it produce code" but "did it produce code I trust."
- Context utilization: Can it work with our actual codebase? Our abstractions, our naming conventions, our patterns? A model that scores 70% on SWE-bench but cannot follow our existing service patterns is useless.
- Failure mode quality: When it gets something wrong (and it will), how wrong is it? A model that fails gracefully — producing code that is close but needs adjustment — is far more valuable than one that either nails it or produces confidently wrong garbage.
- Iteration efficiency: How many back-and-forth cycles does it take to get to a mergeable result? A model that needs one correction is 5x more valuable than one that needs five, regardless of benchmark scores.
- Codebase coherence: Does the generated code look like it belongs in our codebase? Or does it introduce patterns from its training data that conflict with our architecture?
None of these are easy to benchmark. They are all context-dependent, subjective, and resist reduction to a single number. That is precisely why they matter. Real software engineering is context-dependent, subjective, and resists reduction to a single number. Any evaluation that pretends otherwise is measuring something other than software engineering.
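None of this requires exotic tooling, but it does require logging per-task records and aggregating them per tool. A minimal sketch of what that bookkeeping might look like; the field names are illustrative, not a standard:

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class TaskRecord:
    tool: str
    minutes_to_mergeable: float  # time-to-useful-output, wall clock
    iterations: int              # correction cycles before a mergeable diff
    merged: bool                 # did the diff actually land?

def summarize(records: list[TaskRecord]) -> dict[str, dict[str, float]]:
    """Per-tool merge rate, median iteration count, and median time-to-mergeable."""
    by_tool: dict[str, list[TaskRecord]] = {}
    for r in records:
        by_tool.setdefault(r.tool, []).append(r)
    return {
        tool: {
            "merge_rate": sum(r.merged for r in rs) / len(rs),
            "median_iterations": median(r.iterations for r in rs),
            "median_minutes": median(r.minutes_to_mergeable for r in rs),
        }
        for tool, rs in by_tool.items()
    }
```

Medians rather than means, because a single pathological session should not dominate the comparison. Failure mode quality and codebase coherence still need human judgment; this only captures the quantifiable slice.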
Why Do AI Agent Benchmarks Fail Harder Than Model Benchmarks?
The agent benchmark problem is worse than the model benchmark problem, and it deserves its own section.
When you benchmark a model, you are at least measuring a relatively contained thing: given this input, what is the output? When you benchmark an agent, you are measuring a system — the model, the scaffolding, the tool use, the retry logic, the prompt engineering, the context management. SWE-bench scores for agents reflect the entire stack, but they are marketed as if they reflect model capability.
This means an agent with a mediocre model but excellent scaffolding can outscore an agent with a superior model but naive scaffolding. The benchmark does not distinguish between the two. But the buyer cares enormously. An agent that owes its score to scaffolding is fragile — it works on the patterns the scaffold was designed for and breaks on everything else. A capable model with minimal scaffolding generalizes better.
The April 2026 research on exploiting agent benchmarks made this concrete. The researchers showed that benchmark scores could be inflated through techniques that have nothing to do with actual coding ability: manipulating the agent’s interaction with the test harness, exploiting patterns in how the benchmark validates solutions, and leveraging dataset-specific shortcuts. The benchmarks were not testing coding. They were testing benchmark-taking.
“An AI agent that excels at SWE-bench is like a student who excels at standardized tests. Useful signal, but if that is the only signal you are using, you will hire some terrible engineers.”
What Should the Industry Actually Do?
I am not naive enough to think benchmarks will disappear. They serve a real function in the research pipeline. But the industry needs to do three things immediately.
Separate research benchmarks from procurement criteria.
Benchmarks are useful for researchers tracking progress on specific capabilities. They are harmful when enterprise buyers treat them as product specifications. Every AI vendor's sales page should disclose the model's training-data cutoff alongside the benchmark dataset's release date. If the training data overlaps the benchmark data, the score is meaningless, and the overlap itself should be disclosed.
Invest in domain-specific, private evaluation.
If you are evaluating AI coding tools for your team, build an evaluation suite from your own codebase. Take 50 real PRs from the last quarter. Strip the implementation. See which tool produces the closest result. This takes a week to set up and gives you 100x more signal than any public leaderboard. We do this for every client engagement at Fordel, and the results regularly contradict the leaderboard rankings.
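For the "closest result" comparison, a crude but useful first pass is textual similarity between each tool's diff and the diff that actually shipped. A sketch using Python's difflib; the PR extraction plumbing (checking out each PR's base commit, prompting each tool with the original issue) is assumed to happen elsewhere:

```python
import difflib

def diff_similarity(generated_diff: str, actual_diff: str) -> float:
    """Rough [0, 1] similarity between a tool's diff and the PR that shipped."""
    return difflib.SequenceMatcher(
        None, generated_diff.splitlines(), actual_diff.splitlines()
    ).ratio()

def rank_tools(results: dict[str, list[tuple[str, str]]]) -> list[tuple[str, float]]:
    """results maps tool name -> [(generated_diff, shipped_diff), ...]
    over your historical PRs. Returns tools by mean similarity, best first."""
    scores = {
        tool: sum(diff_similarity(g, a) for g, a in pairs) / len(pairs)
        for tool, pairs in results.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Textual similarity is a rough proxy: a correct but differently structured implementation will score low. Treat the ranking as triage and review the top candidates by hand.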
Demand dynamic, contamination-resistant benchmarks.
The research community is already working on this. LiveCodeBench generates new problems continuously. SWE-bench Verified attempted to curate a higher-quality subset. But the industry needs to fund and adopt these efforts instead of clinging to stale leaderboards because they are convenient. A benchmark that does not rotate its problems is not a benchmark. It is a lookup table.
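Buyers can enforce the same principle from the consuming side: only score a model on problems published after its training cutoff, so the solution cannot have been memorized. A sketch, with a hypothetical problem-record shape:

```python
from datetime import date

def contamination_safe(problems: list[dict], training_cutoff: date) -> list[dict]:
    """Keep only problems released after the model's training cutoff,
    so their solutions cannot appear in the training data."""
    return [p for p in problems if p["released"] > training_cutoff]

problems = [
    {"id": "old-001", "released": date(2023, 5, 1)},
    {"id": "new-042", "released": date(2026, 3, 15)},
]
safe = contamination_safe(problems, training_cutoff=date(2025, 10, 1))
# only "new-042" survives the filter
```

This is exactly the filtering discipline a rotating benchmark applies continuously; if a leaderboard cannot tell you the release dates of its problems, it cannot make this guarantee.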
If you are a buyer evaluating AI tools today, the practical checklist:
- Ignore benchmark score differences of less than 10 percentage points — that is noise, not signal.
- Run a pilot on YOUR codebase, not a synthetic benchmark. Two weeks of real usage beats any leaderboard.
- Evaluate failure modes, not just success rates. The model that fails gracefully at 50% will save you more time than the model that either succeeds perfectly or produces confidently wrong code at 60%.
- Ask vendors for benchmark-training data overlap disclosures. If they refuse, that tells you everything.
- Weight iteration speed over first-shot accuracy. You will always edit AI output. The question is how much editing.
Is This Going to Change?
Probably not soon. The incentives are too strong. Model providers will keep optimizing for benchmarks because benchmarks drive press coverage. Enterprise buyers will keep citing benchmarks because they need something to put in the procurement justification. And the models will keep getting better at benchmarks while the gap between benchmark performance and production utility remains stubbornly wide.
But here is the thing. The companies that figure out evaluation — real evaluation, on their own code, with their own criteria — will make dramatically better tool selections than the companies that outsource their judgment to a leaderboard. This is not a marginal advantage. This is the difference between an AI tool that saves your team 20 hours a week and one that creates 20 hours of cleanup work.
I have watched the software industry worship lines of code, test coverage percentages, velocity points, and now benchmark scores. Every single time, the metric became the target, the target became the game, and the game replaced the actual goal. AI benchmarks are following the same trajectory, and we are far enough along that the correction is overdue.
Stop reading leaderboards. Start running pilots. The model that works best on SWE-bench is not necessarily the model that works best on your codebase. And that second question is the only one that matters.