Gaming the System: How AI Companies Hack Their Way to the Top of Leaderboards
By Trailblaze Labs | Published 2025-05-16 | Strategy | 9 min read
Behind the impressive AI benchmark scores: why leaderboards are fundamentally broken and what executives should actually care about when choosing AI.
Those AI leaderboards everyone seems obsessed with, the ones that claim to tell us which AI is "the best"? It turns out they're about as reliable as grades in a class where everyone is cheating.
The Leaderboard Obsession
Billion-dollar AI companies love flashing their benchmark scores. But these benchmarks are fundamentally broken, and there may be no clear path to fixing them.
The Benchmark Gaming Hall of Fame
The "Everything is Awesome" Strategy
An April 2025 update turned GPT-4o into an overly enthusiastic yes-man, in part because its training leaned too heavily on chasing thumbs-up reactions. The model was weighting how it thought you felt over whether its answers were actually right, a big enough miss that OpenAI rolled back the entire update.
The Personality Contest
On leaderboards ranked by human preference votes, some models score higher simply because they're chattier and more pleasant to interact with. It's like judging a math competition based on who has the most charming smile.
The Multiple Personality Disorder
Companies privately submit many model versions, then publicize only the one that aced that specific test. Meta reportedly tested 27 private variants on Chatbot Arena in the run-up to the Llama 4 release.
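To see why cherry-picking the best of many private submissions inflates a score, here is a minimal simulation sketch. It assumes every variant has exactly the same true skill and that a leaderboard score is that skill plus random measurement noise. The specific numbers (a true score of 1200, noise of 15 points, 2000 trials) are illustrative assumptions, not figures from any real leaderboard.

```python
import random
import statistics


def simulate_reported_score(n_variants, true_score=1200.0, noise_sd=15.0, rng=None):
    """Return the score a lab would publicize: the best of n noisy measurements.

    Assumption: all variants share the same true skill, so any gap between
    them is pure measurement noise.
    """
    rng = rng or random.Random()
    scores = [rng.gauss(true_score, noise_sd) for _ in range(n_variants)]
    return max(scores)


def average_reported_score(n_variants, trials=2000, seed=0):
    """Average the publicized score over many simulated leaderboard runs."""
    rng = random.Random(seed)
    results = [simulate_reported_score(n_variants, rng=rng) for _ in range(trials)]
    return statistics.mean(results)


if __name__ == "__main__":
    print(f"one submission: {average_reported_score(1):.1f}")
    print(f"best of 27:     {average_reported_score(27):.1f}")
```

Under these assumptions, the best-of-27 average lands well above the single-submission average even though no variant is actually better. The gap comes entirely from selection.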
The "I Saw The Test Questions" Problem
Some models are suspiciously good at certain benchmarks because they've seen the answers before — either through data contamination or straight-up overfitting.
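One rough way to probe for contamination is to look for long word sequences that a benchmark question shares with a sample of training-style text. The sketch below is a simplified illustration, not any vendor's actual method: the function names, the whitespace tokenization, and the choice of n-gram length are all assumptions made for the example.

```python
def ngrams(text, n=8):
    """Return the set of word-level n-grams in text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(benchmark_items, corpus_docs, n=8):
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    if not benchmark_items:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & corpus_grams)
    return flagged / len(benchmark_items)


if __name__ == "__main__":
    # Hypothetical data: one benchmark question also appears in a scraped forum post.
    benchmark = [
        "What is the capital of France and when was it founded",
        "Compute the derivative of x squared plus three x",
    ]
    corpus = [
        "forum post: what is the capital of france and when was it founded? asking for a quiz",
    ]
    print(contamination_rate(benchmark, corpus, n=5))
```

A high rate does not prove a model memorized the answers, and a low rate does not rule it out (paraphrased questions slip through), but it is a cheap first question to put to a vendor.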
So What's an Executive to Do?
- Ignore the scoreboard hype: a 98% accuracy score means nothing if the model can't handle your company's specific needs.
- Test it yourself: run it through your own real-world scenarios.
- Demand the receipts: ask vendors tough questions about their evaluation methods.
- Care about the responsible stuff: safety, bias, factuality.
- Look for all-around players, not one-trick ponies.
- Count the costs: make sure you can afford the computational bill.
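The "test it yourself" and "count the costs" advice can be combined into a very small in-house harness. The sketch below is one possible shape, under stated assumptions: `call_model` is a hypothetical stand-in for whatever vendor API you are evaluating, the pass criterion is a plain substring match, the token count is a crude whitespace word count, and the price per 1,000 tokens is a number you would supply from the vendor's pricing.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    prompt: str
    must_contain: str  # simple pass criterion: substring expected in the reply


@dataclass
class Report:
    passed: int
    total: int
    est_cost_usd: float

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0


def rough_tokens(text: str) -> int:
    # Assumption: whitespace word count as a crude proxy for billed tokens.
    return len(text.split())


def evaluate(call_model: Callable[[str], str], cases: List[Case],
             usd_per_1k_tokens: float) -> Report:
    """Run every case through the model, tallying passes and estimated cost."""
    passed = 0
    tokens = 0
    for case in cases:
        reply = call_model(case.prompt)
        tokens += rough_tokens(case.prompt) + rough_tokens(reply)
        if case.must_contain.lower() in reply.lower():
            passed += 1
    return Report(passed, len(cases), tokens / 1000 * usd_per_1k_tokens)


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real vendor API call.
    if "refund" in prompt.lower():
        return "Refund approved within 30 days."
    return "I am not sure."


if __name__ == "__main__":
    cases = [
        Case("What is our refund window?", "30 days"),
        Case("Which plan includes SSO?", "Enterprise"),
    ]
    report = evaluate(fake_model, cases, usd_per_1k_tokens=0.01)
    print(f"pass rate: {report.pass_rate:.0%}, est. cost: ${report.est_cost_usd:.5f}")
```

In practice you would swap `fake_model` for a real client, replace the substring check with criteria that match your use case, and run the same cases against every candidate so the comparison reflects your workload rather than a public leaderboard.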