Score Higher, Feel Worse: Why AI Coding Benchmarks Might Be Measuring the Wrong Things

Tags: programming, cursor

Preface

I want to be upfront: I don't have all the answers here. I'm not a researcher, I'm not on Cursor's team, and I don't have insider knowledge about how their models are trained or how their business decisions get made. What I have is a lot of hours inside Cursor, a lot of tokens burned, and a growing list of observations that don't add up when I look at the official narrative.

So instead of writing a takedown, I want to lay out what I've observed, what I think might be happening, and some questions I genuinely want answered. If I'm wrong, I'd love to be corrected.

The Why

A few weeks ago I committed to Cursor's yearly Ultra plan. $2k. I did that because Composer 1.5 was the best AI coding experience I'd had. It followed my patterns, respected my input, and stayed out of the way. It felt like a genuine extra pair of hands.

On March 26, I got this notice:

On April 3, 2026, you will be automatically moved from Composer 1.5 to Composer 2, which is a smarter and cheaper model. No action is needed. We are retiring Composer 1.5 on April 3, 2026.

Smarter and cheaper. No mention of the things that made 1.5 good. No acknowledgment that the experience might be different. Just a one-way migration, with 8 days' notice, for a model I'd been building my entire workflow around.

I posted about it on the Cursor forum. Nobody responded. So I started writing down everything I'd been noticing, and the more I laid it out, the more it looked like a story worth telling beyond a forum thread.

The Observations

Observation 1: Composer 1.5 followed instructions. Composer 2 doesn't.

The month before Composer 2 launched, I pushed nearly 3 billion tokens through Composer 1.5. That was my daily driver. It was the reason I committed to the Ultra plan.

Since Composer 2 dropped, I've been running both models side by side. In the last week and a half alone, I've pushed over 560 million tokens through Cursor. Roughly 410M on Composer 2/2-fast, 135M on 1.5, and 19M on Opus for the occasional hard problem.

I gave Composer 2 three times the token volume. I keep going back to 1.5.

The reason is simple: Composer 2 doesn't do what I ask. It does what it thinks I should have asked. I give it exact text to insert into a component: specific wording, specific structure, labels like "What you are looking at:" and "Why these numbers matter:" that are intentional UX decisions. Composer 2 rewrites my copy anyway. It drops things I included on purpose, rephrases things I didn't ask it to rephrase, and collapses structure I intentionally separated.

Composer 1.5 reads my codebase, matches my patterns, and uses my input as given. When I hand it literal text, it treats it as literal text. The thinking tokens that 1.5 introduced gave the model just enough reasoning to lock in and move through a full feature slice of a project. It was dialed in.

To be fair, Composer 2 isn't bad at everything. It's fast. On code, it can get results similar to Opus on some tasks while thinking more like a developer, tracing through the code, considering edge cases. If you don't know where to start on something or don't have existing references for it to follow, Composer 2 is arguably more useful than 1.5. But the content it produces is a different story. The wording choices alone tell you it's Kimi under the hood. And when you're working in an established codebase where you need the model to follow your lead, not take its own, the compliance gap is a dealbreaker. Capability and compliance are two different things, and right now Composer 2 has more of the first and less of the second.

Observation 2: The timeline is suspicious.

Here are the dates:

  • October 2025: Composer 1 launches. MoE architecture, base model never disclosed. Sasha Rush (Cursor's head of research) dodges the question when asked directly.
  • February 9, 2026: Composer 1.5 ships. Same base model as Composer 1, 20x more RL compute, introduces thinking tokens and self-summarization. Priced at $3.50/$17.50 per million tokens.
  • February 11, 2026: Cursor creates a separate "Composer" usage pool with "significantly more usage" and temporarily offers 6x usage limits.
  • March 19, 2026: Composer 2 launches on a completely different base model (Kimi K2.5). Priced at $0.50/$2.50 per million tokens.
  • ~March 25, 2026: The docs page for Composer 1.5 is quietly unlisted. You can still reach it if you know the URL, but it's no longer linked from the models page.
  • March 26, 2026: Deprecation notice. Composer 1.5 dies April 3.

It took ~3.5 months of careful RL scaling to go from Composer 1 to 1.5. Then 5.5 weeks later, an entirely new model on a completely different foundation ships and the old one is sentenced to death. The docs were already being scrubbed before users even got the deprecation email. This wasn't a decision that was still being considered. It was already made.

Observation 3: The pricing changed dramatically.

Composer 2 is 7x cheaper on input tokens and 7x cheaper on output tokens than 1.5. Kimi K2.5, the base model for Composer 2, is open-source, meaning the foundation is essentially free to build on.

Observation 4: Cursor claims to measure what I'm saying is broken.

From Cursor's own CursorBench blog post, they say the benchmark measures "solution correctness, code quality, efficiency, and interaction behavior." They also say they measure "adherence to a codebase's existing abstractions and software engineering practices." They even describe supplementing benchmarks with live traffic experiments to "catch regressions where the agent's output scores well under an offline grader, but doesn't actually work well for developers."

These are their words. The exact thing I'm experiencing, a model that doesn't follow my input, that rewrites my structure, that proactively adds things I never asked for, should be getting flagged by their own evaluation criteria. Either it isn't, or it is and they're shipping anyway.

Observation 5: The base model was hidden.

Cursor didn't mention Kimi K2.5 in the Composer 2 launch post. A developer found the model ID kimi-k2p5-rl-0317-s515-fast in API headers. Cursor acknowledged it after the fact and called the omission "a miss." Co-founder Aman Sanger said they should have been more transparent. They claim about 25% of the compute came from the Kimi base, with the rest from their own training.

I'm not here to relitigate the attribution controversy. Other people have covered that thoroughly. But it's relevant context for the bigger question.

The Theory

I want to be clear: this is speculation. I can't prove this. But the timeline tells a story if you're willing to read it.

I think Composer 1.5 was too expensive for Cursor to sustain.

They created a generous Composer usage pool on February 11, just two days after 1.5 launched. I'm on the yearly Ultra plan at $160/month. In my heaviest month with 1.5, I pushed nearly 3 billion tokens and racked up around $1,200 in total usage across the Composer and API buckets. Cursor collected $160 from me and ate over $1,000 in compute costs. From one user. In one month. Multiply that across their power user base and the "significantly more usage" promise starts looking like a problem.
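To put rough numbers on that, here's a back-of-envelope sketch using only the figures above (my plan price and my metered usage for that month). The power-user count at the end is a made-up illustration, not anything I know about Cursor's actual customer base or unit economics.

```python
# Back-of-envelope: what one heavy Composer 1.5 user cost Cursor in a month.
# plan_price and metered_usage come from my own billing; the power-user count
# below is purely hypothetical.

plan_price = 160        # USD/month, yearly Ultra plan
metered_usage = 1_200   # USD of compute consumed across the Composer and API buckets

subsidy_per_user = metered_usage - plan_price
print(f"Subsidy for one user: ${subsidy_per_user:,}/month")  # ~$1,040

# Hypothetical scale-up: if even 5,000 users looked like this,
hypothetical_power_users = 5_000
print(f"Across {hypothetical_power_users:,} such users: "
      f"${subsidy_per_user * hypothetical_power_users:,}/month")  # ~$5.2M
```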

Kimi K2.5 offered a way out: an open-source base model with frontier-level capability at a fraction of the cost. Layer RL on top, run it through CursorBench, see the numbers go up (38.0 to 44.2 to 61.3 across three generations), and you have the justification you need to frame a cost-driven swap as an intelligence upgrade.

I don't think the benchmarks drove this decision. I think the economics drove the decision, and the benchmarks provided cover.

Campbell's Law

There's a concept in social science called Campbell's Law: the more a metric is used to make decisions, the more it distorts the thing it was supposed to measure.

The classic example is standardized testing. Test scores are a decent indicator of education quality when teachers are teaching normally. But when you start making funding decisions based on scores, everyone teaches to the test. Scores go up. Actual learning doesn't. The metric stops measuring what it was supposed to measure, but everyone keeps acting like it does.

There's a closely related idea called Goodhart's Law that some readers might know better: "When a measure becomes a target, it ceases to be a good measure." Same principle, different name.

And there's a third related concept called surrogation: when decision-makers start believing the metric IS the thing it measures. "Composer 2 scores higher, therefore it's a better product experience." But the score and the experience are two different things.

CursorBench might be a good benchmark. Cursor has clearly thought about evaluation more carefully than most. They use real developer sessions, they track multiple dimensions, they supplement with live traffic experiments. That's more rigorous than most of the industry.

But when CursorBench says Composer 2 scores 61.3 vs 1.5's 44.2, and my lived experience says 1.5 follows my instructions and 2 doesn't, the measurement is missing something that matters.

Here's what no coding benchmark I've seen measures:

  • Did the model do only what was asked?
  • Did it preserve structure the user intentionally created?
  • Did it follow existing patterns instead of inventing new ones?
  • Did it use the user's exact text when given exact text?
  • Did it leave files alone that weren't part of the request?

None of them score "restraint." There's no eval for "did the model resist the urge to be helpful in ways nobody asked for."

The Bigger Picture

This isn't just a Cursor problem. Every model provider is chasing benchmark scores because that's what gets press, that's what lands on leaderboards, that's what sells. And the things that actually make a coding tool usable day to day (does it listen, does it stay in its lane, does it respect that the developer is the one making decisions) are invisible to the metrics.

The irony is that Cursor itself seems to understand this. They write about catching regressions "where the agent's output scores well under an offline grader, but doesn't actually work well for developers." That's literally my complaint. So either their detection isn't working, or the economic pressure to ship a cheaper model overrode what the detection was telling them.

I don't know which one it is. I'm asking.

And here's the part that's hard to say out loud: I can't even threaten to leave. There is no better option right now. Windsurf has its own issues. Claude Code is a different workflow entirely. Copilot is less capable. Cursor with Composer 1.5 was the best coding experience I've had, and nothing else matches it. That doesn't make this okay. It makes it worse. They have a captive audience and they're removing the thing that made some of us choose to be captive.

What I'd Like to See

Maybe the answer is building evals that test compliance alongside capability:

  • "Here's exact text. Did the model use it verbatim?"
  • "The user asked to change one file. Did the model touch others?"
  • "The codebase uses pattern X. Did the model follow it or invent pattern Y?"
  • "The user gave specific structure. Did the model preserve it?"

That stuff is measurable. It's just not being measured, or if it is, it's not being weighted heavily enough to prevent what I'm experiencing.
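To make that concrete, here's a minimal sketch of what a few of these compliance checks could look like. This is my own illustration, not anything from CursorBench or any real harness: the class names, fields, and the example task are all invented for the sake of the sketch.

```python
"""A toy compliance eval: did the agent only do what was asked?

Everything here is illustrative. `AgentResult`, `ComplianceTask`, and the check
names are my own inventions, not part of CursorBench or any real harness.
"""
from dataclasses import dataclass


@dataclass
class AgentResult:
    files_touched: set[str]   # files the agent actually modified
    final_text: str           # contents of the file the user asked about


@dataclass
class ComplianceTask:
    allowed_files: set[str]        # files the request covered
    verbatim_snippets: list[str]   # exact text the user supplied
    required_patterns: list[str]   # existing codebase patterns to reuse


def score_compliance(task: ComplianceTask, result: AgentResult) -> dict[str, bool]:
    """One boolean per compliance dimension; capability would be scored elsewhere."""
    return {
        # Did it leave unrelated files alone?
        "stayed_in_scope": result.files_touched <= task.allowed_files,
        # Did it use the user's exact text rather than a paraphrase?
        "used_text_verbatim": all(s in result.final_text for s in task.verbatim_snippets),
        # Did it reuse the existing pattern instead of inventing a new one?
        "followed_patterns": all(p in result.final_text for p in task.required_patterns),
    }


# Example: the user supplied two exact labels and asked for one file to change.
task = ComplianceTask(
    allowed_files={"src/components/Stats.tsx"},
    verbatim_snippets=["What you are looking at:", "Why these numbers matter:"],
    required_patterns=["useQuery("],
)
result = AgentResult(
    files_touched={"src/components/Stats.tsx", "src/components/Stats.test.tsx"},
    final_text="<h3>What you're looking at</h3> ... useQuery(statsQuery) ...",
)
print(score_compliance(task, result))
# {'stayed_in_scope': False, 'used_text_verbatim': False, 'followed_patterns': True}
```

The reason I'd want per-dimension booleans rather than a single number is exactly the Campbell's Law problem above: compliance failures stay visible instead of being averaged away inside one capability score.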

And maybe the answer isn't just better benchmarks. Maybe it's keeping models available that serve different workflows. Composer 2 is probably great for new projects, for developers who want more guidance, for greenfield work where the model's proactive suggestions are a feature rather than a bug. But in an established codebase with existing patterns and conventions, a model that proactively adds and changes things you didn't ask for is a risk. That's how subtle bugs get introduced. That's how consistency breaks down.

Not every developer needs the smartest model. Some of us need the most obedient one. 1.5 was that model. And right now, I'm watching it get killed in 8 days because a benchmark said something else was better.

Until benchmarks can measure compliance as seriously as they measure capability, we're going to keep shipping models that score higher and feel worse.

I'm a developer who just committed $2k to Cursor's yearly Ultra plan because Composer 1.5 was working for me. I'd love to be wrong about all of this. If someone from Cursor or anyone with deeper knowledge of how these models are evaluated wants to correct my understanding, I'm genuinely all ears.

Update — March 28, 2026

I posted a condensed version of this article on r/cursor using their Venting flair. Within a few hours it was the #1 post on the subreddit.

Cursor's mod team left this reply:

Composer 2 is smarter, faster, and includes more usage than Composer 1.5. We've validated this through evals, benchmarks, and A/B tests in production. To support the growing demand and capacity for Composer 2, we're upgrading every 1.5 user to 2.

If you are seeing inconsistent results, you might consider starting a new conversation to reset the context window.

They then disabled replies to their comment and removed the post.

No notification. No rule cited. No explanation. The #1 post on the subreddit, from a paying Ultra customer, with specific data and a fair assessment of both models, removed without a word.

I wrote 2,000 words about how benchmarks might be measuring the wrong things. The official response was that benchmarks say it's better. Then the conversation was shut down.

I'll leave it to the reader to decide whether that's a company that's listening to its users or managing a narrative.