Cursor Composer 2.5: When the Code Editor Builds Its Own Model

Cursor shipped Composer 2.5 on May 18, matching Opus 4.7 on SWE-Bench Multilingual at a fraction of the cost.

May 19, 2026

There is a quiet inversion happening in the developer tools market, and Cursor's release of Composer 2.5 on May 18 is the clearest illustration of it yet. The conventional shape of the AI coding business since 2023 has been straightforward: the labs build models, the editors wrap them, and the editors take a margin on top of the per-token bill. The editor is the front end and the model is the work; the editor's job is to make the model easier to use, faster to integrate, and less awkward to interrupt. That arrangement made sense when the labs were the only credible source of frontier-grade code generation. It makes less sense in May 2026, with open-weight checkpoints from second-tier labs reaching close enough to frontier capability that an editor with strong product instincts and enough engineering capacity can take a checkpoint, post-train it on twenty-five times as many synthetic training tasks as the previous generation used, and ship a model that holds its own against the labs whose work it was once paying to wrap.

Composer 2.5 is Cursor's most explicit move in that direction. The team built it on top of an open-weight checkpoint from Moonshot — the Kimi K2.5 base model that has been quietly powering large fractions of the open-source coding community since late 2025 — and trained it on a synthetic task generator that produced roughly twenty-five times the number of training tasks used for Composer 2. The result is a model that ties or comes within a percentage point of Claude Opus 4.7 on two of the three benchmarks Cursor reports, beats GPT-5.5 on the multilingual code change benchmark, and costs an order of magnitude less per million tokens to run. For the next week, Cursor is doubling the included usage of the model on every paid plan, which is the kind of confidence statement a company makes when the unit economics of its own model are substantially better than the unit economics of paying for someone else's.

The numbers worth dwelling on come from three benchmarks Cursor has positioned as its primary scorecard. Terminal-Bench 2.0 measures the model's ability to operate autonomously inside a terminal, executing commands, interpreting output, and recovering from errors over multi-step tasks. Composer 2.5 hits sixty-nine point three percent on it. Opus 4.7 hits sixty-nine point four. The gap is one tenth of a percentage point, which on a benchmark like this is effectively noise. GPT-5.5 sits well above both at eighty-two point seven, which reflects OpenAI's heavy investment in agentic terminal workloads through the Codex line and is consistent with the result from the GPT-5.3-Codex launch in February. On Terminal-Bench, Composer 2.5 is competitive with the second-best frontier model and noticeably behind the leader, which is a respectable place for an editor-built model to land in its second major version.

SWE-Bench Multilingual is the more interesting result. The benchmark measures the model's ability to handle real repository-level code changes across a mix of programming languages, not just Python, which is closer to what most production engineering teams actually do than the original Python-only SWE-Bench. Composer 2.5 hits seventy-nine point eight percent on it. Opus 4.7 sits at eighty point five, GPT-5.5 at seventy-seven point eight, and the previous Composer at seventy-three point seven. The pattern is that Composer 2.5 is within seven tenths of a percentage point of the model that Anthropic charges seventy-five dollars per million output tokens for, and two points ahead of the model that OpenAI charges fifteen dollars per million output tokens for. The price-performance ratio implied by those numbers is what makes the release strategically consequential rather than merely incremental. Cursor is not selling a worse model for less money. It is selling a comparable model for substantially less money, on a benchmark that maps reasonably well to what its actual users do all day.
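The gaps quoted for the two public benchmarks are differences in percentage points, not relative percentages. A minimal sketch of that arithmetic follows, using only the scores reported above; the dictionary layout and the `gap_points` helper are illustrative names, not anything Cursor publishes.

```python
# Benchmark scores (percent) as quoted in this article.
SCORES = {
    "Terminal-Bench 2.0": {
        "Composer 2.5": 69.3,
        "Opus 4.7": 69.4,
        "GPT-5.5": 82.7,
    },
    "SWE-Bench Multilingual": {
        "Composer 2.5": 79.8,
        "Opus 4.7": 80.5,
        "GPT-5.5": 77.8,
        "Composer 2": 73.7,
    },
}


def gap_points(benchmark, model, baseline="Composer 2.5"):
    """Score of `model` minus score of `baseline`, in percentage points."""
    scores = SCORES[benchmark]
    # Round to one decimal to hide floating-point residue.
    return round(scores[model] - scores[baseline], 1)


for benchmark, scores in SCORES.items():
    for model in scores:
        if model != "Composer 2.5":
            delta = gap_points(benchmark, model)
            print(f"{benchmark}: {model} vs Composer 2.5 = {delta:+.1f} points")
```

A positive value means the other model scores higher than Composer 2.5; a negative value means Composer 2.5 is ahead.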

CursorBench v3.1, the third score in the release, is Cursor's internal benchmark of harder tasks — the ones their own users routinely struggle with and the ones the team has been refining since the original Composer release. Composer 2.5 hits sixty-three point two percent on it. Opus 4.7 in its maximum-reasoning configuration scores sixty-four point eight, and in its default xhigh configuration sixty-one point six. GPT-5.5 at xhigh hits sixty-four point three and at medium default hits fifty-nine point two. The Composer 2 score was fifty-two point two. The shape of these numbers tells a story that the simpler benchmark comparisons miss: Composer 2.5 in its default configuration is ahead of both Opus 4.7 and GPT-5.5 in their default configurations, and only behind them when the more expensive models are explicitly cranked up to their maximum-cost reasoning modes. For the typical interactive coding session — where the model is asked to make a change, run a test, fix the error, and iterate — default-mode performance is what matters. Maximum-reasoning configurations come into play for occasional long-horizon problems where the user is willing to wait a minute and pay several times the normal cost. Default mode is the bulk of the work, and in default mode Composer 2.5 is the leader on this benchmark.
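The default-versus-maximum distinction is easier to see when the CursorBench v3.1 scores are ranked twice, once over every configuration and once over default configurations only. The sketch below uses the figures quoted above; the tuple layout and the `ranked` helper are hypothetical, and treating the Composer 2 score as a default-configuration result is an assumption, since the article does not state its configuration.

```python
# CursorBench v3.1 scores as quoted in this article.
# Each entry: (model, configuration, is_default, score in percent).
RESULTS = [
    ("Composer 2.5", "default", True, 63.2),
    ("Opus 4.7", "max", False, 64.8),
    ("Opus 4.7", "xhigh", True, 61.6),
    ("GPT-5.5", "xhigh", False, 64.3),
    ("GPT-5.5", "medium", True, 59.2),
    # Assumption: the article gives no configuration for Composer 2.
    ("Composer 2", "default", True, 52.2),
]


def ranked(results, defaults_only=False):
    """Return results sorted by score, highest first.

    With defaults_only=True, keep only each model's default configuration.
    """
    rows = [r for r in results if r[2]] if defaults_only else list(results)
    return sorted(rows, key=lambda r: r[3], reverse=True)


print("All configurations:")
for model, config, _, score in ranked(RESULTS):
    print(f"  {score:5.1f}  {model} ({config})")

print("Default configurations only:")
for model, config, _, score in ranked(RESULTS, defaults_only=True):
    print(f"  {score:5.1f}  {model} ({config})")
```

Ranked over all configurations, Composer 2.5 comes third behind the two maximum-reasoning runs; restricted to defaults, it comes first.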

The pricing structure that Cursor has chosen for Composer 2.5 makes the strategic intent unambiguous. The standard tier costs fifty cents per million input tokens and two dollars fifty per million output tokens. The fast tier, optimized for low latency and used as the default for interactive coding, costs three dollars per million input and fifteen dollars per million output. Compare those numbers against Opus 4.7's fifteen dollars per million input and seventy-five dollars per million output and the gap becomes the entire story. On output tokens specifically, Composer 2.5 standard is thirty times cheaper than Opus 4.7. Even the fast tier, which is the realistic comparison point for interactive work, is five times cheaper on output. The same ratios hold on input tokens, so a developer who runs Composer 2.5 in fast mode all day pays roughly a fifth of what the same token volume would cost on Opus 4.7, and gets comparable benchmark results on the work they actually do.
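The price ratios follow directly from the listed per-million-token rates. The sketch below reproduces them and prices one day of usage; the model keys, the `cost_usd` and `output_price_ratio` helpers, and the daily token counts are all hypothetical values chosen for illustration, not figures from Cursor or Anthropic.

```python
# Per-million-token prices in USD as quoted in this article: (input, output).
PRICES = {
    "composer-2.5-standard": (0.50, 2.50),
    "composer-2.5-fast": (3.00, 15.00),
    "opus-4.7": (15.00, 75.00),
}

# Hypothetical daily workload for one developer (illustrative only).
DAILY_INPUT_TOKENS = 4_000_000
DAILY_OUTPUT_TOKENS = 500_000


def cost_usd(model, input_tokens, output_tokens):
    """Cost of a workload in USD at the listed per-million-token prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def output_price_ratio(expensive, cheap):
    """How many times more `expensive` charges per output token than `cheap`."""
    return PRICES[expensive][1] / PRICES[cheap][1]


for model in PRICES:
    daily = cost_usd(model, DAILY_INPUT_TOKENS, DAILY_OUTPUT_TOKENS)
    print(f"{model}: ${daily:.2f} per day")

print(output_price_ratio("opus-4.7", "composer-2.5-standard"))
print(output_price_ratio("opus-4.7", "composer-2.5-fast"))
```

Because the input and output prices differ by the same factor in each pairing, the total cost ratio does not depend on the mix of input and output tokens assumed here.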

The architectural choice that made this possible is worth being explicit about. Cursor did not train a model from scratch. Training a frontier-grade base model still costs roughly a hundred million dollars and requires a research operation Cursor does not have. What Cursor did do is take Moonshot's Kimi K2.5 base — itself a strong open-weight model that has been benchmarked into the top fifteen on Arena since late 2025 — and apply heavy post-training using synthetic tasks generated specifically to teach the model the patterns Cursor's users actually exercise. Twenty-five times more synthetic tasks than the previous Composer, with each task constructed to match the shape of the work the model would be asked to do in production. That is a different kind of training problem from a labs-style base model run, and it costs a different kind of money. It is plausibly within reach for any well-funded developer tools company with a strong eval pipeline and access to enough GPU capacity to run a serious post-training cycle. Cursor demonstrated that the wrap-the-labs strategy was a transitional model, not a permanent one.

The strategic implication for Anthropic, OpenAI, and Google is more subtle than it first appears. Composer 2.5 does not threaten the frontier labs directly, in the sense that the frontier labs are still ahead at the absolute top of capability. What Composer 2.5 threatens is the labs' captive demand from the layer of the market that has been their highest-margin business: the bulk of professional developer interactive coding sessions, where the customer is willing to pay for quality but does not need the absolute frontier and is increasingly conscious of the bill. If Cursor can serve that demand from an in-house model at between a fifth and a thirtieth of the token price, the labs are left with the workloads at the top of the capability curve (important workloads, but smaller in volume) and the workloads at the bottom of the cost curve, where the open-weight providers like DeepSeek are already winning. The middle, which had been the labs' commercial sweet spot, gets squeezed from both sides. Cursor making Composer 2.5 the default reasoning model for its product means a substantial fraction of the developer-tools API spend that previously flowed to Anthropic and OpenAI stops flowing through them at all.

The other dimension that makes Composer 2.5 worth paying attention to is what Cursor says it is better at, and the language they used in the announcement is specific enough to be worth taking seriously. The team called out three things. Composer 2.5 is described as more intelligent, which is a vague claim but a familiar one. It is described as better at sustained work on long-running tasks, which is the more interesting claim because it points at the multi-turn agentic workflows where models have historically lost coherence after a few dozen steps. And it is described as more reliable at following complex instructions, which is the claim that most directly maps to what professional users complain about when they are unhappy with a model. Anyone who has shipped a Composer-class product knows the actual user complaints are rarely about raw capability; they are about the model wandering off the instruction, ignoring constraints, or producing the right answer to the wrong question. If 2.5 has measurably improved on that axis, the user-perceived quality improvement will be larger than the benchmark numbers suggest, because benchmarks do not capture instruction-following discipline very well.

The doubled-usage promotion for the first week is the other tell. Cursor is making the bet that getting their existing user base to actually try Composer 2.5 — rather than reflexively reaching for Opus 4.7 the way most of them have been doing since April — will produce a behavioral shift that lasts after the promotion ends. The economics of that bet are straightforward. If a meaningful fraction of users move their default model from Opus to Composer after the first week of comparison, Cursor's gross margin on every coding session improves dramatically, because they own the model instead of paying Anthropic for it. The promotion is essentially a one-week customer acquisition campaign aimed at the company's existing customers, structured as a usage cap increase rather than as a discount. It is a more sophisticated move than it might look at first glance, and the fact that the team felt confident enough to run it tells you what they think the comparison is going to look like.

The next twelve months in the developer tools market are going to be defined by how many other editors follow this playbook. The answer is almost certainly several of them. Replit has the infrastructure, the user base, and the engineering depth to do something similar. GitHub Copilot, which has been quietly extending its proprietary capabilities beneath the brand, has the resources to fine-tune a base model into a Copilot-specific variant if Microsoft decides the economics justify it. JetBrains has the position to do it for the IDE tooling segment specifically. None of these moves require building a base model. They require a credible open-weight checkpoint to start from, enough engineering capacity to run a serious post-training cycle, and the strategic conviction to stop being a passive consumer of the labs' API output. Cursor has now demonstrated all three.

The labs will respond, and the response is predictable enough that it is worth previewing. Expect pricing pressure on the mid-tier output token rate from at least one of the major labs in the next sixty days. Expect more aggressive enterprise discounting for the developer tools segment specifically, possibly tied to commitments around model exclusivity. Expect at least one of the labs to ship a tighter, cheaper, code-specialized variant of their frontier model with pricing positioned somewhere between their current premium tier and the in-house alternatives the editors are building. None of those responses change the underlying dynamic — the labs are no longer the only credible source of frontier-grade coding capability — but they will slow the rate at which the developer tools segment shifts away from them, and they will keep the labs commercially relevant in the segment for at least another generation.

The release that matters more than Composer 2.5, in the long run, is whichever editor goes next. If the move that worked for Cursor works for the next two or three editors that try it, the developer tools API business as the labs have understood it for the last three years stops existing in roughly its current form. That is a fast transition for a market that did not see this particular threat coming.

For how multi-model routing layers handle a world where editors increasingly own their own coding models, explore Swfte Connect. For the full May 2026 model pricing context, see our AI API pricing trends post.
