
Claude Opus 4.7: 87.6% on SWE-bench and What It Means for AI Agents in 2026


Anthropic released Claude Opus 4.7 on April 16, 2026, and the benchmarks tell a clear story: this is now the most capable publicly available AI model for software engineering and long-horizon agentic tasks. Scoring 87.6% on SWE-bench Verified—up from 80.8% on its predecessor—Claude Opus 4.7 moves decisively ahead of GPT-5.4 and Gemini 3.1 Pro on the metrics that matter most to teams building production AI agents. Three new capabilities—a new effort level, a 3.3× vision upgrade, and a task-budget system—together signal something more important than another benchmark shuffle: AI agents are growing more reliable, more controllable, and more deployable in real business environments.

Photo by Safar Safarov on Unsplash

The Benchmark Story: What 87.6% on SWE-bench Actually Means

SWE-bench Verified is widely regarded as the most rigorous real-world coding benchmark available. Unlike multiple-choice tests, it asks models to fix actual GitHub issues in real open-source repositories—the kind of work a junior or mid-level software engineer does every day. Claude Opus 4.7’s score of 87.6% is not just a new high for the model; it is the highest score ever recorded by a generally available model on this benchmark.

The improvement over Opus 4.6 is equally striking across every tier:

| Benchmark | Claude Opus 4.6 | Claude Opus 4.7 | Change |
|---|---|---|---|
| SWE-bench Verified | 80.8% | 87.6% | +6.8 pp |
| SWE-bench Pro | 53.4% | 64.3% | +10.9 pp |
| CursorBench | 58% | 70% | +12 pp |
| GPQA Diamond | ~88% | 94.2% | ~+6 pp |

SWE-bench Pro is the harder variant of the benchmark, featuring issues that require understanding large codebases and multi-file reasoning. A jump from 53.4% to 64.3% — nearly 11 percentage points — means that Claude Opus 4.7 can now independently resolve problems that would have required human intervention under every previous Opus release. CursorBench, which tests the model inside the Cursor IDE on real developer workflows, rose 12 points to 70%, the first time any model has crossed that threshold.

On GPQA Diamond, which measures graduate-level scientific reasoning across physics, chemistry, and biology, Opus 4.7 scores 94.2% — nearly tied with Gemini 3.1 Pro’s 94.3%. Both models now exceed estimated expert human performance on this benchmark.

What does this mean in practice? A team using Claude Opus 4.7 as its code-review agent can expect it to catch significantly more logic errors, suggest architecture improvements, and resolve pull request issues with less back-and-forth. On complex codebases — the kind that typically require a senior engineer’s familiarity with the system — the model’s accuracy gap versus a human developer has narrowed to its smallest point ever.

Three New Capabilities That Change How Agents Work

Anthropic didn’t just tune the weights—it shipped three distinct features that address real pain points teams hit when running Claude in production agentic pipelines.

1. The xhigh Effort Level

Previous Claude releases offered four effort levels: low, medium, high, and max. Opus 4.7 inserts xhigh between high and max, giving developers a new point on the reasoning-versus-latency curve. Claude Code—Anthropic’s coding CLI—now defaults to xhigh for all plans, including free tiers.

The practical effect is meaningful. Max effort maximizes output quality but adds significant latency, which makes it impractical for anything requiring near-real-time tool calls. High effort is fast but can miss subtle inference chains in complex debugging tasks. xhigh threads the needle: it activates deeper reasoning without the full latency penalty of max. For developers running iterative coding agents, this alone can cut the number of failed loops that require human re-prompting.
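If you call the API directly rather than through Claude Code, selecting an effort level should be a one-line change. The sketch below uses the Anthropic Python SDK, with the caveat that the effort field name and model ID are assumptions (passed through the SDK's extra_body escape hatch); check the API reference for the shipped names.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-7",  # hypothetical model ID; confirm the real one
    max_tokens=4096,
    # "effort" is an assumed field name, sent via extra_body because the SDK
    # may not expose it as a typed argument yet: low | medium | high | xhigh | max
    extra_body={"effort": "xhigh"},
    messages=[{
        "role": "user",
        "content": "This test fails intermittently under load. Trace the race and propose a fix.",
    }],
)
print(response.content[0].text)
```

The same request at max would trade that latency headroom back for marginal quality, which is why xhigh is the sensible default for iterative agent loops.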

2. A 3.3× Vision Upgrade

Claude Opus 4.7 accepts images up to 2,576 pixels on the long edge, equivalent to approximately 3.75 megapixels. Opus 4.6 topped out at 1.15 megapixels. The practical implications are significant in two areas.

First, computer use: the model’s coordinate system now maps 1:1 with actual screen pixels, eliminating the rounding errors that caused misclicks in earlier computer-use agents. Anyone building UI automation with Claude will see an immediate improvement in precision.

Second, enterprise document analysis: scanned contracts, technical drawings, compliance reports, and financial statements often include fine print, footnotes, or dense tabular data that lower-resolution models miss or hallucinate. At 3.75 megapixels, Claude Opus 4.7 can process these documents with the kind of fidelity that enterprise workflows require.
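If your document pipeline ingests scans at arbitrary resolutions, it is worth normalizing to the new ceiling client-side so you control the resampling quality. A small pre-processing sketch with Pillow, assuming the 2,576-pixel long-edge limit quoted above:

```python
from PIL import Image

MAX_LONG_EDGE = 2576  # Opus 4.7's long-edge limit, per the release notes

def fit_to_model_limit(src: str, dst: str) -> None:
    """Downscale an image so its longest side is at most MAX_LONG_EDGE pixels."""
    img = Image.open(src)
    long_edge = max(img.size)
    if long_edge > MAX_LONG_EDGE:
        scale = MAX_LONG_EDGE / long_edge
        img = img.resize(
            (round(img.width * scale), round(img.height * scale)),
            Image.Resampling.LANCZOS,  # high-quality downsampling filter
        )
    img.save(dst)

fit_to_model_limit("scanned_contract.png", "scanned_contract_2576.png")
```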

3. Task Budgets (Beta)

This is the quietest feature in the release notes, and potentially the most important for production deployments. A task budget lets you specify a token ceiling for an entire agentic loop — thinking steps, tool calls, tool results, and the final output combined. The model sees a running countdown and wraps gracefully as the budget approaches rather than cutting off abruptly or exceeding limits unexpectedly.

Task budgets are activated via the task-budgets-2026-03-13 beta header in the API, with a minimum of 20,000 tokens. For teams managing complex, multi-step agents that call external APIs, query databases, and reason over long documents, this feature makes cost and latency predictable in ways that were previously impossible without custom orchestration logic.
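Wiring this in looks like one extra header plus a budget field on the request. The beta header value below comes from the release notes; the task_budget field name is an assumption to verify against the beta documentation:

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",  # hypothetical model ID
    max_tokens=8192,
    # Opt in to the beta via the header named in the release notes.
    extra_headers={"anthropic-beta": "task-budgets-2026-03-13"},
    # Assumed field name for the budget; 20,000 tokens is the documented minimum.
    extra_body={"task_budget": 20_000},
    messages=[{
        "role": "user",
        "content": "Triage the open tickets in this export and draft fixes for the top three.",
    }],
)
```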

The Hidden Cost: Read the Tokenizer Change First

Anthropic kept pricing flat — $5 per million input tokens, $25 per million output tokens — the same as Opus 4.6. But there is a catch that finance teams and engineering leads need to understand before migrating.

Claude Opus 4.7 ships with a new tokenizer. For the same input text, the model consumes 1.0× to 1.35× as many tokens as Opus 4.6, depending on the content type. In other words, a prompt that costs $1.00 today could cost up to $1.35 after migrating, without any changes to the prompt or logic.

The variance is content-dependent. Code-heavy inputs tend to see smaller increases (closer to 1.0×), while natural-language documents — contracts, emails, research papers — can hit the full 1.35× factor. Teams with large-scale document-processing pipelines should benchmark their token usage before switching at scale.
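The cheapest way to measure your own inflation factor is the token-counting endpoint, run over a representative sample of production prompts against both models. A sketch using the Anthropic Python SDK's count_tokens method (the model IDs are placeholders):

```python
import anthropic

client = anthropic.Anthropic()

# Swap in a representative sample drawn from your production traffic.
prompts = [
    "Summarize the termination clauses in the attached contract.",
    "Refactor this repository's query layer to eliminate the N+1 pattern.",
]

def total_input_tokens(model: str) -> int:
    """Sum input-token counts for every sample prompt under the given model."""
    total = 0
    for prompt in prompts:
        count = client.messages.count_tokens(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        total += count.input_tokens
    return total

old = total_input_tokens("claude-opus-4-6")  # placeholder model IDs:
new = total_input_tokens("claude-opus-4-7")  # confirm against the model list
print(f"Tokenizer inflation factor: {new / old:.2f}x")
```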

This is not a criticism of Anthropic’s pricing strategy — new tokenizers typically deliver better model performance per token, and the benchmark gains justify the investment. But enterprise teams operating at thousands of API calls per day need to account for this before signing off on a migration plan. Running a representative sample of prompts through both models and comparing token counts is now a mandatory step before switching production systems.

Where Claude Opus 4.7 Still Trails

Honest benchmark analysis requires acknowledging where Opus 4.7 doesn’t lead. On agentic search — tasks involving web retrieval, multi-source synthesis, and knowledge lookup — GPT-5.4 scores 89.3% compared to Claude Opus 4.7’s 79.3%. That 10-point gap is meaningful for use cases like research agents, competitive intelligence tools, or any workflow that depends heavily on real-time web data.

On computer use via OSWorld-Verified, GPT-5.4 scores 75.0%, exceeding the human baseline of 72.4% — the first general-purpose model to do so. Anthropic has not published a comparable Opus 4.7 score on this benchmark, suggesting the gap is not in Claude’s favor for general UI navigation tasks.

The picture that emerges is nuanced: Claude Opus 4.7 is the right choice for code-heavy and reasoning-heavy workflows. GPT-5.4 is the stronger option for agentic search and broad computer-use automation. Gemini 3.1 Pro remains competitive on science reasoning (94.3% GPQA Diamond) and on multimodal tasks with its 2 million-token context window. Smart deployment strategies will increasingly involve routing specific task types to the model best suited to handle them — an architectural pattern that MCP’s 97 million install milestone makes easier than ever.
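At its simplest, that routing layer is a lookup from task type to the model that benchmarks strongest on it. A deliberately minimal sketch, with illustrative placeholder identifiers rather than real API model IDs:

```python
# Route each task type to the model that leads on it, per the numbers above.
ROUTES = {
    "coding": "claude-opus-4-7",             # SWE-bench Verified: 87.6%
    "document_analysis": "claude-opus-4-7",  # 3.75 MP vision ceiling
    "science": "gemini-3.1-pro",             # GPQA Diamond: 94.3%
    "web_research": "gpt-5.4",               # agentic search: 89.3%
}

def pick_model(task_type: str) -> str:
    """Return the best-fit model for a task type, defaulting to Claude."""
    return ROUTES.get(task_type, "claude-opus-4-7")

print(pick_model("web_research"))  # -> gpt-5.4
```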

Model Benchmark Comparison — April 2026

[Bar chart: white bars show Claude Opus 4.7 on SWE-bench Verified (87.6%), GPQA Diamond (94.2%), SWE-bench Pro (64.3%), and CursorBench (70%); a dimmed bar shows GPT-5.4 on Agentic Search (89.3%), the one benchmark where Claude trails. Sources: Anthropic, Apiyi, Verdent.]

What This Means for Businesses Deploying AI in 2026

Claude Opus 4.7’s release marks an inflection point that extends well beyond the AI research community. For business leaders evaluating AI investments, the upgrade raises three concrete questions.

First: is your current AI coding workflow ready to upgrade? If your team uses Claude via the API or through Claude Code, migrating to Opus 4.7 will improve results on code review, bug fixing, and complex refactors — but only after you account for the tokenizer change. Run your most common prompts through both model versions and compare token counts before setting your new billing budget.

Second: does your use case call for Claude or a competitor? For document analysis, contract review, scientific reasoning, and agentic coding pipelines, Claude Opus 4.7 is now the strongest option publicly available. For workflows that lean heavily on real-time web retrieval and broad computer-use automation, GPT-5.4 is still the better fit. Multi-model architectures — where different task types route to different models — are becoming the norm rather than the exception. If your team is building that kind of infrastructure, AgentsGT provides frameworks specifically designed for multi-model agent orchestration.

Third: are you capturing the multi-cloud flexibility? Claude Opus 4.7 launched simultaneously on Amazon Bedrock, Google Cloud Vertex AI, Microsoft Foundry, and the Anthropic API. This day-one multi-cloud availability is deliberately strategic: it means companies can run Claude in whichever cloud environment already hosts their data, without cross-cloud egress costs or compliance headaches. For regulated industries in particular — financial services, healthcare, legal — this eliminates a major barrier to deploying frontier-class AI.
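Concretely, a team standardized on AWS can send the same request through Bedrock's converse API instead of the Anthropic SDK. A boto3 sketch; the Bedrock model ID is an assumption to confirm in the model catalog for your region:

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="anthropic.claude-opus-4-7-v1:0",  # hypothetical Bedrock model ID
    messages=[{
        "role": "user",
        "content": [{"text": "Flag the indemnification risks in this clause."}],
    }],
    inferenceConfig={"maxTokens": 2048},
)
print(response["output"]["message"]["content"][0]["text"])
```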

The broader context matters here. This release arrived a day before xAI rolled out its Grok 4.3 beta exclusively for $300/month SuperGrok Heavy subscribers — a stark contrast in accessibility strategies. Anthropic's decision to hold Opus 4.7 pricing at $5 per million input tokens while shipping material capability gains signals a different philosophy: push frontier capability down to standard API pricing rather than gate the best features behind premium tiers. For SMBs and scaling startups, that pricing philosophy matters enormously.

If you’re thinking about how Claude Opus 4.7 fits into a broader AI transformation strategy — whether that means AI coding agents, intelligent document workflows, or multi-model orchestration — the DDR Innova team builds production deployments at this level. We’ve been tracking these benchmark shifts in real business contexts, and we’re happy to help you figure out which tools belong in your stack. Reach out at info@ddrinnova.com or book a call to talk through your specific use case.



Frequently Asked Questions

Is Claude Opus 4.7 the best AI model available right now?

Claude Opus 4.7 leads on software engineering and agentic reasoning benchmarks, including SWE-bench Verified (87.6%) and GPQA Diamond (94.2%). However, GPT-5.4 still holds the edge on agentic search (89.3% vs. 79.3%) and computer use. The right model depends on your specific workflow.

How does the new tokenizer in Claude Opus 4.7 affect API costs?

Opus 4.7's new tokenizer encodes the same input text into 1.0x to 1.35x as many tokens as Opus 4.6. With pricing unchanged at $5 per million input tokens, teams running large prompt volumes can see costs rise by up to 35%. Benchmarking a representative prompt sample against both models before migrating is strongly recommended.

What is the task budgets feature in Claude Opus 4.7?

Task budgets let you set a token ceiling for an entire agentic loop—thinking, tool calls, and output combined. The model tracks the budget in real time and wraps gracefully as it approaches the limit. It is currently in beta and requires the task-budgets-2026-03-13 header in the API request.

Where can businesses access Claude Opus 4.7?

Claude Opus 4.7 is available on day one through Anthropic's Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. All four platforms were live on the April 16 launch date, making enterprise multi-cloud deployment straightforward from the start.
