Claude Opus 4.8 Released, Anthropic Starts to Make "Trustworthiness" a Selling Point

Bitsfull | 2026/05/29 18:11

Summary:

Enhanced Self-checking and Multi-agent Orchestration for a More Trustworthy Model


Editor's Note: Anthropic has released Claude Opus 4.8, which takes first place in five of six core benchmarks while keeping its price unchanged. Claude Code has introduced dynamic workflows, and the next-generation Mythos-class model is also on the horizon.


Beyond mere performance improvements, what is more noteworthy about this release is that Anthropic has begun to shape "trustworthiness" as a key selling point of cutting-edge models.


In a code honesty test, Opus 4.8 sharply reduced the rate at which it fails to flag its own errors; in Claude Code, it can schedule multiple sub-agents and run adversarial self-checks before delivering results. These changes point to the same real-world issue: as AI moves from the chat window into real workflows, what users fear most is not that the model cannot complete a task, but that it keeps producing an answer that looks complete, smooth, and internally consistent even when it is wrong.


The significance of Opus 4.8 therefore goes beyond a model upgrade. It signals a clear industry shift: competition among cutting-edge models is moving from pure benchmark chasing toward reliability, verifiability, and the ability to expose errors. For businesses and professional users, the next threshold for AI will increasingly be whether the model deserves trust.


This is also the real prerequisite for Agents to become truly usable. Models need to accomplish more tasks, and they also need to give users the confidence to entrust them with more important and complex ones.


The following is the original text:


Anthropic today released Claude Opus 4.8. In the six benchmark tests listed on the release card, it claimed the first place in five of them.


The change that caught my attention most is in Anthropic's code summarization honesty test: Opus 4.7 failed to flag its own errors in 19.7% of cases, while for Opus 4.8 that rate has dropped to 3.7%. On the same task, it is roughly five times less likely to let an error in its own work go unflagged. Anthropic summarized this as a "fourfold" improvement in the announcement. However you calculate it, this is a critical factor in deciding whether you can hand real work to this model and walk away with peace of mind, and it matters more than any benchmark score on the release card.
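The "fivefold" versus "fourfold" gap comes down to which ratio you take. The short sketch below works through the two rates quoted above. The function names are illustrative, and the second formula is only one possible reading that lands near four; the article does not say how Anthropic computed its figure.

```python
# Rates quoted in the article: share of cases where the model
# failed to flag errors in its own code summary.
OPUS_47_MISS_RATE = 0.197
OPUS_48_MISS_RATE = 0.037


def ratio_of_rates(old_rate: float, new_rate: float) -> float:
    """How many times smaller the new miss rate is than the old one."""
    return old_rate / new_rate


def reduction_relative_to_new(old_rate: float, new_rate: float) -> float:
    """Drop in miss rate expressed as a multiple of the new rate.

    Assumption: this is just one way to arrive at "about four";
    it is not confirmed as Anthropic's method.
    """
    return (old_rate - new_rate) / new_rate


def relative_reduction(old_rate: float, new_rate: float) -> float:
    """Fraction of previously unflagged errors that are now eliminated."""
    return (old_rate - new_rate) / old_rate


print(f"ratio of rates:      {ratio_of_rates(OPUS_47_MISS_RATE, OPUS_48_MISS_RATE):.1f}x")
print(f"drop vs. new rate:   {reduction_relative_to_new(OPUS_47_MISS_RATE, OPUS_48_MISS_RATE):.1f}x")
print(f"relative reduction:  {relative_reduction(OPUS_47_MISS_RATE, OPUS_48_MISS_RATE):.0%}")
```

Either way, about four out of five previously unflagged errors are now caught on this test.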



What Was Actually Released


Let’s start with the TL;DR before diving into specific numbers:


Reliability has truly improved. In addition to the code honesty metric mentioned above, Opus 4.8 is also the first Claude model to score a literal zero on two due diligence tests: it cut the rate of "error reporting flawed results" from 0.25 to 0.00 and the occurrence of "lazy investigations" from 25% to 0%. Overconfident wrong answers fell by about 11x. The self-favoring bias that was measurable in 4.7 has disappeared.


Claude Code now includes dynamic workflows as a research preview. Claude writes its own orchestration scripts, schedules dozens to hundreds of sub-agents in parallel within a single session, and runs independent adversarial agents that try to rebut the results before they are presented to you. This builds on the "Agent Teams" concept introduced with Opus 4.6, now automated.


It leads on its own release card, but not across the board. It won five of six categories; GPT-5.5 still leads in terminal operation tasks. The system card also contains some honestly reported regressions that Anthropic did not put on the slides, detailed below.


Pricing remains unchanged. It's still $5 per million input tokens, $25 per million output tokens, the same as 4.7. However, Fast Mode is now three times cheaper than before, although it's still in the premium tier, priced at $10 / $50.


Mythos is on the horizon. Anthropic made it clear that the Mythos-class model with restricted access and high capability will arrive in the coming weeks. Opus 4.8 is the gateway to it.


Official Release Card: Benchmark Landscape


Below is the official release card presented in our color scheme.



One item broke the sweep, and it is a significant one. In Terminal-Bench 2.1, which tests whether models can complete long-horizon agent tasks through a terminal, GPT-5.5 still leads with 78.2% against Opus 4.8's 74.6%. Anthropic showed this loss on its release card rather than hiding it. The "Agent vs. Craftsman" divide we described at the GPT-5.5 launch has not completely closed: GPT-5.5 remains the stronger pure terminal operator, while Opus 4.8 behaves more like the stronger engineer on most tasks that matter to professional users, such as real-world coding, expert reasoning, computer use, and knowledge work.


Beyond the Scorecard


The Scorecard only showcased six benchmarks. The 244-page System Card reported over 40 tests, with the most interesting results not on the slide deck. Several standout points include:


A 27-point increase in mathematical ability. In USAMO 2026 held in March this year, Opus 4.8 scored 96.7%, up from 69.3% in 4.7. As the competition took place after Opus 4.8's training cut-off, there was no data contamination issue. This marks the largest intergenerational leap on the entire card.


A widening edge in long-context scenarios. In a million-token image reasoning test, Opus 4.8 scored 68.1 compared to 40.3 for 4.7 and 45.4 for GPT-5.5. The longer the context, the more challenging the task, the more pronounced its lead.


The true crowning achievement lies in multi-Agent work. On web research tasks, a single Opus 4.8 Agent trails Gemini, scoring 84.3 against 85.9. But when an orchestrator schedules a group of sub-Agents, the score reaches 88.5, the highest reported result; a team of five Agents can also match the best single-Agent score in one-fifth of the time. This is the dynamic workflow feature showing up in benchmark results.


A step change in token efficiency. On the most challenging coding test, Opus 4.8 at its lowest effort setting outperforms 4.7 at its highest effort setting. In other words, you can get the previous peak performance at a lower token cost.


It has crossed thresholds no model has crossed before. In Harvey's Legal Agent Benchmark, success is only achieved when all performance criteria in a task are met. Opus 4.8 is the first model to rank first on this "all-pass" standard. It passed 89% of individual criteria, but the complete task pass rate is only 9.6%, underscoring how demanding real legal work can be.


There are also regressions, honestly disclosed. Three things are genuinely worse than in 4.7, as Anthropic acknowledges in the system card. GPQA Diamond, an expert science test, slipped from 94.2 to 93.6. Refusal behavior and resistance to prompt injection in computer use scenarios have regressed, making 4.8 more susceptible to manipulation in Agent scenarios. And in a year-long simulated business test, it ended with only one-third of the cash that 4.7 had left. None of these appeared on the scorecard, which makes them all the more noteworthy.
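The gap between the 89% per-criterion pass rate and the 9.6% all-pass rate reported for the legal benchmark above is what compounding looks like. The sketch below is a back-of-the-envelope illustration only: it assumes every criterion passes independently with the same probability, which the article does not claim, and the number of criteria per task is an inferred quantity, not a published one.

```python
import math


def all_pass_rate(per_criterion_rate: float, criteria_per_task: int) -> float:
    """Probability that every criterion in a task passes.

    Assumption: criteria are independent and equally likely to pass.
    """
    return per_criterion_rate ** criteria_per_task


def implied_criteria_count(per_criterion_rate: float, task_pass_rate: float) -> float:
    """Number of independent criteria that would turn the per-criterion
    rate into the observed all-pass rate (solves p ** n = task_pass_rate)."""
    return math.log(task_pass_rate) / math.log(per_criterion_rate)


# Figures quoted in the article for Opus 4.8.
PER_CRITERION = 0.89
ALL_PASS = 0.096

n = implied_criteria_count(PER_CRITERION, ALL_PASS)
print(f"implied criteria per task: about {n:.0f}")
print(f"all-pass rate with 20 criteria: {all_pass_rate(PER_CRITERION, 20):.3f}")
```

Under that simplified model, roughly twenty criteria per task would be enough to pull an 89% per-criterion rate down to single digits, which is why an "all-pass" standard is so much harder than the individual scores suggest.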


Where Does It Stand Compared to Open-Weight Models


The release card only compared Opus 4.8 with other closed-source cutting-edge models. If we widen the view to the inexpensive open-weight models that many teams are currently testing, the picture is almost a microcosm of the 2026 AI industry: Opus 4.8 leads in capability, but the gap between it and free, self-hostable models is now only a few percentage points, while the price difference is enormous.



The chart includes a full comparison of eight models. DeepSeek's price reflects its permanent 75% discount; Qwen Max's price has not been disclosed yet.


Opus 4.8 wins the coding benchmark outright. However, the open-source model Qwen3.7-Max, which you can run yourself, scores 60.6, trailing by only about 9 points. DeepSeek V4-Pro scores 55.4, and its output price is only about one-thirtieth of Opus's. For the highest-stakes engineering tasks, the $25 per million output tokens is worth paying. For a large share of daily work, that premium is becoming increasingly hard to justify. This is the calculation every serious team is now making.
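That calculation can be made concrete with the prices quoted in this article. The sketch below uses the stated Opus 4.8 rates ($5 / $25 per million tokens, $10 / $50 in Fast Mode). The open-model output price is not a published figure: it is derived from the "about one-thirtieth" ratio above, and the workload sizes are made up for illustration.

```python
# Prices in USD per million tokens, as quoted in the article.
OPUS_STANDARD = {"input": 5.00, "output": 25.00}
OPUS_FAST = {"input": 10.00, "output": 50.00}

# Assumption: derived from "approximately one-thirtieth" of Opus's
# output price; not an official price.
OPEN_MODEL_OUTPUT_PRICE = OPUS_STANDARD["output"] / 30


def request_cost(input_tokens: int, output_tokens: int, prices: dict) -> float:
    """Cost in USD of a workload at the given per-million-token prices."""
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000


def output_cost(output_tokens: int, price_per_million: float) -> float:
    """Cost in USD of the output tokens alone."""
    return output_tokens * price_per_million / 1_000_000


# Hypothetical job: 2M input tokens, 500k output tokens.
job_in, job_out = 2_000_000, 500_000
print(f"Opus standard: ${request_cost(job_in, job_out, OPUS_STANDARD):.2f}")
print(f"Opus Fast Mode: ${request_cost(job_in, job_out, OPUS_FAST):.2f}")

# Hypothetical monthly volume: 200M output tokens.
monthly_out = 200_000_000
print(f"Opus output only: ${output_cost(monthly_out, OPUS_STANDARD['output']):,.2f}")
print(f"Open model output only: ${output_cost(monthly_out, OPEN_MODEL_OUTPUT_PRICE):,.2f}")
```

The per-token premium looks small on a single job and large at volume, which is why the answer tends to differ between a handful of high-stakes tasks and a high-throughput daily pipeline.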


What Does This Mean for You


If you are using Opus 4.7, this is a free upgrade. The price is the same, the numbers are better, and its judgment of its own output is significantly more reliable. Just make the switch.


The more interesting question is: what work are you now willing to hand over to it? Every reader has a line in their mind that separates "tasks I can let AI do" from "tasks I must do myself because I still cannot trust delegation." The reliability improvement in 4.8 means you can move that line forward. The model is better at pointing out its own uncertainty, which lowers the cost of "silent error delegation" and widens the range of tasks worth entrusting to it. This is the practical meaning of the honesty data, and it matters more than any single score.


This also echoes what we wrote about last week. Anthropic's own AI Fluency research found that when the model's output looks polished and complete, people are significantly less likely to notice missing context. The answer looks already finished, so we stop checking. Opus 4.8 attacks this failure mode from the model side: it is better at telling you where a seemingly clean and complete answer may still have vulnerabilities. It cannot replace your judgment, but it can provide leverage for your judgment.


If you're using Claude Code, this week you can take on a truly large task to try out a dynamic workflow, such as a migration or a comprehensive check of a large number of files, while also paying attention to the token meter. This capability is real, and adversarial self-check is also key to making the output more trustworthy. But the cost is real too. This tool is designed for tasks that are difficult for a single Agent to complete and should not be your everyday default option.
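To make the pattern concrete, here is a minimal sketch of "fan out to sub-agents, then challenge each result before reporting it." It is not Claude Code's implementation or API. `run_subagent` and `adversarial_review` are hypothetical stand-ins that return canned values, and the "risky" keyword rule exists only so the example has something to flag.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Finding:
    task: str
    answer: str
    challenged: bool = False


async def run_subagent(task: str) -> Finding:
    """Hypothetical sub-agent; a real one would call a model here."""
    await asyncio.sleep(0)  # placeholder for real work
    return Finding(task=task, answer=f"result for {task}")


async def adversarial_review(finding: Finding) -> Finding:
    """Hypothetical reviewer that tries to rebut a finding.

    Toy rule for illustration: anything mentioning "risky" is challenged.
    """
    await asyncio.sleep(0)
    finding.challenged = "risky" in finding.task
    return finding


async def orchestrate(tasks: list[str], max_parallel: int = 8) -> list[Finding]:
    """Run sub-agents in parallel (bounded), reviewing each result."""
    limit = asyncio.Semaphore(max_parallel)

    async def worker(task: str) -> Finding:
        async with limit:
            finding = await run_subagent(task)
            return await adversarial_review(finding)

    # gather preserves the order of the input tasks.
    return await asyncio.gather(*(worker(t) for t in tasks))


findings = asyncio.run(orchestrate(["migrate module a", "risky schema change"]))
for f in findings:
    status = "needs human review" if f.challenged else "accepted"
    print(f"{f.task}: {status}")
```

The `max_parallel` cap is the knob that corresponds to the token meter: every extra sub-agent and every review pass is additional model usage, which is why this pattern suits large tasks rather than everyday ones.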


Next: Mythos, Arriving in a Few Weeks


The most forward-looking statement in this release is not actually about 4.8. Anthropic has announced that the Mythos-level model will arrive in the next few weeks, positioning Opus 4.8 as a public step towards it.


You need to understand what this means. Mythos is a restricted cutting-edge model that Anthropic has been benchmarking internally, and it surpasses Opus 4.8 on almost every metric. It reaches 93.9% on SWE-bench Verified, and in cybersecurity testing it can generate working exploits against most targets in current browsers, where Opus 4.8 succeeds less than 10% of the time. Until now it has been open to only 52 vetted institutions, priced at five times standard Opus, and treated as infrastructure rather than a regular product.


Therefore, when a more powerful Mythos-level model lands in the next few weeks, it should be understood through a "two-tier market" framework: one is the commodified layer, Opus 4.8, widely open, unchanged in price, increasingly competing with free open-source models; the other is the controlled cutting-edge layer, Mythos, expensive, with limited access. These two are not separate products but different levels on the same continuous capability line. The reliability work in 4.8 is what you must build before the real goal of "running the model with less supervision." And this goal is now not a few quarters away, but a few weeks.


Background: How We Got to This Point


If you've lost track of the rhythm of the past four months, here's a way to understand it: Opus 4.6 in February brought Agent Teams, Sonnet 4.6 brought the price collapse, Opus 4.7 in April brought a leap in reasoning, and Mythos was the faintly visible restricted ceiling next door. Opus 4.8 ties these two threads together: it continues the narrative arc of 4.6 while also serving as the gateway to Mythos.


This release cadence is itself the key fact hidden beneath all the surface changes. The flagship model has progressed from 4.5 to 4.6 to 4.7 within a few months, and the model your team standardizes on today may not be the one you are actually running by autumn. This is why, rather than investing in mastery of one specific model, it is more important to invest in skills that carry over across models, such as clear delegation and rigorous validation.


The benchmark sweep is what will get screenshotted and shared. But the real change is smaller and more crucial: this is the first Claude version whose core selling point is no longer just "it is smarter," but "you can entrust more things to it." Before Agents become truly useful, the entire industry must move in this direction, and this is also the capability that is hardest to fit into a single chart.


Where is your boundary now? Which tasks are you willing to delegate to the model, and which ones still have to be done by yourself? What needs to happen to make you willing to push this boundary further?





