aimode.news
Updates to recent Claude Code quality reports

Over the past month, we've been looking into reports that Claude's responses have worsened for some users. We've traced these reports to three separate changes that affected Claude Code, the Claude Agent SDK, and Claude Cowork. The API was not impacted.

All three issues were fixed as of April 20 (v2.1.116).

In this article, we explain what we found, what we fixed, and what we will do differently to ensure that similar issues are much less likely to occur again.

We take reports of degradation very seriously. We never intentionally degrade our models, and we were able to immediately confirm that our API and inference layer were not affected.

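After investigation, we identified three different issues:

First, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency (long enough that the UI appeared frozen) that some users were seeing in high mode. It was not a good tradeoff. We reverted this change on April 7 after users told us they preferred to default to higher intelligence and opt into lower effort for simple tasks. This impacted Sonnet 4.6 and Opus 4.6.

Second, on March 26 we shipped a change to clear Claude's older thinking from sessions that had been idle for more than an hour. A bug caused the clearing to repeat on every turn for the rest of the session instead of happening just once. We fixed it on April 10 (v2.1.101).

Third, on April 16 we added a system prompt instruction to reduce verbosity. It hurt intelligence in Claude Code, and we reverted it in the April 20 release.

Because each change affected a different slice of traffic on a different schedule, the overall effect looked like broad, inconsistent degradation. Although we began reviewing the reports in early March, they were initially difficult to distinguish from normal variation in user feedback, and neither our internal usage nor our evaluations initially reproduced the issues reported.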

This is not the experience users should expect from Claude Code. As of April 23, we are resetting usage limits for all subscribers.

When we released Opus 4.6 in Claude Code in February, we set the default reasoning effort to high.

Shortly after, we received feedback from users that Claude Opus 4.6 in high effort mode was sometimes thinking too long, causing the UI to appear frozen and leading to disproportionate latency and token usage for those users.

In general, the longer the model thinks, the better the output. Effort levels are Claude Code's way of letting users set this tradeoff: more thinking versus lower latency and fewer usage limit hits. When we calibrate effort levels for our models, we take this tradeoff into account in order to pick points along the test-time-compute curve that give users the best range of options. In the product layer, we then choose which point on this curve to set as the default, and that is the value we send to the Messages API as the effort parameter; we make the other options available via /effort.

In our internal evaluations and tests, medium effort achieved slightly lower intelligence with significantly lower latency for the majority of tasks. It also didn't suffer from the same issues with occasional very long thinking latencies, and it helped users get more out of their usage limits. As a result, we rolled out a change making medium the default effort and explained the rationale via an in-product dialog.

Shortly after deployment, users began reporting that Claude Code felt less intelligent. We shipped a number of design iterations to make the current effort setting clearer and to alert users that they could change the default (a startup notice, an inline effort selector, and the return of ultrathink), but most users kept the medium default.

After hearing feedback from more customers, we reversed this decision on April 7. All users now default to xhigh effort for Opus 4.7, and high effort for all other models.
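The default-plus-override behavior described here can be sketched roughly as follows. This is only an illustration, not Claude Code's actual implementation: the model identifiers, the table, the function name, and the presence of a "low" level are all assumptions; the article only states the current defaults and that /effort exposes the other options.

```python
from typing import Optional

# Assumed set of levels; the article names medium, high and xhigh explicitly.
EFFORT_LEVELS = ("low", "medium", "high", "xhigh")

# Hypothetical model identifiers. Per the article: xhigh for Opus 4.7,
# high for all other models.
DEFAULT_EFFORT_BY_MODEL = {"opus-4.7": "xhigh"}
FALLBACK_DEFAULT = "high"


def resolve_effort(model: str, user_choice: Optional[str] = None) -> str:
    """Return the effort value a client would send with a request."""
    if user_choice is not None:
        if user_choice not in EFFORT_LEVELS:
            raise ValueError(f"unknown effort level: {user_choice!r}")
        # An explicit user choice wins over the product default.
        return user_choice
    return DEFAULT_EFFORT_BY_MODEL.get(model, FALLBACK_DEFAULT)


print(resolve_effort("opus-4.7"))            # product default for Opus 4.7
print(resolve_effort("sonnet-4.6"))          # product default for other models
print(resolve_effort("opus-4.7", "medium"))  # user opted into lower effort
```

Keeping the default in a single table makes it a one-line change, which also means a change like the move to medium reaches every user who has not explicitly picked a level.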

When Claude reasons about a task, that reasoning is normally kept in the conversation history so that on each subsequent turn, Claude can see why it made the changes and tool calls it did.

On March 26, we shipped what was supposed to be an efficiency improvement for this feature. We use prompt caching to make consecutive API calls cheaper and faster for users. Claude writes the input tokens to the cache when it makes an API request; after a period of inactivity, the prompt is evicted from the cache to make room for other prompts. Cache utilization is something we manage carefully (more on our approach).

The design should have been simple: if a session had been idle for more than an hour, we could reduce the cost of resuming it by clearing old thinking sections. Since the request would miss the cache anyway, we could prune unnecessary messages from it to reduce the number of uncached tokens sent to the API. After that one request, we would resume sending the full reasoning history. To do this we used the clear_thinking_20251015 API header with keep:1.

The implementation had a bug. Instead of clearing the thinking history just once, it cleared it on every turn for the rest of the session. After a session crossed the idle threshold once, each request for the remainder of that process would ask the API to keep only the most recent block of reasoning and remove everything before it. This compounded: if you sent a follow-up message while Claude was in the middle of a tool use, it started a new turn under the broken flag, so even the current turn's reasoning was dropped. Claude would continue to execute, but increasingly without memory of why it had chosen to do what it was doing. This showed up as the forgetfulness, repetition, and strange tool choices people reported.
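The difference between the intended one-shot behavior and the sticky behavior can be illustrated with a small simulation. This is a sketch of the class of bug described above, not the actual Claude Code source: the `Session` type, the `clear_pending` flag, and both function names are hypothetical, and the only value taken from the article is the one-hour idle threshold.

```python
from dataclasses import dataclass

IDLE_THRESHOLD_S = 3600  # one hour, per the article


@dataclass
class Session:
    last_request_at: float = 0.0
    clear_pending: bool = False  # hypothetical flag name


def should_clear_buggy(session: Session, now: float) -> bool:
    """Sticky flag: once the session goes stale, every later request clears."""
    if now - session.last_request_at > IDLE_THRESHOLD_S:
        session.clear_pending = True  # set here, but never reset
    session.last_request_at = now
    return session.clear_pending


def should_clear_fixed(session: Session, now: float) -> bool:
    """One-shot: clear only on the first request after the idle gap."""
    stale = now - session.last_request_at > IDLE_THRESHOLD_S
    session.last_request_at = now
    return stale


def run(decide, times):
    """Replay request timestamps and record the clear decision for each."""
    session = Session(last_request_at=times[0])
    return [decide(session, t) for t in times]


# Two quick requests, a two-hour gap, then four requests in quick succession.
TIMES = [0, 10, 7210, 7220, 7230, 7240]
buggy = run(should_clear_buggy, TIMES)
fixed = run(should_clear_fixed, TIMES)
print(buggy)  # [False, False, True, True, True, True]
print(fixed)  # [False, False, True, False, False, False]
```

In the sticky variant, every request after the gap asks for older thinking to be dropped, which matches the "every turn for the rest of the session" symptom. Deriving the decision from the timestamps on each request, rather than from stored state, removes the opportunity to forget a reset.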

Since this continually dropped thinking blocks from subsequent requests, those requests also resulted in cache misses. We believe this is what led to the various reports of usage limits draining faster than expected.
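To see why dropping earlier blocks hurts caching, consider a toy model in which a request is served from cache only up to the longest prefix it shares with the previous request. This is a deliberate simplification for illustration, not a description of how the production cache works; the block names and token counts are invented.

```python
from typing import List, Tuple

Block = Tuple[str, int]  # (block id, token count); both are made up


def cached_tokens(previous: List[Block], current: List[Block]) -> int:
    """Tokens served from cache when only a shared prefix can hit."""
    total = 0
    for prev_block, cur_block in zip(previous, current):
        if prev_block != cur_block:
            break  # the first difference invalidates everything after it
        total += cur_block[1]
    return total


history = [
    ("user-1", 200),
    ("thinking-1", 1500),
    ("tool-1", 800),
    ("thinking-2", 1200),
    ("tool-2", 600),
]

# Healthy turn: the next request appends to an unchanged history.
appended = history + [("user-2", 100)]

# Affected turn: all but the most recent thinking block is stripped first.
stripped = [b for b in history if b[0] != "thinking-1"] + [("user-2", 100)]

print(cached_tokens(history, appended))  # 4300
print(cached_tokens(history, stripped))  # 200
```

In this model, removing a block near the start of the history leaves almost nothing reusable, so nearly the whole request is billed as uncached input. Doing that once after a long idle period costs little, because the cache entry has been evicted anyway; doing it on every turn does not.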

Two unrelated experiments made it difficult to reproduce the issue at first: an internal-only server-side experiment related to message queuing, and an orthogonal change in how we display thinking that suppressed this bug in most CLI sessions, so we didn't catch it even when testing external builds.

This bug sat at the intersection of Claude Code's context management, the Anthropic API, and extended thinking. The changes that introduced it made it past multiple human and automated code reviews, as well as unit tests, end-to-end tests, automated verification, and dogfooding. Combined with the fact that it only occurs in a corner case (stale sessions) and the difficulty of reproducing it, it took us over a week to discover and confirm the root cause.

As part of the investigation, we tested Code Review against the offending pull requests using Opus 4.7. When given the necessary code repositories to gather a complete context, Opus 4.7 found the bug, while Opus 4.6 did not. To prevent this from happening again, we now support additional repositories as context for code reviews.

We fixed this bug on April 10 in v2.1.101.

Our latest model, Claude Opus 4.7, has a notable behavioral quirk compared to its predecessor: as we wrote at launch, it tends to be quite verbose. This makes it smarter on hard problems, but it also produces more output tokens.

A few weeks before the release of Opus 4.7, we started tuning Claude Code in preparation. Each model behaves slightly differently, and we spend time before each release optimizing the harness and product for it.

We have a number of tools to reduce verbosity: model training, prompting, and improving the thinking UX in the product. Ultimately we used all of these, but one addition to the system prompt had an outsized effect on intelligence in Claude Code:

"Length limits: Limit text between tool calls to ≤ 25 words. Limit final responses to ≤ 100 words unless the task requires more detail."

After several weeks of internal testing and no regressions in any of the evaluations we performed, we were confident about the change and shipped it with Opus 4.7 on April 16.

As part of this investigation, we ran more ablations (removing lines from the system prompt to understand the impact of each line) using a broader set of evaluations. One of these evaluations showed a 3% drop for both Opus 4.6 and 4.7. We immediately reverted the prompt change as part of the April 20 release.
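A line-level ablation of the kind described here can be sketched as a leave-one-out loop. The harness below is hypothetical: `ablate` and `toy_evaluate` are illustrative names, and the evaluator is a stand-in with made-up scores in percentage points, not a real evaluation suite or real results.

```python
from typing import Callable, Dict, List


def ablate(lines: List[str], evaluate: Callable[[List[str]], float]) -> Dict[int, float]:
    """Score change when each line is removed (positive: removal helps)."""
    baseline = evaluate(lines)
    deltas = {}
    for i in range(len(lines)):
        without = lines[:i] + lines[i + 1:]  # prompt with line i left out
        deltas[i] = evaluate(without) - baseline
    return deltas


def toy_evaluate(prompt_lines: List[str]) -> float:
    """Stand-in evaluator: pretends one instruction costs 3 points."""
    score = 80
    if any(line.startswith("Length limits:") for line in prompt_lines):
        score -= 3
    return score


prompt = [
    "You are a coding assistant.",      # placeholder line
    "Prefer small, reviewable diffs.",  # placeholder line
    "Length limits: Limit text between tool calls to ≤ 25 words.",
]

deltas = ablate(prompt, toy_evaluate)
for index, delta in deltas.items():
    print(index, delta)
```

Results are keyed by line index rather than line text so that duplicate lines do not overwrite one another. Leave-one-out only measures each line in isolation, so it would not by itself surface effects that depend on combinations of lines.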

We will do several things differently to avoid these issues: we will ensure that more internal staff use exactly the public version of Claude Code (as opposed to the version we use to test new features); and we will make improvements to our code review tool that we use internally and ship this improved version to customers.

We're also adding tighter controls on system prompt changes. We will run a broad suite of per-model evaluations for every system prompt change to Claude Code, continue ablations to understand the impact of each line, and we have built new tooling to make prompt changes easier to review and audit. We've also added guidance to our CLAUDE.md to ensure that model-specific changes are gated to the specific model they target. For any change that could trade off against intelligence, we will add soak periods, a broader evaluation suite, and gradual rollouts so we catch issues earlier.

We recently created @ClaudeDevs on X to give us the opportunity to explain product decisions and the reasoning behind them in depth. We will share the same updates in centralized discussions on GitHub.

Finally, we would like to thank our users: the people who used the /feedback command to share their issues with us (or who posted specific, reproducible examples online) are the ones who ultimately allowed us to identify and resolve these issues. Today we are resetting usage limits for all subscribers.

We are extremely grateful for your feedback and for your patience.
