Postmortem Analysis of Three Issues

Between August and early September, three infrastructure bugs intermittently degraded Claude's response quality. We have now resolved these issues and want to explain what happened.

In early August, some users began reporting degraded responses from Claude. These initial reports were difficult to distinguish from normal variation in user feedback. By late August, the increasing frequency and persistence of the reports prompted us to open an investigation, which uncovered three separate infrastructure bugs.

To state it plainly: we never reduce model quality because of demand, time of day, or server load. The problems our users reported were due to infrastructure bugs alone.

We recognize that users expect consistent quality from Claude, and we maintain a very high bar for ensuring that infrastructure changes do not affect model outputs. In these recent incidents, we did not meet that bar. The following postmortem explains what went wrong, why detection and resolution took longer than we would have wanted, and what we are changing to prevent similar incidents in the future.

We do not usually share this level of technical detail about our infrastructure, but the scope and complexity of these issues justified a more comprehensive explanation.

We serve Claude through our first-party API, Amazon Bedrock, and Google Cloud's Vertex AI. We deploy Claude across multiple hardware platforms: AWS Trainium, NVIDIA GPUs, and Google TPUs. This approach provides the capacity and geographic distribution needed to serve users worldwide.

Each hardware platform has different characteristics and requires specific optimizations. Despite these differences, we hold our model implementations to strict equivalence standards.
Our goal is that users receive responses of the same quality regardless of which platform serves their request. This complexity means that any infrastructure change requires careful validation across all platforms and configurations.

The overlapping nature of these bugs made diagnosis particularly challenging. The first bug was introduced on August 5 and affected approximately 0.8% of requests made to Sonnet 4. Two more bugs arose from deployments on August 25 and 26.

Although the initial impact was limited, a load balancing change on August 29 began to increase the affected traffic. As a result, many more users experienced problems while others continued to see normal performance, which produced confusing and contradictory reports.

Below we describe the three bugs that caused the degradation, when they occurred, and how we resolved them:

On August 5, some Sonnet 4 requests were misrouted to servers configured for the upcoming 1M token context window. This bug initially affected 0.8% of requests. On August 29, a routine load balancing change unintentionally increased the number of short-context requests routed to the 1M context servers. At the worst point, on August 31, 16% of Sonnet 4 requests were affected.

Approximately 30% of Claude Code users who made requests during this period had at least one message routed to the wrong server type, resulting in degraded responses. On Amazon Bedrock, misrouted traffic peaked at 0.18% of all Sonnet 4 requests from August 12 onward. Between August 27 and September 16, incorrect routing affected less than 0.004% of requests on Google Cloud's Vertex AI.

However, some users were affected more severely because our routing is "sticky": once a request was served by the incorrect server, subsequent follow-up requests were likely to be served by the same incorrect server.

Resolution: We fixed the routing logic to ensure that short-context and long-context requests are directed to the correct server pools. We deployed the fix on September 4. The rollout to our first-party platforms and Google Cloud's Vertex AI was completed on September 16, and the rollout to AWS Bedrock was completed on September 18.

On August 25, we deployed a misconfiguration to the Claude API TPU servers that caused an error during token generation. An issue caused by a runtime performance optimization occasionally assigned a high probability to tokens that should rarely be produced in the given context, for example producing Thai or Chinese characters in response to English prompts, or producing obvious syntax errors in code. A small subset of users asking a question in English might, for example, have seen unexpected Thai or Chinese characters in the middle of the response.
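The symptom described above, characters from an unrelated script appearing in a response to an English prompt, is the kind of thing a simple automated check can flag. The following is a minimal sketch of that idea, not the detection test that was actually deployed. The function names and the list of script prefixes are hypothetical, and the check relies only on Python's standard `unicodedata` module.

```python
import unicodedata

# Hypothetical list of Unicode name prefixes treated as "unexpected" when the
# prompt itself does not use them; a real check would need a broader policy.
UNEXPECTED_PREFIXES = ("THAI", "CJK", "HIRAGANA", "KATAKANA", "HANGUL")


def unexpected_chars(text, prefixes=UNEXPECTED_PREFIXES):
    """Return the characters in `text` whose Unicode name starts with a listed prefix."""
    return [ch for ch in text if unicodedata.name(ch, "").startswith(prefixes)]


def looks_corrupted(prompt, response):
    """Flag a response that uses a listed script when the prompt used none of them."""
    if unexpected_chars(prompt):
        # Simplification: if the user wrote in any listed script, do not flag.
        return False
    return bool(unexpected_chars(response))


# Illustrative inputs (escaped so the snippet stays ASCII-only).
thai_word = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35"
prompt = "What is the capital of France?"

print(looks_corrupted(prompt, "The capital " + thai_word + " of France is Paris."))  # True
print(looks_corrupted(prompt, "The capital of France is Paris."))                    # False
```

A check like this only catches script-level corruption. It would not notice the other symptom mentioned above, syntax errors in generated code, which needs a different kind of test.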
This corruption affected requests made to Opus 4.1 and Opus 4 from August 25 to 28, and requests made to Sonnet 4 from August 25 to September 2. Third-party platforms were not affected by this issue.

Resolution: We identified the issue and rolled back the change on September 2. We have added detection tests for unexpected character outputs to our deployment process.

On August 25, we deployed code to improve how Claude selects tokens during text generation. This change inadvertently triggered a latent bug in the XLA:TPU [1] compiler, which has been confirmed to affect requests to Claude Haiku 3.5. We also believe this could have affected a subset of Sonnet 4 and Opus 3 requests on the Claude API. Third-party platforms were not affected by this issue.

Resolution: We first observed the bug affecting Haiku 3.5 and rolled the change back on September 4. We later noticed user reports of problems with Opus 3 that were consistent with this bug, and rolled it back on September 12. After extensive investigation we were unable to reproduce this bug on Sonnet 4, but we decided to roll it back as well out of an abundance of caution.

At the same time, we have (a) been working with the XLA:TPU team on a fix for the compiler bug and (b) rolled out a fix that uses exact top-k with enhanced precision. For details, see the deeper discussion below.

To illustrate the complexity of these issues, here is how the XLA compiler bug manifested and why it was particularly difficult to diagnose.

When Claude generates text, it calculates a probability for each possible next word and then randomly chooses a sample from this probability distribution. We use "top-p sampling" to avoid nonsensical outputs: only words whose cumulative probability reaches a threshold (typically 0.99 or 0.999) are considered. On TPUs, our models run across multiple chips, with probability calculations happening in different locations. To sort these probabilities, we need to coordinate data between chips, which is complex. [2]
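To make the top-p step concrete, here is a minimal single-machine sketch in plain Python. It illustrates the general technique only and is not the production sampler, which is distributed across chips. The function names and the example logits are made up for this illustration.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert raw scores into probabilities (numerically stable form)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_candidates(probs, top_p):
    """Return (token_id, prob) pairs, most probable first, stopping as soon as
    their cumulative probability reaches `top_p`."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for token_id, p in ranked:
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break
    return kept


def sample_top_p(logits, top_p=0.99, rng=random):
    """Sample one token id from the top-p candidate set."""
    kept = top_p_candidates(softmax(logits), top_p)
    ids = [token_id for token_id, _ in kept]
    weights = [p for _, p in kept]
    return rng.choices(ids, weights=weights, k=1)[0]


example_logits = [5.0, 4.0, 0.0, -5.0]  # made-up scores for four tokens
probs = softmax(example_logits)
print([t for t, _ in top_p_candidates(probs, 0.99)])   # [0, 1]
print([t for t, _ in top_p_candidates(probs, 0.999)])  # [0, 1, 2]
```

Note that the candidate set depends on sorting the probabilities correctly. If the most probable token is missing from the ranking, every later step samples from the wrong set.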

In December 2024, we discovered that our TPU implementation would occasionally drop the most probable token when the temperature was zero. We deployed a workaround for this case.

The root cause involved mixed-precision arithmetic. Our models compute next-token probabilities in bf16 (16-bit floating point). However, the vector processor is fp32-native, so the TPU compiler (XLA) can optimize runtime by converting some operations to fp32 (32-bit). This optimization pass is guarded by the `xla_allow_excess_precision` flag, which is enabled by default.

This caused a mismatch: operations that should have agreed on the highest-probability token were running at different precision levels. Because of the precision mismatch, they did not agree on which token had the highest probability, and the highest-probability token sometimes disappeared from consideration entirely.

On August 26, we deployed a rewrite of our sampling code to fix the precision issues and to improve how we handle probabilities at the boundary of the top-p threshold. But in fixing these problems, we exposed a trickier one.

Our fix removed the December workaround because we believed we had solved the root cause. This exposed a deeper bug in the approximate top-k operation, a performance optimization that quickly finds the highest-probability tokens. [3] This approximation sometimes returned completely wrong results, but only for certain batch sizes and model configurations. The December workaround had been inadvertently masking this problem.

The bug's behavior was frustratingly inconsistent. It changed depending on unrelated factors, such as which operations ran before or after it and whether debugging tools were enabled. The same prompt might work perfectly on one request and fail on the next.

While investigating, we also discovered that the exact top-k operation no longer carried the prohibitive performance penalty it once did. We switched from approximate to exact top-k and standardized some additional operations on fp32 precision. [4] Model quality is non-negotiable, so we accepted the minor efficiency impact.

Our validation process ordinarily relies on benchmarks alongside safety evaluations and performance metrics. Engineering teams perform spot checks and deploy to small "canary" groups first.

These issues exposed critical gaps that we should have identified earlier.
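The mixed-precision mismatch described above can be reproduced in miniature. The sketch below is an illustration under stated assumptions, not the actual sampling code and not XLA behavior: it emulates bf16 rounding in plain Python, uses made-up logits, and assumes a hypothetical filter that keeps tokens scoring at least as high as the top score. When the top score is found in fp32 but the comparison runs on bf16 values, the top token drops out.

```python
import struct


def to_fp32(x):
    """Round a Python float (64-bit) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round a finite float to the nearest bf16 value (ties to even).

    bf16 keeps only the top 16 bits of the fp32 encoding, so about
    8 significant bits of precision survive.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def keep_at_or_above(scores, threshold):
    """Hypothetical filter: indices of tokens whose score is at least `threshold`."""
    return [i for i, s in enumerate(scores) if s >= threshold]


logits = [10.01, 10.02, 3.0]         # made-up scores; token 1 is the true top token
fp32 = [to_fp32(v) for v in logits]
bf16 = [to_bf16(v) for v in logits]  # 10.01 and 10.02 both round to 10.0

top_score = max(fp32)                # the "find the top score" step, run in fp32

print(bf16)                               # [10.0, 10.0, 3.0]
print(keep_at_or_above(fp32, top_score))  # [1]  same precision: top token kept
print(keep_at_or_above(bf16, top_score))  # []   mixed precision: top token dropped
```

Running both steps at the same precision removes the disagreement, which is consistent with the decision described above to standardize some operations on fp32.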
The evaluations we ran did not capture the degradation users were reporting, in part because Claude often recovers well from isolated mistakes. Our own privacy practices also created challenges in investigating reports. Our internal privacy and security controls limit how and when engineers can access user interactions with Claude, particularly when those interactions are not reported to us as feedback. This protects user privacy, but it prevents engineers from examining the problematic interactions needed to identify or reproduce bugs.

Each bug produced different symptoms on different platforms at different rates. This created a confusing mix of reports that did not point to any single cause. It looked like random, inconsistent degradation.

More fundamentally, we relied too heavily on noisy evaluations. Although we were aware of an increase in reports online, we lacked a clear way to connect them to each of our recent changes. When negative reports spiked on August 29, we did not immediately connect them to an otherwise standard load balancing change.

As we continue to improve our infrastructure, we are also improving the way we evaluate for and prevent bugs like those discussed above across all platforms where we serve Claude. Here is what we are changing:

Evaluations and monitoring are important. But these incidents have shown that we also need continuous signal from users when Claude's responses fall short of the usual standard. Reports of specific changes observed, examples of unexpected behavior encountered, and patterns across different examples all helped us isolate the issues.

It remains particularly helpful for users to keep sending us feedback directly. You can use the `/bug` command in Claude Code, or the "thumbs down" button in the Claude apps. Developers and researchers often create new and interesting ways to evaluate model quality that complement our internal testing. If you would like to share yours, contact feedback@anthropic.com.

We remain grateful to our community for these contributions.

By Sam McAllister, with thanks to Stuart Ritchie, Jonathan Gray, Kashyap Murali, Brennan Saeta, Oliver Rausch, Alex Palcuie, and others.

[1] XLA:TPU is the optimizing compiler that translates XLA's high-level optimization language (often written using JAX) into TPU machine instructions.

[2] Our models are too large for a single chip and are partitioned across tens of chips or more, which makes our sorting operation a distributed sort. TPUs (like GPUs and Trainium) also have different performance characteristics than CPUs and require different implementation techniques that use vectorized operations instead of serial algorithms.

[3] We had been using this approximate operation because it yielded substantial performance improvements. The approximation works by accepting potential inaccuracies in the lowest-probability tokens, which should not affect quality unless a bug causes it to drop the highest-probability token instead.

[4] Note that the now-correct top-k implementation may result in slight differences in which tokens near the top-p threshold are included, and in rare cases users may benefit from re-tuning their choice of top-p.

![Postmortem Analysis of Three Issues](https://cdn.sanity.io/images/4zrzovbb/website/412be842c5c6bae6b4bcd515c191b0aa5015e05f-2400x1260.png)