When Claude changed, everything changed: AI blast radius management in production

Our system did one thing, and did it well: it converted natural-language questions into API calls. Its users were analysts, account managers and executives. They knew what data they needed, but compiling it by hand meant working through four dashboards, two BI tools and a Salesforce reporting tool. With our system, they typed the request in plain English. A query such as “Create a report on the sales volume for January to March 2026 for the Northeast region, broken down by city” was translated into an API call that the system could act on:

```json
{
  "description": "The user has requested the sales volume for the specified period. Here is the API call to get the answer.",
  "api_call": "/api/sales_volume",
  "post_body": {
    "start_date": "01.01.2026",
    "end_date": "31.03.2026",
    "region": "Northeast"
  }
}
```

The rest of the pipeline was conventional. The system forwarded the call to the right backend (we had integrations with internal reporting portals, Salesforce and several in-house services), applied the LLM-generated JSON request to filter and shape the response, and delivered the result by email, as a Drive document or as a chart in the browser.

By mid-2025, the system was generating several hundred reports per month. These reports were used by executives and analysts and distributed to external stakeholders. For most teams, it had become the standard way to retrieve ad hoc data.

The contract between the LLM and the rest of the system was a structured JSON object with three fields (description, api_call and post_body), as shown in the example above.
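To make the idea of a contract concrete, the following is a minimal sketch of checking that three-field shape at the boundary before any backend call is made. It is an illustration under stated assumptions, not the system described in this article: `ApiRequest`, `ContractError` and `parse_contract` are hypothetical names, and the field names are taken from the example above.

```python
import json
from dataclasses import dataclass


class ContractError(ValueError):
    """Raised when a model response does not match the expected contract."""


@dataclass(frozen=True)
class ApiRequest:
    # Hypothetical container for the three fields of the contract.
    description: str
    api_call: str
    post_body: dict


def parse_contract(raw: str) -> ApiRequest:
    """Parse raw model output and check the shape of the three fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Free text, such as a clarifying question, ends up here.
        raise ContractError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ContractError("response must be a JSON object")
    expected = {"description": str, "api_call": str, "post_body": dict}
    for field, field_type in expected.items():
        if not isinstance(data.get(field), field_type):
            raise ContractError(
                f"field {field!r} must be of type {field_type.__name__}"
            )
    return ApiRequest(data["description"], data["api_call"], data["post_body"])
```

A check like this only covers the shape of the response. Whether the values inside each field make sense is a separate question.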

We started building in 2025 on Claude Sonnet 3.5. We upgraded to 3.7 without incident, and to 4.0 without incident. By the time Sonnet 4.5 shipped, we had come to trust the stability and predictability of LLMs for a problem we believed was simple. Model upgrades had become routine, like bumping the minor version of a well-behaved library.

Then we rolled out 4.5. For a meaningful share of requests, the model began folding the contents of post_body into the description field. Two failure modes followed.

First, the filter parameters never reached the API. Our system read post_body as the source of truth for the request payload, and that field was now empty. The API call went out with no date range and no region filter. Depending on the API being called, the backend either returned the sales volume for all time or for all regions, or returned a 500 error.

Second, the model began asking clarifying questions in its response. That was new. Earlier versions always took a best-effort approach to an ambiguous request and returned a structured object. Sonnet 4.5 was more cautious and sometimes replied with a question. Our system had no way to handle this. It was built on the assumption that every model call would result in an API call. There was no human-in-the-loop component and no state for holding a partially completed request. Downstream systems broke in a variety of ways.

We rolled back to 4.0. That was harder than it should have been: between the 4.0 and 4.5 deployments, our team had added new API integrations that were all tuned for 4.5. Rolling the model back meant requalifying each of them against 4.0 under time pressure.

Why traditional engineering discipline fails here

Software engineering rests on the ability to bound the impact of a change. When you upgrade a driver or a library, you read the release notes to see whether breaking changes are expected. Unit tests bound what might have shifted. You are relying on one property: the system being changed is deterministic enough that its behavior can be predicted, or at least sampled densely enough that you can rely on it. The blast radius is bounded by construction.

LLM-based systems break this assumption. The component that generates your output is not under your control. You cannot diff a model version bump from 4.0 to 4.5. It is a complete replacement of the function your system depends on. This is what we mean by an infinite blast radius: a change whose downstream effects cannot be enumerated in advance, because both the input space (natural language) and the failure modes (anything the model might do differently) are unbounded.

Anatomy of failure

The postmortem showed that our prompt had always been underspecified. We had told the model to return a JSON object with three fields. We had described what each field was for. We had not explicitly stated that the description must be a natural-language string and must not contain serialized representations of other fields.

Previous versions of the model inferred this constraint from context. Sonnet 4.5 was evidently better at being “helpful” in its formatting choices, and concluded that the response was more useful when it asked for clarification or included the request text in the description. From the model's perspective, this was a reasonable interpretation of an ambiguous instruction. It nevertheless violated the assumptions our system was built on.

The error was not in the model. The error was our assumption that the model would keep filling our specification gaps the way it had before. Three successful upgrades had taught us to believe those gaps were safe.

Structured output modes and tool APIs would have caught this specific error at the schema level. We did not use them, for technical reasons outside the scope of this article. But schemas only constrain syntax, not semantics. A schema cannot specify that a clarifying question must not be emitted in a system that has no path for clarification, or that a date range must never silently default to “all time.” Schemas solve the easier half of the problem.

The evals-first architecture

The discipline that closes this gap is to treat the evaluation suite, not the prompt, as the formal specification of the system. The prompt is an implementation of the specification. The model is an interpreter. The evaluations are the specification itself, and any model or prompt change is valid only if it passes them.

In practice, an evaluation is a triple: an input, a property the output must satisfy, and an evaluation function. For our system, the evaluation that would have caught the 4.5 regression looks like this:

```python
def description_contains_no_serialized_payload(response):
    # The description must be plain natural language, so none of these
    # markers of structured content may appear in it.
    desc = response["description"].lower()
    prohibited = ["curl", "post_body", "{", "http://", "https://"]
    assert not any(token in desc for token in prohibited), \
        f"Description has leaked structured content: {response['description']}"
```
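A property like this can be packaged into the triple described above and run as a pass/fail check across many inputs. The following is a minimal sketch under stated assumptions, not the harness used by the authors: `EvalCase`, `run_suite`, `post_body_has_filters` and the `min_pass_rate` threshold are hypothetical names, and `call_model` stands in for whatever function returns the parsed model response for a prompt.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str                     # the natural-language input
    check: Callable[[dict], None]   # the property; raises AssertionError if violated


def post_body_has_filters(response: dict) -> None:
    # Hypothetical property: the filters must arrive in post_body,
    # so an empty or missing post_body counts as a failure.
    body = response.get("post_body")
    assert isinstance(body, dict) and body, "post_body is missing or empty"


def run_suite(cases, call_model, min_pass_rate=1.0):
    """Evaluation function: run every case and return (gate_passed, failures)."""
    failures = []
    for case in cases:
        try:
            case.check(call_model(case.prompt))
        except (AssertionError, KeyError, TypeError) as exc:
            failures.append((case.prompt, repr(exc)))
    # An empty suite never passes the gate.
    pass_rate = 1 - len(failures) / len(cases) if cases else 0.0
    return pass_rate >= min_pass_rate, failures
```

With `min_pass_rate=1.0`, a single violated property blocks the change. A lower threshold is one possible way to tolerate noisy checks, at the cost of letting some regressions through.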

A few hundred such properties become a gate: some written by hand for known critical invariants, others generated as regression tests from real production traffic, and some scored by an LLM-as-judge for fuzzier qualities such as tone. Model upgrades and prompt changes should be treated as pull requests that must turn the suite green before they are merged.

Evals are expensive to build and maintain. They drift as your product changes. LLM-as-judge scoring introduces its own variance into the results. And the suite can only detect failure modes you have anticipated: you cannot evaluate your way to safety against a category of failure you never imagined. We learned this lesson the hard way. No one on our team had ever written an assertion saying “The description field should not contain a curl command,” because no one thought the model would put one there.

Evaluations are not a cure-all. They let you bound the blast radius of a change in the only way available when the underlying function is a black box: by densely sampling the input-output behavior that actually matters to you, and refusing to ship when that behavior changes.

The Roadmap

The engineering community still has to build up a body of knowledge for writing effective evaluations. There are no generally accepted standards for what “coverage” means over natural-language input spaces. CI/CD systems were not designed to handle probabilistic test results. As agents take on more autonomous work (writing code, moving money, planning infrastructure changes), the gap between “the model passed our smoke tests” and “we know what this system will do in production” becomes the central engineering problem of the next few years. The teams that close this gap will be the ones that stop treating evaluations as a QA afterthought and start treating them as the actual specification of their system.

Vijay Sagar Gullapalli is a founding AI engineer at Adopt AI and a USPTO-patented inventor. Sarat Mahavratayajula is a Senior Software Engineer at Sherwin-Williams.

Welcome to the VentureBeat community! In our guest contribution program, technical experts share insights and provide neutral, non-vested deep dives on AI, data infrastructure, cybersecurity and other cutting-edge technologies shaping the future of enterprise.

![When Claude changed, everything changed: AI blast radius management in production](https://images.ctfassets.net/jdtwqhzvc2n1/FUBn4ZqFiOMRJs1UcjN1O/fdd8bca8d1fc308cd3b42db61a981138/u7277289442_A_mushroom_cloud_of_data_is_exploding_against_a_d_859177f5-e487-44a4-a545-407c10edd87e_0.png?w=800&q=75)