

Agentic coding benchmarks such as SWE-bench and Terminal-Bench are commonly used to compare the software engineering capabilities of frontier models, and the top spots on their leaderboards are often separated by only a few percentage points. These scores are generally treated as precise measurements of relative model capability, and they increasingly inform decisions about which models to deploy. We find, however, that infrastructure configuration alone can produce differences that exceed those margins. In internal experiments, the gap between the most and least resourced setups on Terminal-Bench 2.0 was 6 percentage points (p < 0.01).

Static benchmarks score the model's output directly: the runtime environment does not affect the result. Agentic coding evals are different: the model is given a complete environment in which it writes programs, runs tests, installs dependencies, and iterates over multiple turns. The runtime is no longer a passive container but an integral part of the problem-solving process. Two agents with different resource budgets and time limits are not taking the same test.

Eval developers have begun to account for this. Terminal-Bench, for example, specifies recommended CPU and RAM for each task in its latest 2.0 release. However, specifying resources is not the same as enforcing them consistently. Moreover, we found that how the limits are enforced can change what the benchmark actually measures.

We run Terminal-Bench 2.0 on a Google Kubernetes Engine cluster. While calibrating the setup, we noticed that our scores did not match the benchmark's official leaderboard and that the infrastructure error rate was surprisingly high: up to 6% of tasks failed because of pod errors, most of which had nothing to do with the model's ability to solve the task.

The difference in scores came down to enforcement. Our Kubernetes implementation treated the per-task resource specs as both a floor and a hard ceiling: each container was guaranteed the specified resources but was killed as soon as it exceeded them. Container runtimes enforce resources through two separate parameters: a guaranteed allocation (the resources reserved up front) and a hard limit at which the container is terminated. When the two are set to the same value, there is zero headroom for transient spikes: a momentary memory fluctuation can kill a container that would otherwise have succeeded. By contrast, the Terminal-Bench leaderboard uses a different sandboxing provider whose enforcement is more lenient: it allows temporary over-allocation without terminating the container, in favor of infrastructure stability.
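
The two-parameter behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not the configuration used in the experiments: the helper name `container_resources` and the `headroom` multiplier are hypothetical, and only the `requests` and `limits` field names come from the Kubernetes container spec.

```python
def container_resources(cpu_cores: float, memory_mib: int, headroom: float = 1.0) -> dict:
    """Build a Kubernetes-style `resources` block for one task container.

    requests: the guaranteed allocation (the floor).
    limits:   the hard cap. Exceeding the memory limit gets the container
              OOM-killed; exceeding the CPU limit gets it throttled.
    headroom: hypothetical multiplier applied to the task spec to get the cap.
    """
    if headroom < 1.0:
        raise ValueError("headroom must be >= 1.0 so limits never fall below requests")
    return {
        "requests": {
            "cpu": f"{round(cpu_cores * 1000)}m",
            "memory": f"{memory_mib}Mi",
        },
        "limits": {
            "cpu": f"{round(cpu_cores * headroom * 1000)}m",
            "memory": f"{int(memory_mib * headroom)}Mi",
        },
    }


# Strict enforcement (1x): floor and ceiling coincide, so there is no margin
# for a transient memory spike.
strict = container_resources(cpu_cores=1, memory_mib=2048)

# 3x headroom: same guaranteed allocation, but the kill threshold is higher.
relaxed = container_resources(cpu_cores=1, memory_mib=2048, headroom=3.0)
```

With `headroom=1.0` the two sub-blocks are identical, which is the zero-margin case. Raising `headroom` leaves the guaranteed allocation untouched and only moves the kill threshold.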

This finding raised a larger question: how much does resource configuration affect eval scores?

To quantify the effect of the scaffolding, we ran Terminal-Bench 2.0 across six resource configurations, ranging from strict enforcement of the per-task specs (1x), where the specs act as both floor and ceiling, to completely uncapped. Everything else stayed the same: the same Claude model, the same harness, the same task set.

In our experiments, the success rate increased with resource headroom. This was driven mainly by the infrastructure error rate, which fell monotonically at each step, from 5.8% under strict enforcement to 0.5% when uncapped. The drop between strict enforcement and 3x headroom (5.8% to 2.1%) was significant at p < 0.001. With more headroom, fewer containers are killed for exceeding their allocation.
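
To make the significance claim concrete, here is a minimal sketch of one common way to compare two failure rates, a pooled two-proportion z-test. The post does not say which test was used or how many trials each configuration ran, so both the choice of test and the trial counts in the example are assumptions for illustration only.

```python
import math


def two_proportion_z_test(fail_a: int, n_a: int, fail_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both configurations share one failure rate."""
    p_a, p_b = fail_a / n_a, fail_b / n_b
    pooled = (fail_a + fail_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # No variation at all (every trial failed, or none did): no evidence of a difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 2000 trials per configuration, chosen only so the
# rates match the reported 5.8% (1x) and 2.1% (3x) infrastructure error rates.
z, p = two_proportion_z_test(fail_a=116, n_a=2000, fail_b=42, n_b=2000)
print(f"z = {z:.2f}, p = {p:.2g}")
```

The same function applies to success rates; only the counts passed in change.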

From 1x to 3x, the success score fluctuates within the noise range (p = 0.40). Most tasks that crash at 1x would have failed anyway, as the data show. The agent explores, hits a resource wall, and is preempted, but it was never on a path to the correct solution.

Starting at around 3x, however, the trend changes: the success rate climbs faster than the infrastructure error rate declines.

Between 3x and uncapped, infrastructure errors drop by an additional 1.6 percentage points, while the success rate jumps by nearly 4 percentage points. The extra resources let the agent try approaches that only work with generous allocations, such as pulling in large dependencies, spawning expensive subprocesses, and running memory-intensive test suites. With uncapped resources, the total lift over 1x is +6 percentage points (p < 0.01). At the margin, tasks such as `rstan-to-pystan` and `compile-compcert` see significantly higher success rates when given memory headroom.

Up to roughly 3x the Terminal-Bench specs, the additional resources fix an infrastructure reliability problem, namely transient resource spikes. The sandboxing provider used by the Terminal-Bench maintainers does this implicitly behind the scenes; the eval becomes more stable without becoming easier.

Beyond the 3x mark, however, additional resources start to actively help the agent solve problems it could not solve before, which shows that the limits change what the eval actually measures. Tight limits inadvertently reward very efficient strategies, while more generous limits reward agents that are better at exploiting all available resources.

An agent that writes lean, efficient code very quickly will do well under tight constraints. An agent that brute-forces solutions with heavyweight tools will do well under generous ones. Both are legitimate things to test, but collapsing them into a single score without specifying the resource configuration makes the differences, and their real-world generalizability, hard to interpret.

Consider `bn-fit-modify`, a Terminal-Bench task that requires fitting a Bayesian network. The first move of some models is to install the standard Python data science stack: pandas, networkx, scikit-learn, and all their toolchains. Under generous limits, this works. Under tight ones, the pod runs out of memory during installation, before the agent has written a single line of solution code. A leaner strategy exists (implementing the math from scratch using only the standard library), and some models default to it. Others do not. Different models have different default approaches, and the resource configuration determines which of those approaches happen to succeed.

We replicated the core finding across different Anthropic models. The effect points in the same direction, but its magnitude varies. The same trend appears to hold for models other than Claude, but we have not yet tested them rigorously.

We also tested whether this pattern holds for evals other than Terminal-Bench by running a crossover experiment on SWE-bench. We varied the total available RAM up to 5x the baseline across 227 problems (10 samples per problem). The same effect held, although it was smaller: scores again increased monotonically with RAM, but were only 1.54 percentage points higher at 5x than at 1x. SWE-bench tasks are less resource-intensive, so a smaller effect is expected, but the result shows that resource allocation is not neutral there either.

Resource allocation is not the only hidden variable. In some configurations, time limits also begin to play a role.

In principle, every element of the eval setup can affect the final score, from cluster health to hardware specs, from concurrency level to egress bandwidth. Agentic evals are end-to-end system tests by construction, and any component of that system can act as a confounder. For example, we have observed anecdotally that pass rates fluctuate with the time of day, probably because API latency varies with traffic patterns and incidents. We have not yet formally quantified this effect, but it illustrates a broader point: the line between “model capability” and “infrastructure behavior” is blurrier than a single benchmark score implies. A model provider can shield its eval infrastructure from this effect with dedicated hardware, but external evaluators cannot easily do the same.

Public benchmarks are usually meant to measure pure model capability, but in practice they risk conflating it with infrastructure quirks. Sometimes this is desirable, because it enables end-to-end testing of the whole stack, but more often it is not. For coding evals that are meant to be shared publicly, running them multiple times across multiple days helps average out the noise.
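
As a rough illustration of averaging out run-to-run noise, the sketch below summarizes repeated runs of the same eval by their mean and standard error. The function name and the pass rates are hypothetical; no real measurements are shown.

```python
import math
import statistics


def summarize_runs(pass_rates: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) over repeated runs of one eval."""
    if len(pass_rates) < 2:
        raise ValueError("need at least two runs to estimate run-to-run noise")
    mean = statistics.mean(pass_rates)
    # Sample standard deviation across runs, scaled down by sqrt(number of runs).
    sem = statistics.stdev(pass_rates) / math.sqrt(len(pass_rates))
    return mean, sem


# Hypothetical pass rates from four runs taken on different days.
mean, sem = summarize_runs([0.50, 0.52, 0.48, 0.50])
print(f"pass rate = {mean:.3f} +/- {sem:.3f} (standard error)")
```

Reporting the standard error alongside the mean lets a reader judge whether a gap between two models is larger than the variation between runs of the same model.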

Ideally, each eval would run under exactly the same hardware conditions, for both the scaffold that runs the eval and the inference stack, since this would ensure full reproducibility. In practice, this may not always be feasible.

Given how container runtimes actually enforce resources (through a guaranteed allocation and a separate hard kill threshold), we recommend that evals specify both parameters for each task instead of a single pinned value. A single exact spec sets the guaranteed allocation equal to the kill threshold, leaving zero margin: the transient memory spikes we recorded at 1x are enough to destabilize the eval. Separating the two parameters gives containers enough breathing room to avoid spurious OOM kills, while still enforcing a hard ceiling that prevents score inflation.

The band between the two should be calibrated so that scores at the floor and at the ceiling fall within noise of each other. For example, in Terminal-Bench 2.0, a ceiling of 3x the per-task spec cut the infrastructure error rate by about two thirds (5.8% to 2.1%, p < 0.001) while keeping the score lift modest and well within noise (p = 0.40). This is a reasonable trade-off: the infrastructure confounder is largely neutralized without removing meaningful resource pressure. The exact multiplier will vary with the benchmark and the task distribution and should therefore be reported, but the principle of empirical calibration is general.

The practical consequences of these findings go beyond eval infrastructure. Benchmark scores are increasingly used as inputs to decision-making, but this growing attention (and reliance) has not always been matched by corresponding rigor in how the evals are run or reported. As things stand, a 2-point lead on a leaderboard may reflect a real capability difference, or it may reflect that one eval ran on more powerful hardware, or even at a luckier time of day, or both. Without published (or standardized) configurations, it is hard to tell from the outside unless interested parties go the extra mile to reproduce the results under identical conditions.

For labs such as Anthropic, this means that resource configuration for agentic evals should be treated as a first-class experimental variable, documented and controlled with the same rigor as prompt format or sampling temperature. For benchmark maintainers, publishing recommended resource specs (as Terminal-Bench 2.0 does) goes a long way, and specifying the enforcement methodology would close the gap we identified. For anyone who consumes benchmark results, the central point is that small differences on agentic evals carry more uncertainty than the precision of the reported numbers implies, especially since some confounders are hard to control.

Until resource methodology is standardized, our data suggest that leaderboard differences below 3 percentage points deserve skepticism unless the eval configuration is documented and matched. The spread observed across the moderate range of resource configurations in Terminal-Bench is just under 2 percentage points. Naive binomial confidence intervals already span 1-2 percentage points; the infrastructure confounders we document here stack on top of that, not within it. At the extremes of the allocation range, the spread reaches 6 percentage points.
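
For reference, the naive binomial interval mentioned above can be computed with the normal (Wald) approximation. This is a minimal sketch: the trial count in the example is hypothetical, and the interval captures sampling noise only, not the infrastructure effects discussed in this post.

```python
import math


def binomial_ci_half_width(pass_rate: float, n_trials: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation (Wald) interval for a pass rate.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n_trials)


# Hypothetical example: a 50% pass rate measured over 5000 task attempts.
half_width = binomial_ci_half_width(pass_rate=0.50, n_trials=5000)
print(f"95% interval: +/- {100 * half_width:.2f} percentage points")
```

Because the half-width shrinks only with the square root of the number of trials, narrowing the interval by half requires four times as many attempts.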

A lead of a few points may signal a real capability gap, or it may simply reflect a bigger virtual machine.

Written by Gian Saigato. Special thanks go to Nicholas Carlini, Jeremy Hadfield, Mike Merrill and Alex Shaw for their contributions. This work reflects the collective efforts of several teams working on coding agent evals. Interested candidates are invited to apply at anthropic.com/careers.


![Quantification of infrastructure noise in agentic coding evaluation](https://cdn.sanity.io/images/4zrzovbb/website/16ff27ee039d98cf3391433f6f3df23aa96b1e91-2400x1260.png)