Written by Nicholas Carlini, a researcher on our Safeguards team.

I have been experimenting with a new approach to supervising language models that we call "agent teams". With agent teams, several Claude instances work in parallel on a shared codebase without active human intervention. This approach greatly expands the scope of what LLM agents can achieve.

To stress-test it, I asked 16 agents to write a Rust-based C compiler from scratch, one capable of compiling the Linux kernel. Over nearly 2,000 Claude Code sessions and $20,000 in API costs, the agent team produced a 100,000-line compiler that can build Linux 6.9 on x86, ARM and RISC-V.

The compiler is an interesting artifact in itself, but I focus here on what I learned about designing harnesses for long-running autonomous agent teams: how to write tests that keep agents on track without human supervision, how to structure the work so that several agents can make progress in parallel, and where this approach reaches its limits.

Existing agent scaffolds such as Claude Code require an operator to be online and available to work alongside the model. If you ask for a solution to a long and complex problem, the model may solve part of it, but at some point it stops and waits for further input: a question, a status update or a request for clarification.

To get sustained, autonomous progress, I built a harness that keeps Claude in a simple loop (if you have seen the Ralph-loop, this should look familiar). When Claude finishes one task, it immediately picks up the next. (Run this in a container, not on your actual machine.)

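    #!/bin/bash

    # Restart Claude forever; each session writes to its own log file.
    while true; do
        # Name the log after the commit the session starts from.
        COMMIT=$(git rev-parse --short=6 HEAD)
        LOGFILE="agent_logs/agent_${COMMIT}.log"

        claude --dangerously-skip-permissions \
               -p "$(cat AGENT_PROMPT.md)" \
               --model claude-opus-X-Y &> "$LOGFILE"
    done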

In the agent prompt, I tell Claude what problem to solve and ask it to approach the problem by breaking it into small pieces, tracking what it is currently working on, figuring out what to work on next, and effectively continuing until the result is perfect. (The loop runs forever, although in one case I saw Claude accidentally run `pkill -9 bash`, killing itself and ending the loop. Whoops!)

Running several instances in parallel addresses two weaknesses of a single-agent system.

My implementation of parallel Claude is simple. A new bare Git repository is created, and for each agent a Docker container is started with the repository mounted at `/upstream`. Each agent clones a local copy to `/workspace` and, when it is done, pushes from its own local container to upstream.

To prevent two agents from trying to solve the same problem at the same time, the harness uses a simple synchronization algorithm.
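As an illustration of what such a synchronization step can look like, here is a minimal sketch in which an agent claims a task before starting work. It is an assumption for illustration, not the harness's actual algorithm: it swaps in an atomic `mkdir` on a directory visible to every agent, and the names `claim_task`, `release_task`, `LOCK_ROOT`, the agent IDs and the task names are all hypothetical.

```bash
#!/bin/bash
# Sketch only: LOCK_ROOT, the agent IDs and the task names are hypothetical.

# Try to claim a task. mkdir is atomic, so if two agents race for the same
# task exactly one succeeds and the other gets a non-zero exit status.
claim_task() {
    local lock_root="$1" task="$2" agent="$3"
    mkdir -p "$lock_root"
    if mkdir "$lock_root/$task.lock" 2>/dev/null; then
        echo "$agent" > "$lock_root/$task.lock/owner"
        return 0
    fi
    return 1
}

# Release a task once its work has been pushed.
release_task() {
    local lock_root="$1" task="$2"
    rm -rf "${lock_root:?}/${task:?}.lock"
}

# Demo: both agents prefer the same task; the second has to pick another.
LOCK_ROOT="$(mktemp -d)"
for agent in agent-1 agent-2; do
    for task in parser codegen; do
        if claim_task "$LOCK_ROOT" "$task" "$agent"; then
            echo "$agent claimed $task"
            break
        fi
    done
done
```

The sketch relies on every agent seeing the same filesystem, for example through a shared bind mount. Agents that share nothing but a Git remote would need an equivalent claim step that goes through the repository instead.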

This is a very early research prototype. I have not yet implemented any other method of communication between agents, nor do I enforce any process for managing high-level goals. I don't use an orchestration agent. Instead, I leave it to each Claude agent to decide how to proceed. In most cases, Claude picks up the "most obvious" problem. When it gets stuck on a bug, Claude often maintains a running document of failed approaches and remaining tasks. In the project's Git repository, you can read through the history and watch agents take out locks on different tasks.

The scaffold runs Claude in a loop, but that loop is only useful if Claude can tell how to make progress. Most of my effort went into designing the environment around Claude (the tests, the environment, the feedback) so that it could orient itself without me. These are the approaches I found most helpful when orchestrating several Claude instances.

Claude will work autonomously to solve whatever problem I give it. It is therefore important that the task verifier is nearly perfect; otherwise Claude will solve the wrong problem. Improving the test harness required finding high-quality compiler test suites, writing verifiers and build scripts for open source software packages, watching for mistakes Claude was making, and then designing new tests as I identified those failure modes.

Towards the end of the project, for example, Claude started to frequently break existing functionality each time it implemented a new feature. To address this, I built a continuous integration pipeline and implemented stricter enforcement that allowed Claude to better test its work, so that new commits could not break existing code.

I had to keep reminding myself that I was writing this test harness for Claude and not for myself, which meant rethinking many of my assumptions about how tests should communicate results.
For example, each agent is dropped into a fresh container with no context and needs a lot of time to orient itself, especially on large projects. Before we even get to the tests, I added instructions that help Claude help itself: maintain extensive READMEs and progress files, and update them regularly with the current status.

I also kept in mind that language models have inherent limitations that, in this case, had to be designed around. One such measure is a `--fast` option that runs only a random 1% or 10% sample. The subsample is deterministic per agent but random across VMs, so Claude still covers all files while every agent can perfectly identify regressions.

When there are many distinct failing tests, parallelization is trivial: each agent picks a different failing test to work on. After the test suite reached a 99% pass rate, each agent worked on getting a different small open source project to compile (e.g. SQLite, Redis, libjpeg, MQuickJS, Lua). But when agents started to compile the Linux kernel, they got stuck. Unlike a test suite with hundreds of independent tests, compiling the Linux kernel is one giant task. Every agent hit the same bug, fixed that bug, and then overwrote the others' changes. Running 16 agents did not help, because each one was stuck solving the same task.

The fix was to use GCC as an online known-good compiler oracle to compare against. I wrote a new test harness that randomly compiled most of the kernel with GCC and only the remaining files with Claude's C compiler. If the kernel worked, the problem was not in Claude's subset of the files. If it broke, the harness could narrow things down further by recompiling some of those files with GCC. This let each agent work in parallel, fixing different bugs in different files, until Claude's compiler could eventually compile all files. (After this worked, it was still necessary to apply delta debugging techniques to find pairs of files that failed together but worked independently.)
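The narrowing step of the GCC-oracle approach can be sketched as a simple bisection. This is a simplified illustration, not the harness described above: it assumes exactly one file is miscompiled, and `build_and_boot`, `find_culprit` and the file names are hypothetical stand-ins. In a real setup, `build_and_boot` would compile the listed files with the new compiler and everything else with GCC, then boot the result.

```bash
#!/bin/bash
# Hypothetical stand-in for "compile these files with the new compiler and
# the rest with GCC, then boot the kernel". For the demo it simply fails
# whenever the made-up file sched.c is in the set.
build_and_boot() {
    local f
    for f in "$@"; do
        [ "$f" = "sched.c" ] && return 1
    done
    return 0
}

# Narrow a failing set of files down to one culprit by halving it.
# Assumes a single file is responsible; files that only fail in
# combination need delta debugging instead.
find_culprit() {
    local files=("$@")
    while [ "${#files[@]}" -gt 1 ]; do
        local half=$(( ${#files[@]} / 2 ))
        local left=("${files[@]:0:half}")
        local right=("${files[@]:half}")
        if build_and_boot "${left[@]}"; then
            files=("${right[@]}")   # left half is clean, so look right
        else
            files=("${left[@]}")    # failure reproduces in the left half
        fi
    done
    echo "${files[0]}"
}

find_culprit fork.c exit.c sched.c signal.c timer.c   # prints sched.c
```

Each round halves the number of files built with the compiler under test, so a culprit among n files is found in about log2(n) builds. Because different agents can start from different random subsets, they tend to land on different bugs rather than all chasing the same one.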

Parallelism also enables specialization. LLM-written code frequently re-implements existing functionality, so I tasked one agent with merging any duplicate code it found. I put another in charge of improving the performance of the compiler itself, and made a third responsible for emitting efficient compiled code. I asked another agent to critique the design of the project from the perspective of a Rust developer and make structural changes to improve overall code quality, and another to work on documentation.

This project was designed as a capability benchmark. I am interested in stress-testing the limits of what LLMs can just barely achieve today, to help us prepare for what models will reliably achieve in the future.

I have been using the C compiler project as a benchmark across the entire Claude 4 model series. As with previous projects, I started by drafting what I wanted: a from-scratch optimizing compiler with no dependencies, GCC-compatible, able to compile the Linux kernel, and designed to support multiple backends. Although I specified some aspects of the design (e.g. that it should have an SSA IR to enable multiple optimization passes), I did not go into any detail on how to do so.

Earlier Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler that could pass large test suites, but it was still incapable of compiling any really large projects. My goal with Opus 4.6 was to test the limits again.

Over nearly 2,000 Claude Code sessions across two weeks, Opus 4.6 consumed 2 billion input tokens and generated 140 million output tokens, for a total cost of just under $20,000. Compared to even the most expensive Claude Max plans, this was an extremely expensive project. But that total is a fraction of what it would cost me to produce this myself, let alone an entire team.
This was a clean-room implementation (Claude had no internet access at any point during development); it depends only on the Rust standard library. The 100,000-line compiler can build a bootable Linux 6.9 on x86, ARM and RISC-V. It can also compile QEMU, FFmpeg, SQLite, Postgres and Redis, and has a 99% pass rate on most compiler test suites, including the GCC torture test suite. It also passes the developer's ultimate litmus test: it can compile and run Doom.

The compiler, however, is not without limitations.

The resulting compiler has nearly reached the limits of Opus's abilities. I tried (hard!) to fix several of its limitations but was not fully successful. New features and bug fixes frequently broke existing functionality.

One particularly challenging example: Opus was unable to implement the 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60 KB, far exceeding the 32 KB code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase. (This is only the case for x86. For ARM or RISC-V, Claude's compiler can compile everything by itself.)

The source code for the compiler is available. Download it, read through the code and try it on your favorite C projects. I have consistently found that the best way to understand what language models can do is to push them to their limits and then study where they start to break down. Over the coming days I will continue to let Claude push new changes, if you want to follow its continued attempts to address these limitations.

Each generation of language models opens up new ways of working with them. Early models were useful for tab completion in IDEs. Before long, models could complete a function body from its docstring. The launch of Claude Code brought agents into the mainstream and enabled developers to pair-program with Claude. But each of these products operates under the assumption that a user defines a task, an LLM runs for a few seconds or minutes and returns an answer, and the user then provides a follow-up.

Agent teams show the possibility of implementing entire, complex projects autonomously. This allows us, as users of these tools, to become more ambitious with our goals.

We are still early, and fully autonomous development comes with real risks. When a human sits alongside Claude during development, they can ensure consistent quality and catch errors in real time. With autonomous systems, it is easy to see tests pass and assume the job is done, when this is rarely the case. I used to work in penetration testing, exploiting vulnerabilities in products made by large companies. The thought of programmers deploying software they have never personally verified is a real concern.

So while this experiment excites me, it also leaves me feeling uneasy. Building this compiler has been some of the most fun I have had recently, but I did not expect this to be anywhere near possible so early in 2026.
The rapid progress in both language models and the scaffolds we use to interact with them opens the door to writing an enormous amount of new code. I expect the positive applications to outweigh the negative, but we are entering a new world that will require new strategies to navigate safely.

Special thanks to Josef Bacik, Edwin Chen, Bernardo Meurer Costa, Jake Eaton, Dan Kelley, Felix Klock, Jannet Park, Steve Weis and many others across Anthropic for their help and contributions.

![Building a C compiler with a team of parallel Claude](https://cdn.sanity.io/images/4zrzovbb/website/6cc87859f5453e9481278681aa6409856d61153c-2400x1260.png)