- Published on
Effective Harness for Long-Running Agents
- Authors

- Name
- aimode.news
- @aimode_news
As AI agents become more capable, developers are increasingly asking them to take on complex tasks that take hours or even days. However, getting agents to make consistent progress across multiple context windows remains an open problem.
The core challenge for long-running agents is that they must work in discrete sessions, and each new session begins with no memory of what came before. Imagine a software project staffed by engineers working in shifts, where every new engineer arrives with no memory of what happened on the previous shift. Because context windows are limited, and because most complex projects cannot be completed within a single window, agents need a way to bridge the gap between coding sessions.
We developed a two-part solution that enables the Claude Agent SDK to work effectively across multiple context windows: an initializer agent that sets up the environment on the first run, and a coding agent that is tasked with making incremental progress in every session while leaving clear artifacts for the next session. You can find code examples in the accompanying quickstart.
The Claude Agent SDK is a powerful, general-purpose agent harness that is adept at coding, as well as at other tasks that require the model to use tools to gather context, plan, and execute. It has context management capabilities such as compaction, which lets an agent work on a task without exhausting its context window. In theory, with this setup an agent should be able to keep doing useful work for an arbitrarily long time.
However, compaction alone is not enough. Out of the box, even a frontier coding model like Opus 4.5 running on the Claude Agent SDK in a loop across multiple context windows will fall short of building a production-quality web app if it is given only a high-level prompt, such as "build a clone of claude.ai."
Claude's failures took two forms. First, the agent tended to try to do too much at once, essentially attempting to one-shot the app. This often led to the model running out of context in the middle of its implementation, so the next session started with a feature that was half-implemented and undocumented. The next agent then had to guess at what had happened and spent substantial time trying to get the basic app working again. This happens even with compaction, which does not always pass perfectly clear instructions to the next agent.
A second failure mode usually occurs later in a project. After some features have been built, a later agent instance will look around, see that progress has been made, and declare the job done.
This decomposes the problem into two parts. First, we need to set up an initial environment that lays the foundation for all the features a given prompt requires, so that the agent can work step by step and feature by feature. Second, we should prompt each agent to make incremental progress toward its goal while leaving the environment in a clean state at the end of the session. By "clean state" we mean the kind of code that would be appropriate for merging into a main branch: there are no major bugs, the code is orderly and well documented, and in general a developer could easily begin work on a new feature without first having to clean up an unrelated mess.
In our internal experiments, we addressed these problems using two-part solutions:
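Initializer agent: The very first agent session uses a specialized prompt that asks the model to set up the initial environment: an init.sh script, a claude-progress.txt file that keeps a log of what agents have done, and an initial git commit that shows which files were added.
Coding agent: Every subsequent session asks the model to make incremental progress and then leave clear updates for the next session.
The key insight here was finding a way for agents to quickly understand the state of the work when starting with a fresh context window, which is accomplished with the claude-progress.txt file alongside the git history. These practices are inspired by what effective software engineers do every day.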
In the updated Claude 4 prompting guide, we shared some best practices for multi-context-window workflows, including a harness structure that uses a different prompt for the very first context window. This "different prompt" asks the initializer agent to set up the environment with all the context that future coding agents will need to work effectively. Here, we look more closely at some of the key components of such an environment.
To address the problem of the agent one-shotting an app or prematurely considering the project complete, we prompted the initializer agent to write a comprehensive file of feature requirements that expands on the user's initial prompt. In the claude.ai clone example, this meant over 200 features, such as "a user can open a new chat, type in a query, press Enter, and see an AI response." These features were all initially marked as "failing" so that later coding agents would have a clear picture of what full functionality looked like.
{
    "category": "functional",
    "description": "New chat button creates a fresh conversation",
    "steps": [
        "Navigate to main interface",
        "Click the 'New Chat' button",
        "Verify a new conversation is created",
        "Check that chat area shows welcome state",
        "Verify conversation appears in sidebar"
    ],
    "passes": false
}
We prompt coding agents to edit this file only by changing the status of the passes field, and we use strongly worded instructions such as "It is unacceptable to remove or edit tests because this could lead to missing or buggy functionality." After some experimentation, we landed on JSON for this file, because the model is less likely to inappropriately change or overwrite a JSON file than a Markdown file.
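The edit rule for the feature file can also be checked mechanically. The following is a minimal sketch under stated assumptions, not our harness's implementation: it assumes the feature file is a JSON array of objects shaped like the example, each with a unique description and a boolean field named `passes`, and the helper name `illegal_changes` is hypothetical. It compares the file before and after a coding session and reports any edit other than a changed `passes` value.

```python
import json


def illegal_changes(before_json: str, after_json: str) -> list[str]:
    """Return descriptions of edits other than a changed `passes` value.

    Hypothetical helper: assumes both strings hold a JSON array of feature
    objects, each with a unique `description` and a boolean `passes` field.
    """
    before = {f["description"]: f for f in json.loads(before_json)}
    after = {f["description"]: f for f in json.loads(after_json)}
    problems = []
    for description, old in before.items():
        new = after.get(description)
        if new is None:
            problems.append(f"removed or renamed: {description}")
            continue
        # Compare everything except the one field the agent may edit.
        old_rest = {k: v for k, v in old.items() if k != "passes"}
        new_rest = {k: v for k, v in new.items() if k != "passes"}
        if old_rest != new_rest:
            problems.append(f"edited: {description}")
        if not isinstance(new.get("passes"), bool):
            problems.append(f"non-boolean passes: {description}")
    # Entries that appear only in the new file were not part of the spec.
    for description in sorted(after.keys() - before.keys()):
        problems.append(f"added: {description}")
    return problems
```

A harness could run a check like this after each session and revert the file if the returned list is not empty.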
Given this initial environment scaffolding, the next iteration of the coding agent was asked to work on only one feature at a time. This incremental approach turned out to be critical to addressing the agent's tendency to do too much at once.
Even when working incrementally, it is still essential that the model leaves the environment in a clean state after making a code change. In our experiments, we found that the best way to elicit this behavior was to ask the model to commit its progress to git with descriptive commit messages and to write a summary of its progress in the progress file. This allowed the model to use git to revert bad code changes and recover a working state of the codebase.
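A progress file can be as simple as an append-only text log. The sketch below shows one possible shape, assuming the file is named claude-progress.txt as in our setup. The entry layout, the separator, and the helper names `append_progress` and `recent_progress` are illustrative assumptions, not the format our agents actually wrote.

```python
from datetime import datetime, timezone
from pathlib import Path

ENTRY_SEPARATOR = "\n---\n"  # assumed delimiter between session entries


def append_progress(path: Path, feature: str, summary: str) -> None:
    """Append one session entry to the progress file, creating it if needed."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    entry = f"[{stamp}] {feature}\n{summary.strip()}"
    with path.open("a", encoding="utf-8") as handle:
        handle.write(entry + ENTRY_SEPARATOR)


def recent_progress(path: Path, count: int = 3) -> list[str]:
    """Return the last `count` entries, oldest first; empty if no file yet."""
    if not path.exists():
        return []
    text = path.read_text(encoding="utf-8")
    entries = [entry.strip() for entry in text.split(ENTRY_SEPARATOR)]
    return [entry for entry in entries if entry][-count:]
```

The same summary text could double as the git commit message, which keeps the progress file and the commit history telling the same story.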
These approaches also increased efficiency, because they eliminated the need for an agent to guess at what had happened and spend its time trying to get the basic app working again.
One final major failure mode we observed was Claude's tendency to mark a feature as complete without proper testing. Absent explicit prompting, Claude tended to make code changes, and even to test them with unit tests or curl commands against a development server, but would fail to recognize that the feature did not work end-to-end.
In the case of building a web app, Claude mostly did well at verifying features end-to-end once it was explicitly prompted to use browser automation tools and to do all testing as a human user would.
Providing Claude with these kinds of testing tools dramatically improved performance, because the agent was able to identify and fix bugs that were not obvious from the code alone.
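One way to make end-to-end verification hard to skip is to let the test result, rather than the agent's own judgment, set the flag. The sketch below is illustrative only. It assumes each feature carries a list of steps and a boolean `passes` field, and that `check_step` is a callable you supply that drives the browser and reports whether one step succeeded. Both that callable and the function name `verify_feature` are hypothetical.

```python
from typing import Callable


def verify_feature(feature: dict, check_step: Callable[[str], bool]) -> dict:
    """Return a copy of `feature` with `passes` set from end-to-end checks.

    `check_step` is a stand-in for real browser automation: it receives one
    step description and returns True only if that step succeeded.
    """
    steps = feature["steps"]
    # all() stops at the first failing step. A feature with no steps is
    # never marked passing, because nothing was actually verified.
    passed = bool(steps) and all(check_step(step) for step in steps)
    updated = dict(feature)
    updated["passes"] = passed
    return updated
```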
Some issues remain, such as limitations in Claude's vision and in browser automation tools that make it difficult to identify every kind of bug. For example, Claude cannot see browser-native alert modals through the Puppeteer MCP, and features that rely on these modals tended to be buggier as a result.
With all of the above in place, every coding agent is prompted to run through a series of steps to get its bearings, some of them quite basic but still helpful:
1. Run pwd to see the directory you are working in. You can only edit files in this directory.
2. Read the git logs and the progress file to get up to speed on what was recently worked on.
3. Read the feature list file and choose a single unfinished feature to work on.
This approach saves Claude some tokens in every session, because it does not have to figure out how to test the code. It also helps to ask the initializer agent to write an init.sh script that can run the development server, and then to run through a basic end-to-end test before implementing a new feature.
In the case of the claude.ai clone, this meant that the agent always started the local development server and used the Puppeteer MCP to start a new chat, send a message, and receive a response. This ensured that Claude could quickly tell whether the app had been left in a broken state and could immediately fix any existing bugs. Had the agent instead started implementing a new feature, it would likely have made the problem worse.
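The ordering matters here: check the app first, and only then pick up new work. The following is a minimal sketch of that decision, assuming the feature list is already loaded as a list of dictionaries with a boolean `passes` field and that the result of the basic end-to-end test is known. The function `next_action` and its return strings are illustrative names, and treating list order as priority order is our assumption for the example.

```python
from typing import Optional


def next_action(smoke_test_ok: bool, features: list[dict]) -> tuple[str, Optional[str]]:
    """Decide what a coding session should do first.

    Returns an (action, feature_description) pair. A broken app always takes
    priority over new work; otherwise the first feature still marked as
    failing is chosen.
    """
    if not smoke_test_ok:
        return ("fix_existing_bugs", None)
    for feature in features:
        if not feature["passes"]:
            return ("implement_feature", feature["description"])
    return ("all_features_passing", None)
```

Returning exactly one feature per session mirrors the one-feature-at-a-time rule, and the final case gives the agent an explicit, checkable condition for declaring the project done.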
Given all this, a typical session starts off with the following assistant messages:
[Assistant] I'll start by getting my bearings and understanding the current state of the project.
[Tool Use]
[Tool Use]
[Assistant] Let me check the git log to see recent work.
[Tool Use]
<Starts the development server>
[Assistant] Excellent! Now let me navigate to the application and verify that some fundamental features are still working.
[Assistant] Based on my verification testing, I can see that the fundamental functionality is working well. The core chat features, theme switching, conversation loading, and error handling are all functioning correctly. Now let me review the tests.json file more comprehensively to understand what needs to be implemented next.
<Starts work on a new feature>
Agent failure modes and solutions
| Problem | Initializer agent behavior | Coding agent behavior |
|-|-|-|
| Claude declares victory on the entire project too early. | Set up a feature list file: based on the input spec, set up a structured JSON file with a list of end-to-end feature descriptions. | Read the feature list file at the beginning of a session. Choose a single feature to start working on. |
| Claude leaves the environment in a state with bugs or undocumented progress. | Write an initial git repository and a progress notes file. | Start the session by reading the progress notes file and git commit logs, and run a basic test on the development server to catch any undocumented bugs. End the session by writing a git commit and a progress update. |
| Claude marks features as done prematurely. | Set up a feature list file. | Self-verify all features. Only mark a feature as "passing" after careful testing. |
| Claude has to spend time figuring out how to run the app. | Write an init.sh script that can run the development server. | Start the session by reading init.sh. |
This research demonstrates one possible set of solutions in a long-running agent harness that enables the model to make incremental progress across many context windows. However, open questions remain.
Most notably, it is still unclear whether a single, general-purpose coding agent performs best across contexts, or whether better performance can be achieved through a multi-agent architecture. It seems reasonable that specialized agents, such as a testing agent, a quality assurance agent, or a code cleanup agent, could do an even better job at subtasks across the software development lifecycle.
Additionally, this demo is optimized for full-stack web app development. A future direction is to generalize these findings to other fields. It is likely that some or all of these lessons can be applied to the kinds of long-running agentic tasks required in, for example, scientific research or financial modeling.
Prepared by Justin Young. Special thanks go to David Hershey, Prithvi Rajasakeran, Jeremy Hadfield, Naia Bouscal, Michael Tinsley, Jesse Mu, Jake Eaton, Marius Buleendara, Maggie Vo, Pedro Navid, Nadine Yasser and Alex Notov for their contributions.
This work reflects the collective efforts of several teams across Anthropic who made it possible for Claude to safely do long-horizon autonomous software engineering, especially the RL and Claude Code teams. Interested candidates are welcome to apply at anthropic.com/careers.
1. We refer to these as separate agents in this post only because they have different initial user prompts. The system prompt, the set of tools, and the overall agent harness were otherwise identical.
