Long-Running Agents – O’Reilly

June 9, 2026

The following article originally appeared on Addy Osmani’s blog and is being reposted here with the author’s permission.

A long-running AI agent can keep making progress over hours, days, or weeks. It can do this across many context windows and sandboxes, recover from failure, leave structured artifacts behind, and resume where it left off.

For two years the dominant image of an “AI agent” has been a chat window with a clever loop in it. You type a goal; the agent calls some tools; you watch tokens stream by; you stop watching when the work runs out of stamina or the context window fills up. That paradigm got us a long way, but it has a ceiling. The model forgets. It declares “task complete” when it isn’t. It reintroduces a bug it fixed nine turns ago. The whole thing is structured around a single sitting.

Long-running agents are what comes next. The idea is easy to state: an agent that keeps making forward progress on a goal across many sessions and many sandboxes, possibly over many days or weeks, while leaving the workspace clean enough that the next session can pick up where the last one left off. The engineering is harder. You have to solve for persistence, recovery, and verification in a way that doesn’t just paper over the cracks. You have to build a state layer that lives outside the model’s context window, and you have to design the handoff between sessions so the agent doesn’t lose its mind when it wakes up and finds itself in a different sandbox with a different context window.

This post is my attempt to lay out what’s changed, who’s pushing on it, and how an engineer can use long-running agents today without writing the whole thing from scratch.

What “long-running” actually means

“Long-running” is used to mean at least three different things in practice, and it helps to keep them separate.

Long-horizon reasoning. The agent has to plan and execute over many dependent steps. This is mostly a model-quality story: coherence, planning, the ability to recover from a wrong turn 10 steps ago. METR has been tracking this with their time horizon metric, which estimates how long a task a frontier model can complete with 50% reliability. The headline finding is that the metric has been doubling roughly every seven months since 2019, and their TH1.1 update earlier this year doubled the count of eight-hour-plus tasks in the eval set. If that curve holds, frontier agents complete tasks on the day scale by 2028 and the year scale by 2034.

Long-running execution. The agent’s process runs for hours or days. Maybe it’s a coding job, maybe it’s a research sweep, maybe it’s a 24-7 monitoring service. The model may be invoked thousands of times during the run. This is mostly a harness story, and it’s the one this post is mostly about.

Persistent agency. The agent has an identity that outlives any single task. It accumulates memory, learns user preferences, and is always available. This is the Memory Bank flavor of long-running.

In practice the three blur together. A real production agent does long-horizon reasoning inside a long-running execution backed by persistent agency. But the engineering problems are different in each, and so are the products that solve them.

Why this matters

There are two reasons I believe this work matters a lot right now.

The first is a phase change in what’s economically feasible to delegate. An agent that runs for 10 minutes can answer a question, summarize a document, fix a small bug. An agent that runs for 10 hours can own an entire feature, finish a migration that was on the backlog for six quarters, or do the kind of overnight research sweep that used to require a junior analyst. One of Anthropic’s Claude Sonnet announcements put concrete numbers on this last fall: 30+ hours of autonomous coding in internal tests, including one run that produced an 11,000-line Slack-style app. That’s already past the threshold where the answer to “Should I delegate this?” is not obvious.

The second is that persistence changes what the agent is. A stateless agent answers your question and disappears. A long-running one accumulates context: which competitor moved which way last week, which test flaked twice on Tuesday, what you usually mean by “the dashboard.” Anthropic’s Project Vend was the most public early demonstration of this. They had a Claude instance run an actual office vending business for a month, managing inventory, setting prices, talking to suppliers. It failed in informative ways, and the second phase ran much better, but the point wasn’t profitability. The point was watching what kinds of weird coherence problems show up when an agent has to maintain identity across weeks instead of turns.

These are the same problems every team building production agents now hits.

The three walls every long-running agent hits

Three walls show up in basically every write-up I’ve read this year.

Finite context. Even a 1M-token window fills. And context rot, the steady degradation of model performance as the window gets full, kicks in well before the hard limit. A 24-hour run is not going to fit in any context window the field has on its roadmap. Something has to give.

No persistent state. A new session starts blank. Anthropic’s framing in their scientific computing post is the cleanest version I’ve seen: “Imagine a software project staffed by engineers working in shifts, where each new engineer arrives with no memory of what happened on the previous shift.” Without an explicit persistence story, every shift change is a productivity disaster.

No self-verification. Models reliably skew optimistic when they grade their own work. Asked “Are you done?” they answer “yes” more often than they should. Without a separate signal that the work meets a bar, you get the agent that ships at 30% complete with full confidence.

Long-running agent designs are mostly answers to these three problems. The major labs have converged on similar shapes of answer, but with very different surface area.

The Ralph loop: One of the simpler practitioner versions of long-running agents

The Ralph loop (sometimes called the Ralph Wiggum technique) is one of the simpler practitioner versions of long-running agents, popularized by Geoffrey Huntley and Ryan Carson. The reference implementation is essentially a bash script that loops:

1. Pick the next unfinished task from a list (prd.json or equivalent).
2. Build a prompt with the task, the relevant context, and any persistent notes.
3. Call the agent.
4. Run tests or other checks.
5. Append what happened to progress.txt.
6. Update the task list (done, failed, blocked).
7. Return to step 1.
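
To make the steps above concrete, here is a minimal Python sketch of the same loop. It is not the reference bash script. The prd.json schema (a list of objects with id, title, and status fields) and the call_agent and run_checks callables are assumptions made for illustration; in practice they would wrap whatever agent CLI and test command you use.

```python
import json
from pathlib import Path
from typing import Callable


def ralph_loop(
    workdir: Path,
    call_agent: Callable[[str], str],   # hypothetical: sends a prompt, returns a summary
    run_checks: Callable[[], bool],     # hypothetical: runs tests, True if they pass
    max_iterations: int = 50,
) -> None:
    """Loop until no task in prd.json is still 'todo', or the iteration cap is hit."""
    prd_path = workdir / "prd.json"          # assumed schema: [{"id", "title", "status"}]
    progress_path = workdir / "progress.txt"
    for _ in range(max_iterations):
        tasks = json.loads(prd_path.read_text())
        # Step 1: pick the next unfinished task.
        task = next((t for t in tasks if t["status"] == "todo"), None)
        if task is None:
            break
        # Step 2: build a prompt from the task plus the notes persisted on disk.
        notes = progress_path.read_text() if progress_path.exists() else ""
        prompt = f"Task: {task['title']}\n\nNotes so far:\n{notes}"
        # Steps 3 and 4: call the agent, then run the checks.
        try:
            summary = call_agent(prompt)
            status = "done" if run_checks() else "failed"
        except Exception as exc:
            summary, status = f"agent error: {exc}", "blocked"
        # Step 5: append what happened to the progress file.
        with progress_path.open("a") as f:
            f.write(f"[{task['id']}] {status}: {summary}\n")
        # Step 6: update the task list on disk, then loop (step 7).
        task["status"] = status
        prd_path.write_text(json.dumps(tasks, indent=2))
```

In this sketch a failed or blocked task is not retried; a real loop would likely add a retry count or a way to reset a task to todo.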

The reason it works is the same reason any of the harnesses below work: State lives outside the agent’s context. prd.json is the plan, progress.txt is the lab notes, and AGENTS.md is the rolling rulebook. The agent itself is amnesiac, but the filesystem isn’t. Each iteration starts fresh and reads enough state from disk to keep going. Carson’s Compound Product extends the idea by chaining multiple loops (an analysis loop that reads daily reports, a planning loop that emits a PRD, an execution loop that writes the code), which is roughly the open source version of the planner-generator-evaluator triad Anthropic landed on independently.

I went deeper on all of this in “Self-Improving Coding Agents”: task list structure, progress files, QA gates, monitoring, the failure modes you’ll actually hit. The short version is that you can build a working long-running agent in an evening with a bash script and a JSON file. Most of what Google and Anthropic have productized is the work of making this pattern recoverable, secure, and observable at scale.

The big-lab stories below are different ways of paying for that production-readiness.

Anthropic: Harnesses, then the brain/hands/session split

Anthropic has been the most public about the engineering. Two posts are worth reading end to end.

The first is “Effective Harnesses for Long-Running Agents,” which lays out a two-agent harness for autonomous full stack development. An initializer agent runs once at the start of a project to set up the environment, expand the prompt into a structured feature-list.json, and write an init.sh that future sessions will run on boot. A coding agent is then woken up over and over, each session asked to make incremental progress on one feature, run tests, leave a claude-progress.txt note, and commit. A test ratchet (“it’s unacceptable to remove or edit tests because this could lead to missing or buggy functionality”) sits in the prompt to stop the very common failure of an agent deleting failing tests to “make them pass.” InfoQ’s writeup extends this into a planner, generator, and evaluator triad, on the same logic that separating generation from evaluation matters because models grade their own work too generously.

The second is “Scaling Managed Agents: Decoupling the Brain from the Hands,” the architectural post behind Claude Managed Agents (Anthropic’s hosted runtime, launched in early April). The argument is that an agent has three components that should be independently replaceable. The Brain is the model and the harness loop that calls it. The Hands are sandboxed, ephemeral execution environments where tools actually run. The Session is an append-only event log of every thought, tool call, and observation.

This sounds abstract, but it isn’t. Here’s Anthropic’s framing: “Every component in a harness encodes an assumption about what the model can’t do on its own.” When you couple them, an assumption that goes stale (e.g., the model used to need an explicit planner and now plans natively) means the whole system has to change at once. When you decouple them, the harness becomes stateless, sandboxes become cattle, not pets, and a brain crash doesn’t lose the run. A fresh container calls wake(sessionId) and reconstitutes the state from the log. They reported time-to-first-token dropped ~60% at p50 and over 90% at p95 just from being able to start inference before the sandbox is ready.

The session-as-event-log idea is the part most teams underappreciate. It’s what makes a long-running agent recoverable. Without it, a container failure is a session failure and you’re debugging into a stale snapshot. With it, the agent’s memory is a queryable artifact that lives outside whatever process happens to be running at the moment.
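
As a rough illustration of the idea, here is a small Python sketch of an append-only log stored as JSON Lines, plus a wake function that rebuilds working state by replaying it. This is not Anthropic’s API: the SessionLog class, the event kinds (thought, tool_result), and the shape of the rebuilt state are all hypothetical, and this wake takes a log object rather than a session ID.

```python
import json
from pathlib import Path


class SessionLog:
    """Append-only event log, one JSON object per line (hypothetical format)."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def append(self, kind: str, payload: dict) -> None:
        # Events are only ever appended, never rewritten in place.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps({"kind": kind, "payload": payload}) + "\n")

    def events(self) -> list[dict]:
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def wake(log: SessionLog) -> dict:
    """Rebuild working state in a fresh process by replaying the log from the start."""
    state = {"completed_tools": [], "last_thought": None}
    for event in log.events():
        if event["kind"] == "thought":
            state["last_thought"] = event["payload"]["text"]
        elif event["kind"] == "tool_result":
            state["completed_tools"].append(event["payload"]["tool"])
    return state
```

Because the state is derived entirely from the log, any process that can read the file can resume the run, which is the recoverability property the paragraph above describes.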

For the scientific computing crowd, Anthropic’s “long-running Claude” post reduces all of this to a simpler stack: CLAUDE.md as a living plan the agent edits as it learns, CHANGELOG.md as portable lab notes, tmux plus SLURM plus git as the execution and coordination layer, and the Ralph loop, a for loop that kicks the agent back into context whenever it claims completion and asks if it’s really done. Their flagship case study is a Boltzmann solver Claude Opus 4.6 built over several days that reached subpercent agreement with a reference CLASS implementation. Months to years of researcher time, compressed.

The same patterns appear across all three posts: an explicit plan file, an explicit progress file, structured handoffs between sessions, generation separated from evaluation, and a loop that refuses to let the agent stop early.

Cursor: Planners, workers, judges

Cursor’s “Scaling Long-Running Autonomous Coding” is the other essential read this year. They walked into walls that Anthropic mostly papered over.

Their first attempt was a flat coordination model: equal-status agents writing to shared files with locks. It became a bottleneck and made the agents risk averse, churning rather than committing. Their second attempt swapped locks for optimistic concurrency control, which removed the bottleneck but didn’t fix the coordination problem. The third design is what’s running in production now and what they describe as solving most of the problem:

Planners continuously explore the codebase and emit tasks. They can recursively spawn subplanners.
Workers are focused executors. They don’t coordinate with each other and they don’t worry about the big picture.
Judges decide when an iteration is done and when to restart.

Two things stand out from the post. One: “A surprising amount of the system’s behavior comes down to how we prompt the agents” more than the harness or the model. Two: Different models slot into different roles. Their reported finding is that a GPT model was better than Opus for extended autonomous work specifically because Opus tended to stop early and take shortcuts. Same task, different role, different model. The matching is becoming part of the design surface.

This pairs with Composer 2 (their proprietary frontier coding model that ships in Cursor 3) and their background cloud agents: long-running tasks that run on Anysphere’s cloud infrastructure rather than your laptop. Eight-hour refactors and codebase-wide migrations survive a closed lid. You can start a task locally, hit run in cloud when you realize it’ll take 30 minutes, and reattach later from your phone. Each agent runs in an isolated Git worktree and merges back via PR. The handoff between local and remote is the part most teams haven’t figured out yet, and Cursor’s bet is that it needs to be its own product surface.

The shape ends up close to Anthropic’s: Roles are split, sessions are durable, judges sit beside the worker, and a long task runs in a cloud sandbox with Git as the coordination substrate.

Google: Long-running agents on the Agent Platform

Google’s announcement at Cloud Next ’26 folded Vertex AI into the Gemini Enterprise Agent Platform and turned long-running agents into a named product, with named SLAs.

The pieces that matter for this post:

Agent Runtime supports agents that “run autonomously for days at a time” with sub-second cold starts and on-demand sandbox provisioning. The launch post’s example use case is a sales prospecting sequence that takes a week to play out, which is roughly the right shape for it.
Agent Sessions persist conversation and event history. You can pin them to a custom session ID that maps to your own CRM or DB record, so the agent’s state lives next to the business state instead of in a separate AI silo.
Agent Memory Bank is the persistent long-term memory layer, generally available as of Next ’26. It curates memories from sessions, scopes them to a user identity, and exposes a search API so the next agent invocation can pull what’s relevant. Payhawk reported that auto-submitting expenses through a Memory Bank-backed agent cut submission time by over 50%.
Agent Sandbox handles hardened code execution.
Agent-to-Agent Orchestration, Agent Registry, Agent Identity, Agent Gateway, Agent Observability, and Agent Simulation cover basically every operational concern you’d otherwise build by hand for a production fleet, including the cryptographic-identity-and-audit-log story enterprises actually need to ship.

Architecturally this is the same brain/hands/session split Anthropic described, just productized at platform scale and bundled with ADK (the code-first dev kit) and Agent Studio (the visual one). If you’re building inside Google Cloud, you don’t have to design a session log or a memory store from scratch anymore. You wire an ADK agent into Memory Bank and Sessions, deploy onto Agent Runtime, and the persistence question is answered.

Notice how much this looks like the pattern Anthropic and Cursor describe, just unbundled into named services with SLAs. Three years ago you’d have built all of this yourself. Now you pick which version of “decoupled brain, hands, and session” you want to rent.

Five patterns for long-running agents in production

Shubham Saboo and I wrote up five design patterns we’ve seen separate working long-running agents from demos. They aren’t Google-specific, but they map cleanly onto the primitives Agent Runtime now exposes, so it’s worth walking through them here in shortened form.

Checkpoint-and-resume. The most common multiday failure is context loss. An agent processes 200 documents over four hours, hits an error on document 201, and without a checkpoint you start from scratch. Treat the agent like a long-running server process: write intermediate state to disk, checkpoint every N units of work, recover from failures. The Agent Runtime sandbox gives you a persistent filesystem, but picking the right checkpoint granularity (not every step, not only the end) is on you.
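
A minimal sketch of checkpoint-and-resume in plain Python, assuming only a local filesystem. The function names, the checkpoint file format (a JSON object with a next_index field), and the handle callable are hypothetical; nothing here is specific to Agent Runtime.

```python
import json
import os
from pathlib import Path
from typing import Callable, Sequence


def save_checkpoint(path: Path, next_index: int) -> None:
    """Write to a temp file, then rename, so a crash never leaves a half-written file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"next_index": next_index}))
    os.replace(tmp, path)


def process_with_checkpoints(
    docs: Sequence[str],
    handle: Callable[[str], None],   # hypothetical per-document work; may raise
    checkpoint: Path,
    every: int = 10,                 # checkpoint granularity: every N documents
) -> int:
    """Process docs in order, resuming from the last checkpoint. Returns the start index."""
    start = 0
    if checkpoint.exists():
        start = json.loads(checkpoint.read_text())["next_index"]
    for i in range(start, len(docs)):
        handle(docs[i])
        if (i + 1) % every == 0 or i + 1 == len(docs):
            save_checkpoint(checkpoint, i + 1)
    return start
```

After a crash, up to every - 1 documents since the last checkpoint are processed again, so handle should be safe to repeat. That is the granularity trade-off: smaller values of every waste less work on failure but write to disk more often.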

Delegated approval (human-in-the-loop). Most “human-in-the-loop” implementations are: serialize state to JSON, fire a webhook, hope someone responds. The state goes stale, the notification gets buried, the agent re-deserializes into a slightly different world. Long-running runtimes let the agent pause in place with full execution state intact: reasoning chain, working memory, tool history, pending action. Hours of human time pass, the agent consumes zero compute, and it resumes with subsecond latency. Mission Control is Google’s inbox for this. The pattern works regardless of vendor.

Memory-layered context. A seven-day agent needs more than session state. Memory Bank handles long-term curated memory, Memory Profiles add low-latency lookups, and the failure mode you’ll hit in production is memory drift: The agent learns a procedural shortcut from a few atypical interactions and starts applying it broadly. Govern memory like you govern microservices. Agent Identity controls who can read and write which banks. Agent Registry tracks which version of which agent is running. Agent Gateway enforces policy on the wire. The auditing question stops being “What are my agents doing?” and becomes “What are my agents remembering, and how is that changing their behavior?”

Ambient processing. Not every long-running agent talks to a human. Some sit on a Pub/Sub stream or a BigQuery table and act on events as they arrive: content moderation, anomaly detection, inbox triage. The architectural decision worth making early is to not hardcode policy into the agent. Define it in the Gateway and the fleet picks up policy changes without redeploys. Ambient agents run unsupervised for long stretches, and the only sane way to update 100 of them is to update the policy layer once.

Fleet orchestration. In real systems, you rarely have one agent. A coordinator delegates subtasks to specialists (a Lead Researcher Agent, a Scoring Agent, an Outreach Agent), each running independently for different durations. Each specialist gets its own Identity (so the Outreach Agent can’t read financial data meant for Scoring), its own policy enforcement, its own Registry entry. This is the same coordinator/worker shape distributed systems have used for decades. What’s new is that ADK handles it declaratively with graph-based workflows, and a bad deployment in one specialist doesn’t cascade to the others.

The patterns compose. A compliance system might use checkpointing for document processing, delegated approval for review gates, memory layering for cross-session knowledge, and fleet orchestration to coordinate the specialists. The opening question is always the same: What’s the longest uninterrupted unit of work your agent needs to perform? Minutes, and you don’t need long-running agents. Hours or days, and these patterns are the place to start. The full write-up with code samples covers each pattern in depth.

So how do you actually build one today?

This is the practical question, and it has a different answer depending on what you’re building.

You’re a developer who wants long-running coding work on your own repo. Just use Claude Code (or Antigravity, Cursor, or Codex). The harness is already there. Treat your AGENTS.md like a pilot’s checklist: short, every line earned by a real failure. Add hooks for typecheck and lint that surface failures back to the agent. Write a plan file before the agent starts. Use the Ralph loop when the agent claims it’s done and you don’t believe it. For multihour or overnight jobs, run in a worktree so a closed laptop doesn’t kill the run, and have it commit progress every meaningful unit of work. This is the path most people should take, and it’s where the most leverage is right now.

You’re building a hosted agent product. Don’t build the runtime. Pick a managed one. The three real options today: Google’s Agent Platform (Agent Engine + Memory Bank + Sessions), Claude Managed Agents, or roll something on top of ADK, the Claude Agent SDK, or Codex SDK and host it yourself. The trade-off is the usual one. Managed gets you the brain/hands/session split, observability, identity, and an audit trail out of the box. Self-hosted gets you control and the ability to use weird models for weird roles (Cursor’s pattern). For most teams, the right starting point is a managed runtime plus your own ADK or SDK code for the actual loop.

You’re doing something autonomous and operational (monitoring, research, ops). Memory Bank-style persistence is what you want, and it’s the part that doesn’t exist in Claude Code. ADK + Memory Bank + Cloud Run + Cloud Scheduler is the cleanest stack I’ve seen for “agent runs every N hours, accumulates state, alerts on a threshold.” This is also where Cursor’s planner/worker/judge split starts to matter more than it does for IDE coding, because the work is genuinely parallel and the failure modes are different.

A few things matter regardless of which path you take.

Write down the done condition before the agent starts. This is the single highest-leverage move for long runs. The Anthropic harness post calls it the feature list; Cursor calls it the planner’s task spec. Either way, it’s an external file with explicit, testable completion criteria, and it exists so the agent can’t quietly redefine done midrun.

Separate the evaluator from the generator. Self-grading is the failure mode. A planner/worker/judge pipeline, or a generator/evaluator pair, is a real architectural pattern, not a stylistic preference. That holds even if it’s the same model in different roles with different prompts.
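
One way to picture the generator/evaluator pair is as two separate callables joined by a bounded loop. The Python sketch below is an assumption-laden illustration, not any vendor’s harness: generate and evaluate are hypothetical stand-ins for two differently prompted model calls (or a model call and a test suite), and max_rounds is an arbitrary cap.

```python
from typing import Callable, Optional


def generate_until_accepted(
    generate: Callable[[str, Optional[str]], str],   # (spec, feedback) -> draft
    evaluate: Callable[[str], tuple[bool, str]],     # draft -> (accepted, feedback)
    spec: str,
    max_rounds: int = 3,
) -> tuple[str, bool]:
    """Return (last draft, accepted). The generator never grades its own output."""
    feedback: Optional[str] = None
    draft = ""
    for _ in range(max_rounds):
        draft = generate(spec, feedback)
        accepted, feedback = evaluate(draft)
        if accepted:
            return draft, True
    # The evaluator never signed off: surface that instead of declaring success.
    return draft, False
```

The point of the structure is that only the evaluator’s verdict can end the loop with success; the generator’s own claim of completion is never consulted.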

Invest in the session log, not just the prompt. The append-only event log is what makes the agent recoverable, debuggable, and auditable. If you can’t reconstruct what the agent did in the last 24 hours from durable storage, what you have is a long-running shell script that happens to call an LLM, not a long-running agent.

Treat compaction and context resets as first class. Anthropic is explicit that summarization-as-compaction wasn’t enough for very long jobs; they had to do full context resets where the harness tears the session down and rebuilds it from a structured handoff file. It’s essentially how humans onboard a new engineer.

There are some real limitations right now

A few things are still genuinely unsolved.

Cost. A 24-hour run with a frontier model and a few tools is not cheap. Without budgets, circuit breakers, and a hard cap on tool spend, an agent can quietly burn through a week’s API budget in a day. This is solvable, but it’s an explicit step you have to take.
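
A hard cap can be as small as a guard object that every model or tool call has to pass through. The Python sketch below is one possible shape, with hypothetical names (SpendGuard, BudgetExceeded) and the assumption that the caller can estimate the dollar cost of each call before making it.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push total spend past the cap."""


class SpendGuard:
    def __init__(self, max_usd: float) -> None:
        self.max_usd = max_usd
        self.spent_usd = 0.0

    def charge(self, usd: float) -> None:
        """Record spend, refusing any charge that would cross the cap."""
        if self.spent_usd + usd > self.max_usd:
            raise BudgetExceeded(
                f"charge of ${usd:.2f} would exceed the ${self.max_usd:.2f} cap"
            )
        self.spent_usd += usd
```

The harness would call charge before each model or tool invocation and treat BudgetExceeded as a signal to checkpoint and stop, not as an error to retry.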

Security. A long-running agent with API keys, cloud access, and the ability to run shell commands has a much larger attack surface than a chat session. The brain/hands separation pattern matters here too: Credentials should be unreachable from the sandbox where model-generated code runs, which is one of the benefits Anthropic calls out for Managed Agents.

Alignment drift. Over many context windows, agents drift. The original goal gets summarized, then resummarized, then loses fidelity. This is the part hooks and judges exist to defend against. It is also the most common reason “the agent went off and did something I didn’t ask for.”

Verification. Auditing 24 hours of autonomous activity is a real human-time problem. Observability and structured artifacts (PRs, commits, briefings, test runs) are how you make this tractable. Without them, you’re scrolling logs and you’ll miss what matters.

The human role. This is the one I keep coming back to. Defining work crisply enough that an agent can run for a day on it is harder than doing the work yourself. The skill that’s appreciating in value isn’t writing code. It’s writing specs that survive contact with an autonomous executor.

Where this is going

Google, Anthropic, and Cursor have converged on roughly the same shape. Separate the model loop from the execution sandbox from the durable session log. Split planning from generation from evaluation. Bake in compaction, hooks, and context resets. Expose memory as a managed service that any agent invocation can query.

Surface area is what differs. Google’s Agent Platform is the enterprise-stack version, with the identity and audit trail story baked in. Claude Managed Agents is “Anthropic’s harness, hosted.” Cursor’s background agents are “long-running coding, pulled out of the IDE and into the cloud.” The patterns underneath are the same.

The harder problems for the next year aren’t in any of those layers individually. They’re in the coordination above them. Many long-running agents on a shared codebase. Agents that read their own traces and patch their own harnesses. Harnesses that assemble tools and context just in time for a task instead of being preconfigured at startup. That’s where the agent stops looking like a smarter chat window and starts looking like a colleague who’s been on the project longer than you have.

The model is still load-bearing. But the gap between a chat window and an agent you can leave running overnight is mostly in the state, sessions, and structured handoffs wrapped around it. That’s where I’d spend my learning time right now.
