Agent Harness Engineering – O’Reilly

May 16, 2026

21

This text was initially revealed on Addy Osmani’s weblog. It’s being reposted right here with the writer’s permission.

Roughly: Anytime you discover an agent makes a mistake, you are taking the time to engineer an answer such that the agent by no means makes that mistake once more.

We’ve spent the final two years arguing about fashions. Which one is smartest, which one writes the cleanest React, which one hallucinates much less. That dialog is okay so far as it goes, but it surely’s lacking the opposite half of the system. The mannequin is one enter right into a operating agent. The remainder is the harness: the prompts, instruments, context insurance policies, hooks, sandboxes, subagents, suggestions loops, and restoration paths wrapped across the mannequin so it may possibly really end one thing.

An honest mannequin with an incredible harness beats an incredible mannequin with a foul harness. I’ve watched this play out by myself work again and again. And more and more the fascinating engineering isn’t in selecting the mannequin; it’s in designing the scaffolding round it.

That self-discipline now has a reputation. Viv Trivedy coined the time period harness engineering, and his “Anatomy of an Agent Harness” publish is the cleanest derivation of what a harness really is and why every bit exists. Dex Horthy has been monitoring the sample because it emerges. HumanLayer frames most agent failures as “ability points” that come right down to configuration somewhat than mannequin weights. Anthropic’s engineering workforce has revealed what I feel is the most effective public breakdown of learn how to design a harness for long-running work. And Birgitta Böckeler has overview of what this appears like from the person’s aspect.

This publish is my try to tug these threads collectively.

What’s a harness, actually?

Viv’s one-liner does many of the work:

Agent = Mannequin + Harness. Should you’re not the mannequin, you’re the harness.

A harness is each piece of code, configuration, and execution logic that isn’t the mannequin itself. A uncooked mannequin is just not an agent. It turns into one as soon as a harness offers it state, device execution, suggestions loops, and enforceable constraints.

The model is one chip on the board. The harness is everything else that makes it useful.

Concretely, a harness consists of:

System prompts, CLAUDE.md, AGENTS.md, ability information, and subagent prompts
Instruments, abilities, MCP servers, and their descriptions
Bundled infrastructure (filesystem, sandbox, browser)
Orchestration logic (subagent spawning, handoffs, mannequin routing)
Hooks and middleware for deterministic execution (compaction, continuation, lint checks)
Observability (logs, traces, value and latency metering)

Simon Willison reduces the loop half to its essence: an agent is a system that “runs instruments in a loop to realize a objective.” The ability is within the design of each the instruments and the loop.

If that seems like lots of floor space, it’s. And it’s your floor space, not the mannequin supplier’s. Claude Code, Cursor, Codex, Aider, Cline: These are all harnesses. The mannequin beneath is typically the identical, however the habits you expertise is dominated by what the harness does.

coding agent = AI mannequin(s) + harness

This equation, articulated by Viv and echoed by HumanLayer, is the place the work really lives. The controversy over the left-hand aspect is loud. Many of the precise leverage sits on the proper.

The “ability challenge” reframe

There’s a sample I watch engineers fall into. The agent does one thing dumb, the engineer blames the mannequin, and the blame will get filed underneath “look forward to the following model.”

The harness-engineering mindset rejects that default. The failure is often legible. The agent didn’t find out about a conference, so that you add it to AGENTS.md. The agent ran a harmful command, so that you add a hook that blocks it. The agent bought misplaced in a 40-step process, so that you cut up it right into a planner and an executor. The agent stored “ending” damaged code, so that you wire a typecheck back-pressure sign into the loop.

HumanLayer says: “It’s not a mannequin drawback. It’s a configuration drawback.” Harness engineering is what occurs if you take that severely.

There’s a putting information level that reveals up in each Viv’s write-up and HumanLayer’s. On Terminal Bench 2.0, Claude Opus 4.6 operating inside Claude Code scores far decrease than the identical mannequin operating in a customized harness. Viv’s workforce moved a coding agent from Prime 30 to Prime 5 by altering solely the harness. Fashions get posttraining coupled to the harness they have been skilled in opposition to. Transferring them into a unique harness, with higher instruments to your codebase, a tighter immediate, and sharper backpressure, can unlock functionality the unique harness was leaving on the ground.

That is the other of the “simply look forward to GPT-6” narrative. The hole between what at this time’s fashions can do and what you see them doing is basically a harness hole.

The ratchet: Each mistake turns into a rule

Crucial behavior in harness engineering is treating agent errors as everlasting indicators. Not one-off tales to snort about, not “unhealthy runs” to retry. Indicators.

If the agent ships a PR with a commented-out check and I merge it accidentally, that’s an enter. The subsequent model of my AGENTS.md says “by no means remark out assessments; delete them or repair them.” The subsequent model of my precommit hook greps for .skip( and xit( within the diff. The subsequent model of my reviewer subagent flags commented-out assessments as a blocker.

You solely add constraints if you’ve seen an actual failure. You solely take away them when a succesful mannequin has made them redundant. Each line in AGENTS.md needs to be traceable again to a selected factor that went improper.

That is additionally why harness engineering is a self-discipline somewhat than a framework. The correct harness to your codebase is formed by your failure historical past. You possibly can’t obtain it.

Working backward from habits

The framing from Viv that I discover most helpful after I’m really designing a harness is to begin from the habits you need and derive the harness piece that delivers it. His sample: habits we would like (or wish to repair) → harness design to assist the mannequin obtain this.

Every harness feature is a bridge across a specific thing the model can't do on its own

The helpful factor about deriving it this manner is that each harness element has a selected job. Should you can’t title the habits a element exists to ship, it most likely shouldn’t be there.

The remainder of this part walks the items in roughly the order Viv does, with the particular patterns I’ve discovered price stealing.

Filesystem and Git: Sturdy state

The filesystem is probably the most foundational primitive, and it tends to be underrated as a result of it’s boring. Fashions can solely straight function on what matches in context. With out a filesystem, you’re copy-pasting right into a chat window, and that isn’t a workflow.

After you have a filesystem, the agent will get a workspace to learn information, code, and docs; a spot to dump intermediate work as a substitute of holding it in context; and a floor the place a number of brokers and people can coordinate by shared information. Including Git on high offers you versioning without spending a dime, so the agent can observe progress, roll again errors, and department experiments.

Many of the different harness primitives find yourself pointing on the filesystem for one thing.

Bash and code execution: The overall-purpose device

The principle agent loop at this time is a ReAct loop: The mannequin causes, takes an motion through a device name, observes the end result, and repeats. However a harness can solely execute the instruments it has logic for. You possibly can attempt to prebuild a device for each potential motion, otherwise you may give the agent bash and let it construct the instruments it wants on the fly.

Willison’s tackle that is that brokers already excel at shell instructions; most duties collapse to a couple well-chosen CLI invocations. Harnesses nonetheless ship centered instruments, however bash plus code execution has develop into the default general-purpose technique for autonomous drawback fixing. It’s the distinction between educating somebody to make use of a single kitchen gadget and handing them a kitchen.

Sandboxes and default tooling

Bash is just helpful if it runs someplace secure. Working agent-generated code in your laptop computer is dangerous, and a single native setting doesn’t scale to many parallel brokers.

Sandboxes give brokers an remoted working setting. As an alternative of executing domestically, the harness connects to a sandbox to run code, examine information, set up dependencies, and confirm work. You possibly can allow-list instructions, implement community isolation, spin up new environments on demand, and tear them down when the duty is completed.

A great sandbox ships with good defaults: preinstalled language runtimes and packages, Git and check CLIs, a headless browser for net interplay. Browsers, logs, screenshots, and check runners are what let the agent observe its personal work and shut the self-verification loop.

The mannequin doesn’t configure its execution setting. Deciding the place the agent runs, what’s out there, and the way it verifies its output are all harness-level calls.

Reminiscence and search: Continuous studying

Fashions don’t have any extra information past their weights and what’s at the moment in context. With out the power to edit weights, the one approach so as to add information is thru context injection.

The filesystem is once more the primitive. Harnesses assist reminiscence file requirements like AGENTS.md that get injected on each begin. Because the agent edits that file, the harness reloads it, and information from one session carries into the following. This can be a crude however efficient type of continuous studying.

For information that didn’t exist at coaching time (new library variations, present docs, at this time’s information), net search and MCP instruments like Context7 bridge the cutoff. These are helpful primitives to bake into the harness somewhat than leaving to the person.

Battling context rot

Context rot is the statement that fashions worsen at reasoning and finishing duties because the context window fills up. Context is scarce, and harnesses are largely supply mechanisms for good context engineering.

Three strategies present up repeatedly:

Compaction. When the window will get near full, one thing has to offer. Letting the API error is just not an choice for a manufacturing harness, so the harness intelligently summarizes and offloads older context so the agent can maintain working.

Device-call offloading. Massive device outputs (suppose 2,000-line log information) litter context with out including a lot sign. The harness retains the pinnacle and tail tokens above a threshold and offloads the complete output to the filesystem, the place the agent can learn it on demand.

Abilities with progressive disclosure. Loading each device and MCP into context at startup degrades efficiency earlier than the agent takes a single motion. Abilities let the harness reveal directions and instruments solely when the duty really requires them.

Anthropic’s harness publish provides yet another approach for the actually lengthy jobs: full context resets, the place the harness tears the session down and rebuilds it from a compact handoff file. They’re express that compaction alone wasn’t enough for lengthy duties; generally it’s essential begin contemporary with a structured transient. That is nearer to how people onboard a brand new engineer than to how we often take into consideration “reminiscence.”

Lengthy-horizon execution: Ralph loops, planning, verification

Autonomous long-horizon work is the holy grail and the toughest factor to get proper. Right now’s fashions endure from early stopping, poor decomposition of advanced issues, and incoherence as work stretches throughout a number of context home windows. The harness has to design round all of that.

I’ve written about autonomous coding loops just like the Ralph loop earlier than in self-improving brokers and in my 2026 traits piece, but it surely’s price restating on this framing: A hook intercepts the mannequin’s try to exit and reinjects the unique immediate right into a contemporary context window, forcing the agent to proceed in opposition to a completion objective. Every iteration begins clear however reads state from the earlier one by the filesystem. It’s a surprisingly easy trick for turning a single-session agent right into a multisession one, and it’s the type of primitive you’d by no means derive from “simply use a wiser mannequin.”

Planning is when the mannequin decomposes a objective right into a sequence of steps, often right into a plan file on disk. The harness helps this with prompting and reminders about learn how to use the plan file. After every step, the agent checks its work through self-verification: Hooks run a predefined check suite and loop failures again to the mannequin with the error textual content, or the mannequin evaluations its personal output in opposition to express standards.

Planner/generator/evaluator splits. Anthropic’s long-running harness work is express that separating era from analysis into distinct brokers outperforms self-evaluation, as a result of brokers reliably skew constructive when grading their very own work. It’s GANs for prose. The associated sample is the dash contract, the place the generator and evaluator negotiate what “performed” really means earlier than code will get written. In my very own workflows, writing down the performed situation earlier than beginning has caught extra scope drift than any immediate change I’ve ever made.

Hooks: The enforcement layer

Hooks are what separate “I instructed the agent to do X” from “the system enforces X.”

A hook is a script that runs at a selected lifecycle level: earlier than a device name, after a file edit, earlier than commit, on session begin. They’re the proper place for issues the agent ought to always remember however usually does. Run typecheck and lint and assessments after each edit and floor failures. Block harmful bash (rm -rf, git push --force, DROP TABLE). Require approval earlier than opening a PR or pushing to important. Auto-format on write so the agent doesn’t waste tokens on whitespace.

The precept HumanLayer highlights and I’ve come to agree with is: Success is silent; failures are verbose. If typecheck passes, the agent hears nothing. If it fails, the error textual content will get injected into the loop and the agent self-corrects. That makes the suggestions loop nearly free within the widespread case and straight actionable when one thing goes improper.

AGENTS.md and power alternative

The flat markdown rulebook on the root of your repo remains to be the one highest-leverage configuration level, as a result of it lands within the system immediate each flip. Conventions go right here: bundle supervisor, check framework, formatting, “by no means contact /legacy,” “at all times use our logger.” Two hard-won classes:

Maintain it quick. HumanLayer retains theirs underneath 60 traces. Each line is competing for consideration, and extra guidelines make every rule matter much less. Pilot’s guidelines, not model information.

Earn every line. Guidelines ought to hint to a selected previous failure or a tough exterior constraint. In the event that they don’t, they’re noise. Ratchet; don’t brainstorm.

Identical self-discipline applies to instruments. Every device’s title, description, and schema will get stamped into the immediate each request. Ten centered instruments outperform fifty overlapping ones as a result of the mannequin can maintain the menu in its head. HumanLayer additionally flags an actual safety concern right here: device descriptions populate the immediate, so any MCP server you put in is trusted textual content the mannequin will learn. A sloppy or malicious MCP can prompt-inject your agent earlier than you’ve typed something.

What this appears like in manufacturing

The clearest public image I’ve seen of a mature harness is Fareed Khan’s (estimated) breakdown of Claude Code’s structure.

Nearly each idea from the earlier part reveals up on this diagram as a named element. Context injection is the information layer. Loop state lives within the reminiscence retailer and the worktree isolator. Harmful-action hooks sit behind the permission gate. Subagent context firewalls are your complete multi-agent layer. The device dispatch registry is the place MCP servers and bash each plug in. Khan’s argument is identical as Viv’s, simply labored by a delivery product: Claude Code’s trajectory is concerning the harness at the least as a lot as concerning the mannequin beneath it.

Harnesses don’t shrink; they transfer

One of many higher observations within the Anthropic write-up is that as fashions enhance, the area of fascinating harness mixtures doesn’t shrink. It strikes.

The naive story is that higher fashions make harnesses out of date. If the mannequin can plan, no planner. If the mannequin is coherent at lengthy horizons, no context resets. And sure, Opus 4.6 largely killed the context-anxiety failure mode (Sonnet 4.5 used to wrap up work prematurely because it approached what it thought was its context restrict), which implies an entire class of anxiety-mitigation scaffolding I used to be writing six months in the past is now useless code.

However the ceiling moved with the mannequin. Duties that have been unreachable are in play, and so they have their very own failure modes. The anxiousness scaffolding goes away, and as an alternative you want a multiday reminiscence coverage or a harness that coordinates three specialised brokers or evaluators for design high quality in generated UIs. The assumptions shift, and so does the scaffolding that encodes them.

Anthropic places it cleanly: “Each element in a harness encodes an assumption about what the mannequin can’t do by itself.” When the mannequin will get higher at one thing, that element turns into load-bearing for nothing and will come out. When the mannequin unlocks one thing new, new scaffolding is required to achieve the brand new ceiling.

The model-harness coaching loop

The opposite factor that’s taking place, which Viv names explicitly, is a suggestions loop between harness design and mannequin coaching.

Right now’s agent merchandise are posttrained with harnesses within the loop. The mannequin will get particularly higher on the actions the harness designers suppose it needs to be good at: filesystem operations, bash, planning, subagent dispatch. That’s why Opus 4.6 feels totally different inside Claude Code than inside another person’s harness, and it’s why altering a device’s logic generally causes unusual regressions. A genuinely common mannequin wouldn’t care whether or not you used apply_patch or str_replace, however cotraining creates overfitting.

The sensible implication is twofold. A harness is a dwelling system, not a config file you arrange as soon as. And the “greatest” harness isn’t essentially the one the mannequin was skilled inside; it’s the one designed to your process. Viv’s Prime 30 to Prime 5 Terminal Bench leap is the clearest proof level I’ve seen.

Harness as a service

Viv’s different contribution is the HaaS framing: harness as a service. The statement is that we’re transferring from constructing on LLM APIs (which offer you a completion) to constructing on harness APIs (which offer you a runtime). The Claude Agent SDK, the Codex SDK, and the OpenAI Brokers SDK all level in the identical course. You get the loop, the instruments, the context administration, the hooks, and the sandbox primitives out of the field, and also you customise them.

The shift issues as a result of the default path was once: construct your personal loop, wire up your personal tool-calling, deal with your personal dialog state, invent your personal approval movement. Now the default path is: decide a harness framework, configure it alongside the 4 pillars (system immediate, instruments, context, subagents), and put the remainder of your effort into domain-specific immediate and power design.

That’s what makes “ability challenge” tractable. You’re not rebuilding an agent from scratch each time one thing goes improper. You’re tuning a configuration floor that’s already well-factored.

Viv’s line on that is additionally the most effective argument for beginning messy: “Good agent constructing is an train in iteration. You possibly can’t do iterations if you happen to don’t have a v0.1.”

The place that is going

Take a look at the highest coding brokers aspect by aspect (Claude Code, Cursor, Codex, Aider, Cline) and they give the impression of being extra like one another than their underlying fashions do. The fashions are totally different. The harness patterns are converging. I don’t suppose that’s an accident. It’s the business slowly discovering the load-bearing items of scaffolding that flip a generative mannequin into one thing that may ship.

Viv’s framing of the open issues is the one I discover most fun: orchestrating many brokers working in parallel on a shared codebase; brokers that analyze their very own traces to determine and repair harness-level failure modes; harnesses that dynamically assemble the proper instruments and context just-in-time for a given process as a substitute of being preconfigured at startup.

That final one, particularly, appears like the place harnesses cease being static config and begin changing into one thing nearer to a compiler.