My Outer Loop | Cocoanetics

May 11, 2026


There’s a thing that happens, after a few weeks of working with coding agents at a steady pace, where you stop thinking of yourself as the person typing and start thinking of yourself as the person seeing. The Latin word for vision is visio, from videre, “to see”; the Italian visione and English vision both keep that. It’s a much older idea than the modern “mission statement on a slide” usage. It means: I have, in my head, a picture of where this should go.

That picture is the thing I’m responsible for, now. The typing has been outsourced.

This blog post is an attempt to write down, honestly, what my day-to-day looks like a few weeks into running this experiment at full pace. It is also, as a worked example, the story of how SwiftBash, SwiftScript, SwiftPorts, and SwiftJS (four projects you’ve read about here, each with its own announcement post) finally clicked together into a single thing this weekend. It isn’t a post about any of those four projects in particular. They have their own posts. It’s a post about the loop that built them.

The role of the experienced engineer

The thing I don’t have to do anymore is type. Not the code, not the tests, not the issue text, not the PR description, not the commit messages, not the review responses. All of that is agent work now. What I bring is upstream of all of it: the picture in my head of where these pieces want to fit, what shape the seams between them need to be, which abstraction belongs in which package, what doesn’t yet exist but is going to need to.

It turns out a coding agent is brilliant at the mechanics, including the writing-things-down mechanics, and almost entirely without opinion about which thing should be built. You have to bring the opinion. You have to have, very clearly, a picture of what good looks like, because the agent will happily produce code (and issues, and PRs) that look plausible all the way to the test failures, and your only edge is that you know what the answer should roughly resemble before the agent starts.

That’s the experienced-software-engineer part. The vision. The visio.

What follows is, structurally, my outer loop. It runs all day. There are usually two or three of these going in parallel across different repos.

The loop, step by step

1. An idea gets fleshed out into a GitHub issue.

The idea is mine. Everything else about the issue isn’t. I describe what I’m seeing (sometimes a sentence, sometimes a paragraph, often just a half-formed nudge) and an agent does the research, reads the surrounding code, sketches alternatives, asks me clarifying questions, and eventually writes the full issue text: motivation, current state, proposed shape, acceptance criteria, out-of-scope. I edit. We go a few rounds. The acceptance bullets matter the most: they’re the answer key for everything that comes later. Without them, the agent that picks the issue up has nothing to hill-climb against. With them, almost everything else is mechanical.

The issues in the SwiftBash repo, the SwiftPorts one, and the new ShellKit repo all look like this. They’re long. They’re long on purpose. And essentially none of the prose in them was typed by me.

2. An agent picks up the issue and works on it locally.

Local development, local tests, a branch, eventually a pull request. The agent runs the test suite before opening the PR. Most of the time it’s green when I see it.

This is the step where I have to be present the most, even though it looks like it should be the most autonomous. The agent will hit a junction (usually a “should this live in package A or package B?”, or “I see two ways to model this; the type system doesn’t decide between them”) and stop and ask. Eighty percent of those questions are answered with “do it” or “make it so” (Picard, on the bridge of the Enterprise, has been a remarkably useful role model for this kind of work). But the other twenty percent are taste questions: I can see a simpler path the agent didn’t consider, and if I don’t tell it, Opus will earnestly produce the more elaborate one. Without the human at this junction, the agent overbuilds. Quietly, plausibly, but it overbuilds.

The frustrating shape this takes is that the agent will be working away while I’m at lunch or asleep, and at some point it will hit one of these questions, stop, and wait. I come back to the keyboard and realise nothing has happened in the last hour because of a silly clarification it could have asked anyone. There is, right now, no good answer to “agent needs human input but human is not at the keyboard”. This is the part of the loop I’m not yet sure how to tighten.

3. Codex reviews the PR.

This is the step I’m most surprised I came to depend on. I have Codex configured to review every PR I open, and the comments are useful often enough that I read every single one. Maybe it’s a missing edge case. Maybe a path that wasn’t shell-quoted. Maybe an HTTP method that wasn’t in the allow-list. Maybe a Shell.current that should have been read but wasn’t.

The agent that wrote the PR addresses each comment: 👍 for the good catches (the majority; these reviews are a real second pair of eyes), 👎 for the false positives (rare, but they happen), and the conversation gets marked resolved. I read along.

4. CI runs on every commit. Five platforms.

Every push, GitHub Actions runs the full build-and-test matrix on macOS, iOS, Linux, Android, and Windows. The ambition, for SwiftBash and the surrounding projects, is that all five stay green forever. The CI configuration that gets us there is its own story, told in Four Green Checkmarks. It’s now five.

GitHub Actions is, to be honest, not fast. A Windows job can take half an hour; the full matrix takes longer. It’s also not free at scale: the only reason this loop is economically viable for me is that all of these projects are open source, and GitHub gives unlimited Actions minutes to public repositories. That’s the deal that makes the whole thing work, and it’s the reason I’m unlikely to start a closed-source experiment any time soon. The big payoff: I don’t have to maintain a Windows box, an Android emulator, or a Linux VM on my desk. I write Swift on a Mac and watch four other platforms tell me whether I broke them.

5. The agent watches CI and reacts.

Some platforms are well-behaved. Some, Windows in particular, have opinions. A test that quotes a path with forward slashes passes everywhere except where backslashes survive bash quoting differently. A POSIX exec-bit check is meaningless on a filesystem that doesn’t have one. getpid() is deprecated under WinSDK. BOOL is bridged differently. getaddrinfo lives behind a different import.

Each of these is small. Most of them are five-line fixes once you understand the platform quirk. The art is fixing them in a way that doesn’t un-fix the other four platforms.
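To make the quoting quirk concrete, here is a small Python sketch. It is not taken from the SwiftBash test suite: the path is made up, and the standard library's shlex module stands in for bash's word-splitting rules. It shows why a backslash path needs quoting where a forward-slash path does not.

```python
import shlex

# Hypothetical Windows-style path, chosen only for illustration.
windows_path = r"C:\tmp\file.txt"

# Unquoted: POSIX word splitting treats each backslash as an escape
# character and drops it, so the path is silently mangled.
unquoted = shlex.split("cat " + windows_path)

# Quoted: shlex.quote wraps the path in single quotes, inside which
# backslashes are literal, so the path survives the round trip.
quoted = shlex.split("cat " + shlex.quote(windows_path))

# Forward slashes need no escaping, which is why a test can pass on
# every platform that never produces a backslash.
forward = shlex.split("cat /tmp/file.txt")

print(unquoted[1])
print(quoted[1])
print(forward[1])
```

The fix on the quoting side is the same on every platform, which is what makes it safe for the other four: always quote, never assume the separator.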

6. Hill-climbing.

This is the part of the day where the agent and I are at our most useful to each other. The success criterion is binary (all five checkmarks green) and there’s a finite stack of small, mechanical, platform-shaped failures to grind through. The agent reads the CI log, identifies the platform quirk, makes the fix, pushes, and waits. A Windows CI run can take up to thirty minutes. Sometimes a single iteration takes one fix; sometimes a fix surfaces a new failure beneath. You climb. You watch the altimeter.

This is where Opus is at its most quietly impressive. It isn’t glamorous work. It’s patient, specific, mechanical work: exactly the kind of work that humans get bored with and start cutting corners on after the third Windows-only branch. The agent doesn’t get bored. As long as I keep the success criterion sharp (“five green, no continue-on-error shortcuts”), it keeps climbing. The recent push to lift Windows from “advisory” to “committed” (a few dozen platform-specific fixes, ending with a five-line workflow change to delete the continue-on-error: true gates) happened in a single focused stretch on the evening of May 8, somewhere between dinner and bedtime. About two and a half hours, end to end, against build steps that take half an hour each.
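The “five green, no continue-on-error shortcuts” criterion is simple enough to write down as a check. What follows is a minimal Python sketch, not the tooling I actually run: the Altimeter class, its field names, and the sample workflow text are all hypothetical. The only real details are the five platform names, the `continue-on-error: true` key, and the "success" conclusion value that GitHub reports for a passing job.

```python
from dataclasses import dataclass

# The five platforms in the CI matrix described above.
PLATFORMS = ("macOS", "iOS", "Linux", "Android", "Windows")


@dataclass
class Altimeter:
    """Hypothetical summary of one CI run plus the workflow that produced it."""
    conclusions: dict   # platform name -> job conclusion, e.g. "success"
    workflow_text: str  # raw text of the workflow file

    def failing(self):
        # A platform with no reported conclusion counts as not green.
        return [p for p in PLATFORMS if self.conclusions.get(p) != "success"]

    def shortcuts(self):
        # 1-based line numbers where a job or step is allowed to fail silently.
        hits = []
        for number, line in enumerate(self.workflow_text.splitlines(), start=1):
            code = line.split("#", 1)[0]  # ignore trailing YAML comments
            if "continue-on-error:true" in code.replace(" ", ""):
                hits.append(number)
        return hits

    def is_summit(self):
        # Green everywhere AND no gate that would hide a failure.
        return not self.failing() and not self.shortcuts()


workflow = """\
jobs:
  windows:
    runs-on: windows-latest
    continue-on-error: true   # advisory gate
"""
reading = Altimeter(
    conclusions={p: "success" for p in PLATFORMS},
    workflow_text=workflow,
)
print(reading.failing())
print(reading.shortcuts())
print(reading.is_summit())
```

In this sample every job reports success, yet the run still does not count as the summit, because the advisory gate on line 4 could be hiding a Windows failure. That is the difference between “advisory” and “committed”.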

7. Five green checkmarks. Merge.

Repeat.

A cousin loop: external PRs

Increasingly the PRs I’m looking at are not from agents I started. They’re from external contributors. And those contributors, increasingly, are using coding agents themselves.

This is a strange and slightly recursive new pattern. I have one of my coding agents review the incoming PR, sometimes with a few questions of my own sprinkled in, and the conversation that emerges is, in effect, two or three coding agents and two humans collaborating on the shape of a change. Sometimes the PR has missed a design consideration I had in my head; I’ll ask for changes. Sometimes one PR has lumped together three separable improvements; I’ll ask for it to be split. Sometimes the PR is just good (five green checkmarks, the design fits, Codex is happy) and I merge.

That is also part of the outer loop. The vision-holding extends across the boundaries of the repo.

The example: how SwiftBash and friends clicked together

Now the worked example. I’ll keep it deliberately high-level; each of these projects has its own announcement post for the gory detail.

Two weeks ago there was SwiftBash: a sandboxed bash interpreter, in pure Swift, with no Process and no fork. Then SwiftScript: a tree-walking Swift interpreter that needs no toolchain. Then SwiftPorts: pure-Swift reimplementations of gh, glab, git, jq, and the compression family. Then SwiftJS: a Node-shaped runtime on JavaScriptCore.

Four good pieces. Four separate good pieces. Each had its own private notion of “where does stdout go, what’s the working directory, what am I allowed to read, who am I.” The seams between them didn’t yet line up.

I could see, in my head, how they should fit. There needed to be a fifth package, a tiny one, that owned the runtime context: stdio, environment, sandbox, network policy, identity. The four runtimes would each adopt it. The bash interpreter would still own bash semantics; the Swift interpreter would still own Swift semantics; the JS runtime would still own JavaScriptCore. But the shell context, the substrate they all shared, would be one package. I called it ShellKit.

That was the vision. Turning it into a stack of issues (one per repo, each with its own motivation and acceptance criteria) was an evening’s worth of back-and-forth with an agent that did the actual writing. The agent then implemented it across three repos over a single weekend: ShellKit got published, SwiftBash adopted it, SwiftPorts adopted it, the JavaScript runtime got every host-touching surface gated on the new shared Shell type, the SwiftScript shebang dispatch dropped to a five-line bridge. From the first ShellKit-adoption commit to the last green-Windows CI run was about seventeen hours of wall-clock time, almost all of it the agent toiling away on PRs while I pointed and reviewed.

What that produced is something I’m genuinely happy about: a single, composable Shell you can build up with bash commands, Swift port CLIs, a Swift interpreter, and a JS interpreter, hand a sandbox and a network allow-list to, and run a polyglot pipeline through. None of the seams creak. The bash shell pipes into jq which pipes into a Swift script which calls fetch from JavaScript, and every step honors the same sandbox.
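The idea of one shared context that every runtime consults can be sketched in a few lines. This is a toy in Python, not ShellKit's API: ShellContext, can_read, can_fetch, and the two stand-in commands are hypothetical names, and it covers only three of the shared concerns (stdout, sandbox, and network allow-list), leaving out environment and identity.

```python
import io
import posixpath
from dataclasses import dataclass, field


@dataclass
class ShellContext:
    """Hypothetical shared context; names are illustrative, not ShellKit's."""
    stdout: io.StringIO = field(default_factory=io.StringIO)
    working_directory: str = "/sandbox"
    sandbox_root: str = "/sandbox"
    allowed_hosts: frozenset = frozenset()

    def can_read(self, path):
        # Resolve against the working directory and collapse ".." lexically,
        # then require the result to stay inside the sandbox root.
        resolved = posixpath.normpath(posixpath.join(self.working_directory, path))
        return resolved == self.sandbox_root or resolved.startswith(self.sandbox_root + "/")

    def can_fetch(self, host):
        return host in self.allowed_hosts


def toy_cat(ctx, path):
    """Stand-in for a shell builtin: asks the shared context before reading."""
    if not ctx.can_read(path):
        ctx.stdout.write(f"cat: {path}: denied by sandbox\n")
        return 1
    ctx.stdout.write(f"(contents of {path})\n")
    return 0


def toy_fetch(ctx, host):
    """Stand-in for a JS fetch: same context, same policy, same stdout."""
    if not ctx.can_fetch(host):
        ctx.stdout.write(f"fetch: {host}: not in allow-list\n")
        return 1
    ctx.stdout.write(f"(response from {host})\n")
    return 0


ctx = ShellContext(allowed_hosts=frozenset({"api.github.com"}))
codes = [
    toy_cat(ctx, "notes.txt"),
    toy_cat(ctx, "../etc/passwd"),
    toy_fetch(ctx, "api.github.com"),
    toy_fetch(ctx, "example.com"),
]
print(codes)
print(ctx.stdout.getvalue(), end="")
```

Neither toy command carries its own idea of where output goes or what is permitted. Both ask the context they were handed, which is the property that lets differently-implemented runtimes share one pipeline.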

Again, none of that paragraph is the point of this post. The point is: I had a picture, the picture was correct, and the loop took it from picture to working substrate over a weekend.

Some unexpected revelations

A handful of things have surprised me about working this way for the past couple of weeks.

The role inversion is real, and quieter than I expected. I thought I’d feel less useful. I feel more useful. The decisions I make at the issue stage (what to include, what to scope out, what counts as done) propagate through the loop with extraordinary leverage, and they’re the only decisions that are still mine to make. A clear idea produces a clear issue produces a clean PR. A muddy idea produces a muddy issue produces a muddy PR. The thinking I do up front, before the agent ever drafts a word, dwarfs anything I’d save by skipping it.

Codex genuinely catches things. This is the one that surprised me most. I expected agent-on-agent review to be a kind of theatre. It isn’t. The Codex comments find real bugs and real missed edge cases at a hit rate that would be respectable for a careful human reviewer. I’d estimate ninety percent of the comments are worth a 👍 and an actual fix. Two coding agents conversing about a PR is a meaningfully different review than one agent acting alone.

Hill-climbing is exactly as good as the altimeter. Opus 4.7 is very good at the patient, repetitive, “fix one platform quirk at a time” work, but only if the success criterion is unambiguous. The most frustrating moments in the Windows climb were the ones where the CI log was being truncated and the signal of progress was missing. The hill is climbable; the altimeter needs to be reading.

Five platforms changes how you write code. When every commit gets immediately tested on macOS, iOS, Linux, Android, and Windows, your default reflexes change. You stop reaching for Foundation.Process. You stop assuming POSIX. You design for the smallest common surface and add platform-specific niceties on top, rather than the other way around. This isn’t a discipline I’d have adopted alone; the matrix imposed it, and I’m grateful it did.

The day of “weeks of work” is over for a whole class of task. The four-projects-clicking-together work in this post would have been, optimistically, two or three weeks of human effort. It was a weekend. I keep mentally re-calibrating what counts as a reasonable size for “this afternoon’s project”. The answer keeps growing.

How would you tighten this loop?

I’m going to close on the question I most want to ask other people, because I think the readers of this blog are exactly the people who’d have an answer.

There are two soft spots in my loop that I’ve not yet figured out how to fix. They are not technical bottlenecks. They’re coordination bottlenecks, and I think they’re the next interesting frontier.

The “agent stuck on a question while I’m asleep” problem. I described this above. The agent hits a taste question, stops, waits. The clock keeps running and nothing happens. I’d love a setup where the agent can post the question into a channel I’m watching from anywhere (Discord, Slack, an iOS notification, whatever) and I can reply with one tap, and it picks up where it left off. The pieces all exist. Nobody, as far as I can tell, has wired them together yet. If you have, please tell me.
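The shape of the “agent asks, human one-taps” flow is small enough to sketch. This is a hypothetical in-memory version in Python: QuestionBoard and its methods are invented names, and the part that would actually deliver the question to Discord, Slack, or a phone is left out entirely. A timer stands in for the human.

```python
import threading
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PendingQuestion:
    text: str
    options: tuple
    answer: Optional[str] = None
    answered: threading.Event = field(default_factory=threading.Event)


class QuestionBoard:
    """Hypothetical relay between a waiting agent and a human somewhere else."""

    def __init__(self):
        self._lock = threading.Lock()
        self._questions = {}
        self._next_id = 1

    def ask(self, text, options):
        # Called by the agent; returns a ticket it can wait on.
        with self._lock:
            ticket = self._next_id
            self._next_id += 1
            self._questions[ticket] = PendingQuestion(text, tuple(options))
        return ticket

    def tap(self, ticket, choice):
        # Called from the human side. Only pre-offered options are accepted,
        # which is what keeps the reply down to a single tap.
        question = self._questions[ticket]
        if choice not in question.options:
            raise ValueError(f"{choice!r} is not one of {question.options}")
        question.answer = choice
        question.answered.set()

    def wait(self, ticket, timeout=None):
        # Blocks the agent until an answer arrives; returns None on timeout.
        question = self._questions[ticket]
        question.answered.wait(timeout)
        return question.answer


board = QuestionBoard()
ticket = board.ask("Should this live in package A or package B?", ["A", "B"])

# Simulate the human tapping "A" a moment later from another thread.
threading.Timer(0.05, board.tap, args=(ticket, "A")).start()

decision = board.wait(ticket, timeout=5)
print(decision)
```

The design choice that matters is that the agent phrases the question with a closed set of options. An open-ended question needs a keyboard; a choice between two package names does not.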

The “manually pinging the reviewer” problem. Right now, when an external PR comes in, I still have to go to the right topic thread on my Discord and tell my reviewer-bot (“OpenClaw”) to Review PR #N. The review is good when it lands. The pinging is silly. Ideally the moment GitHub sends the new-PR notification, OpenClaw spins up, reviews the diff, and presents me with a one-screen summary plus two buttons: Good to merge and Request these changes. I tap one. Done. The review-trigger step shouldn’t need a human.
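The trigger half of that wish is mostly a filter over GitHub's webhook payload. Here is a minimal Python sketch of that filter, under stated assumptions: the function name, the own_logins parameter, and the sample logins are hypothetical, and how the resulting command reaches the bot is not shown. The payload fields it reads (action, number, pull_request.user.login, pull_request.draft) are the ones GitHub sends with a pull_request event.

```python
def review_request(event_name, payload, own_logins):
    """Return the bot command for a newly opened external PR, or None.

    event_name is the value of the X-GitHub-Event header; own_logins is a
    hypothetical set of accounts whose PRs are already covered elsewhere.
    """
    if event_name != "pull_request":
        return None
    if payload.get("action") not in ("opened", "reopened", "ready_for_review"):
        return None
    pull = payload["pull_request"]
    if pull.get("draft"):
        return None  # wait until the contributor marks it ready
    if pull["user"]["login"] in own_logins:
        return None  # not an external PR
    return f"Review PR #{payload['number']}"


# Trimmed-down sample payload; the login and number are made up.
sample = {
    "action": "opened",
    "number": 42,
    "pull_request": {"draft": False, "user": {"login": "some-contributor"}},
}
command = review_request("pull_request", sample, own_logins={"my-agent-bot"})
print(command)
```

Everything after this point (running the review, producing the one-screen summary, the two buttons) is the part I would still want a human tap for. The filter is only the part that should not need one.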

Local CI runners for hill-climbing. Half-hour Windows builds on GitHub Actions are fine for the merge gate, but they’re a lousy iteration surface. I keep meaning to look at running a Windows VM and an Android emulator on something local, even just for the noisy hill-climb phase, before the change goes back up to the cloud matrix as the canonical check. If you’ve automated this in a way that doesn’t double your maintenance burden, I’d genuinely like to read about it.

There are surely more soft spots; the four I’ve listed (issue-stage questions, agent-asleep questions, manual review pings, slow CI hill-climbing) are the ones I bump into every day. I’m sure the reader bumps into different ones. I’d like to know which.

If you’ve found a way to tighten any part of this loop on an OSS project of your own (your own outer-loop diagram, your own Discord-and-bot incantation, a self-hosted runner setup that earns its keep, an “agent asks, human one-taps” flow that already works), I’d love to hear it. The repos are at Cocoanetics/SwiftBash, Cocoanetics/SwiftScript, Cocoanetics/SwiftPorts, and the new Cocoanetics/shellkit. Open an issue, write to me, or, better yet, post how you’ve solved a piece of this and link me the diagram. Right now everyone seems to be inventing their outer loop in private. I’d like for that to stop being the case.
