
The Accidental Orchestrator



This is the first article in a series on agentic engineering and AI-driven development. Look for the next article on March 19 on O'Reilly Radar.

There's been a lot of hype about AI and software development, and it comes in two flavors. One says, "We're all doomed, that tools like Claude Code will make software engineering obsolete within a year." The other says, "Don't worry, everything's fine, AI is just another tool in the toolbox." Neither is honest.

I've spent over 20 years writing about software development for practitioners, covering everything from coding and architecture to project management and team dynamics. For the last two years I've focused on AI, training developers to use these tools effectively and writing about what works and what doesn't in books, articles, and reports. And I kept running into the same problem: I had yet to find anyone with a coherent answer for how professional developers should actually work with these tools. There are plenty of tips and plenty of hype, but very little structure, and very little you could follow, teach, critique, or improve.

I'd been observing developers at work using AI with varying levels of success, and I realized we need to start thinking about this as its own discipline. Andrej Karpathy, the former head of AI at Tesla and a founding member of OpenAI, recently proposed the term "agentic engineering" for disciplined development with AI agents, and others like Addy Osmani are getting on board. Osmani's framing is that AI agents handle implementation but the human owns the architecture, reviews every diff, and tests relentlessly. I think that's right.

But I've spent much of the last two years teaching developers how to use tools like Claude Code, agent mode in Copilot, Cursor, and others, and what I keep hearing is that they already know they should be reviewing the AI's output, maintaining the architecture, writing tests, keeping documentation current, and staying in control of the codebase. They know how to do it in theory. But they get stuck trying to apply it in practice: How do you actually review thousands of lines of AI-generated code? How do you keep the architecture coherent when you're working across multiple AI tools over weeks? How do you know when the AI is confidently wrong? And it's not just junior developers who are having trouble with agentic engineering. I've talked to senior engineers who struggle with the shift to agentic tools, and intermediate developers who take to it naturally. The difference isn't necessarily the years of experience; it's whether they've found an effective and structured way to work with AI coding tools. That gap between knowing what developers should be doing with agentic engineering and knowing how to integrate it into their day-to-day work is a real source of anxiety for a lot of engineers right now. That's the gap this series is trying to fill.

Despite what much of the hype about agentic engineering is telling you, this kind of development doesn't eliminate the need for developer expertise; just the opposite. Working effectively with AI agents actually raises the bar for what developers need to know. I wrote about that experience gap in an earlier O'Reilly Radar piece called "The Cognitive Shortcut Paradox." The developers who get the most from working with AI coding tools are the ones who already know what good software looks like, and can often tell if the AI wrote it.

The idea that AI tools work best when experienced developers are driving them matched everything I'd observed. It rang true, and I wanted to prove it in a way that other developers would understand: by building software. So I started building a specific, practical approach to agentic engineering designed for developers to follow, and then I put it to the test. I used it to build a production system from scratch, with the rule that AI would write all the code. I needed a project that was complex enough to stress-test the approach, and interesting enough to keep me engaged through the hard parts. I wanted to apply everything I'd learned and discover what I still didn't know. That's when I came back to Monte Carlo simulations.

The experiment

I've been obsessed with Monte Carlo simulations ever since I was a kid. My dad's an epidemiologist—his whole career has been about finding patterns in messy population data, which means statistics was always a part of our lives (and it also means I learned SPSS at a very early age). When I was maybe 11 he told me about the drunken sailor problem: A sailor leaves a bar on a pier, taking a random step toward the water or toward his ship each time. Does he fall in or make it home? You can't know from any single run. But run the simulation a thousand times, and the pattern emerges from the noise. The individual outcome is random; the aggregate is predictable.

I remember writing that simulation in BASIC on my TRS-80 Color Computer 2: a little blocky sailor stumbling across the screen, two steps forward, one step back. The drunken sailor is the "Hello, world" of Monte Carlo simulations. Monte Carlo is a technique for problems you can't solve analytically: You simulate them hundreds or thousands of times and measure the aggregate results. Each individual run is random, but the statistics converge on the true answer as the sample size grows. It's a technique we use to model everything from nuclear physics to financial risk to the spread of disease across populations.
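
For readers who want that "Hello, world" in runnable form, here's a minimal Python version of the drunken sailor. The pier width, step count, and seed are arbitrary choices for illustration, not the original BASIC program's:

```python
import random

def sailor_falls_in(steps: int, pier_width: int, rng: random.Random) -> bool:
    """Simulate one drunken sailor: a 1D random walk starting mid-pier.

    Returns True if the sailor reaches the water's edge before
    reaching the ship or running out of steps.
    """
    position = pier_width // 2
    for _ in range(steps):
        position += rng.choice((-1, 1))
        if position <= 0:           # fell in the water
            return True
        if position >= pier_width:  # made it to the ship
            return False
    return False

# One run tells you nothing; the aggregate converges as the runs pile up.
rng = random.Random(42)  # seeded for reproducibility
runs = 10_000
fell = sum(sailor_falls_in(steps=100, pier_width=10, rng=rng) for _ in range(runs))
print(f"{fell / runs:.1%} of sailors fell in across {runs} runs")
```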

What if you could run that kind of simulation today by describing it in plain English? Not a toy demo but thousands of iterations with seeded randomness for reproducibility, where the outputs get validated and the results get aggregated into actual statistics you can use. Or a pipeline where an LLM generates content, a second LLM scores it, and anything that doesn't pass gets sent back for another try.
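
Here's a minimal sketch of that generate-score-retry loop in Python. Everything in it—`generate`, `score`, the `Verdict` fields, the retry limit—is a hypothetical stand-in to show the shape of the pipeline, not Octobatch's actual API:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    passed: bool      # did the output clear the quality gate?
    value: float      # the scoring model's numeric score
    feedback: str     # why it failed, fed back into the retry

def generate(prompt: str) -> str:
    """Stand-in for the generator LLM call (hypothetical)."""
    ...

def score(draft: str) -> Verdict:
    """Stand-in for a second, separate scoring LLM call (hypothetical)."""
    ...

def run_unit(prompt: str, max_attempts: int = 3) -> dict | None:
    """One pipeline unit: generate, score with a second model, retry on failure."""
    for attempt in range(1, max_attempts + 1):
        draft = generate(prompt)
        verdict = score(draft)
        if verdict.passed:
            return {"output": draft, "score": verdict.value, "attempts": attempt}
        # Feed the reviewer's feedback into the next attempt.
        prompt = f"{prompt}\n\nReviewer feedback: {verdict.feedback}"
    return None  # flag for human review after repeated failures
```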

The goal of my experiment was to build that system, which I called Octobatch. Right now, the industry is constantly looking for new real-world end-to-end case studies in agentic engineering, and I wanted Octobatch to be exactly that case study.

I took everything I'd learned from teaching and observing developers working with AI, put it to the test by building a real system from scratch, and turned the lessons into a structured approach to agentic engineering I'm calling AI-driven development, or AIDD. This is the first article in a series about what agentic engineering looks like in practice, what it demands from the developer, and how you can apply it to your own work.

The result is a fully functioning, well-tested application: about 21,000 lines of Python across several dozen files, backed by full specifications, nearly a thousand automated tests, and quality integration and regression test suites. I used Claude Cowork to review all the AI chats from the entire project, and it turns out I built the entire application in roughly 75 hours of active development time over seven weeks. For comparison, I built Octobatch in just over half the time I spent last year playing Blue Prince.

But this series isn't just about Octobatch. I integrated AI tools at every stage: Claude and Gemini collaborating on architecture, Claude Code writing the implementation, LLMs generating the pipelines that run on the system they helped build. This series is about what I learned from that process: the patterns that worked, the failures that taught me the most, and the orchestration mindset that ties it all together. Each article pulls a different lesson from the experiment, from validation architecture to multi-LLM coordination to the values that kept the project on track.

Agentic engineering and AI-driven development

When most people talk about using AI to write code, they mean one of two things: AI coding assistants like GitHub Copilot, Cursor, or Windsurf, which have evolved well beyond autocomplete into agentic tools that can run multifile editing sessions and define custom agents; or "vibe coding," where you describe what you want in natural language and accept whatever comes back. These coding assistants are genuinely impressive, and vibe coding can be really productive.

Using these tools effectively on a real project, however—maintaining architectural coherence across thousands of lines of AI-generated code—is a different problem entirely. AIDD aims to help solve that problem. It's a structured approach to agentic engineering where AI tools drive substantial portions of the implementation, architecture, and even project management, while you, the human in the loop, decide what gets built and whether it's any good. By "structure," I mean a set of practices developers can learn and follow, a way to know whether the AI's output is actually good, and a way to stay on track across the lifetime of a project. If agentic engineering is the discipline, AIDD is one way to practice it.

In AI-driven development, developers don't just accept suggestions or hope the output is correct. They assign specific roles to specific tools: one LLM for architecture planning, another for code review, a coding agent for implementation, and the human for vision, verification, and the decisions that require understanding the whole system.

And the "driven" part is literal. The AI is writing almost all of the code. One of my ground rules for the Octobatch experiment was that I'd let AI write all of it. I have high code quality standards, and part of the experiment was seeing whether AIDD could produce a system that meets them. The human decides what gets built, evaluates whether it's right, and maintains the constraints that keep the system coherent.

Not everyone agrees on how much the developer needs to stay in the loop, and the fully autonomous end of the spectrum is already producing cautionary tales. Nicholas Carlini at Anthropic recently tasked 16 Claude instances with building a C compiler in parallel with no human in the loop. After 2,000 sessions and $20,000 in API costs, the agents produced a 100,000-line compiler that can build a Linux kernel but isn't a drop-in replacement for anything, and when all 16 agents got stuck on the same bug, Carlini had to step back in and partition the work himself. Even strong advocates of a completely hands-off, vibe-driven approach to agentic engineering might call that a step too far. The question is how much human judgment you need to make that code trustworthy, and what specific practices help you apply that judgment effectively.

The orchestration mindset

If you want to get developers thinking about agentic engineering in the right way, you have to start with how they think about working with AI, not just what tools they use. That's where I started when I began building a structured approach, and it's why I started with habits. I developed a framework for these called the Sens-AI Framework, published as both an O'Reilly report (Critical Thinking Habits for Coding with AI) and a Radar series. It's built around five practices: providing context, doing research before prompting, framing problems precisely, iterating deliberately on outputs, and applying critical thinking to everything the AI produces. I started there because habits are how you lock in the way you think about how you're working. Without them, AI-driven development produces plausible-looking code that falls apart under scrutiny. With them, it produces systems that a single developer couldn't build alone in the same timeframe.

Habits are the foundation, but they're not the whole picture. AIDD also has practices (concrete techniques like multi-LLM coordination, context file management, and using one model to validate another's output) and values (the principles behind those practices). If you've worked with Agile methodologies like Scrum or XP, that structure should feel familiar: Practices tell you how to work day-to-day, and habits are the reflexes you develop so that the practices become automatic.

Values often seem weirdly theoretical, but they're an important piece of the puzzle because they guide your decisions when the practices don't give you a clear answer. There's an emerging culture around agentic engineering right now, and the values you bring to your project either fit or clash with that culture. Understanding where the values come from is what makes the practices stick. All of that leads to a whole new mindset, what I'm calling the orchestration mindset. This series builds all four layers, using Octobatch as the proving ground.

Octobatch was a deliberate experiment in AIDD. I designed the project as a test case for the entire approach, to see what a disciplined AI-driven workflow could produce and where it would break down, and I used it to apply and improve the practices and values to make them effective and easy to adopt. And whether by instinct or coincidence, I picked the perfect project for this experiment. Octobatch is a batch orchestrator. It coordinates asynchronous jobs, manages state across failures, tracks dependencies between pipeline steps, and makes sure validated results come out the other end. That kind of system is fun to design, but a lot of the details, like state machines, retry logic, crash recovery, and cost accounting, can be tedious to implement. It's exactly the kind of work where AIDD should shine, because the patterns are well understood but the implementation is repetitive and error-prone.

Orchestration—the work of coordinating multiple independent processes toward a coherent outcome—evolved into a core idea behind AIDD. I found myself orchestrating LLMs the same way Octobatch orchestrates batch jobs: assigning roles, managing handoffs, validating outputs, recovering from failures. The system I was building and the process I was using to build it followed the same pattern. I didn't expect it when I started, but building a system that orchestrates AI turns out to be a pretty good way to learn how to orchestrate AI. That's the accidental part of the accidental orchestrator. That parallel runs through every article in this series.


The path to batch

I didn't begin the Octobatch project with a full end-to-end Monte Carlo simulation. I started where most people start: typing prompts into a chat interface. I was experimenting with different simulation and generation ideas to give the project some structure, and a few of them stuck. A blackjack strategy comparison turned out to be a great test case for a multistep Monte Carlo simulation. NPC dialogue generation for a role-playing game gave me a creative workload with subjective quality to measure. Both had the same shape: a set of structured inputs, each processed the same way. So I had Claude write a simple script to automate what I'd been doing by hand, and I used Gemini to double-check the work, make sure Claude really understood my ask, and fix hallucinations. It worked fine at small scale, but once I started running more than 100 or so units, I kept hitting rate limits, the caps that providers put on how many API requests you can make per minute.

That's what pushed me to LLM batch APIs. Instead of sending individual prompts one at a time and waiting for each response, the major LLM providers all offer batch APIs that let you submit a file containing all of your requests at once. The provider processes them on their own schedule; you wait for results instead of getting them immediately, but you don't have to worry about rate caps. I was happy to discover they also cost 50% less, and that's when I started tracking token usage and costs in earnest. But the real surprise was that batch APIs performed better than real-time APIs at scale. Once pipelines got past the 100- or 200-unit mark, batch started running significantly faster than real time. The provider processes the whole batch in parallel on their infrastructure, so you're not bottlenecked by round-trip latency or rate caps anymore.
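
As a concrete illustration, here's roughly what a batch submission looks like—a minimal sketch based on the documented shape of OpenAI's Batch API (Anthropic's and Google's batch endpoints follow the same submit-poll-download pattern with their own request formats; the model name and prompts below are placeholders):

```python
import json
from openai import OpenAI

client = OpenAI()

# One JSONL line per request, each tagged with a custom_id so you can
# match results (which may come back in any order) to inputs.
with open("requests.jsonl", "w") as f:
    for i, prompt in enumerate(["Simulate hand #1...", "Simulate hand #2..."]):
        f.write(json.dumps({
            "custom_id": f"unit-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "gpt-4o-mini",
                     "messages": [{"role": "user", "content": prompt}]},
        }) + "\n")

# Upload the file and submit the whole batch at once.
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(input_file_id=batch_file.id,
                              endpoint="/v1/chat/completions",
                              completion_window="24h")

# The job is asynchronous: poll until it completes, then download results.
status = client.batches.retrieve(batch.id).status  # e.g. "in_progress", "completed"
```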

The switch to batch APIs changed how I thought about the whole problem of coordinating LLM API calls at scale, and led to the idea of configurable pipelines. I could chain stages together: The output of one step could become the input to the next, and I could kick off the whole pipeline and come back to finished results. It turns out I wasn't the only one making the shift to batch APIs. Between April 2024 and July 2025, OpenAI, Anthropic, and Google all launched batch APIs, converging on the same pricing model: 50% of the real-time price in exchange for asynchronous processing.

You probably didn't notice when all three major AI providers launched batch APIs. The industry conversation was dominated by agents, tool use, MCP, and real-time reasoning. Batch APIs shipped with relatively little fanfare, but they represent a real shift in how we can use LLMs. Instead of treating them as conversational partners or one-shot SaaS APIs, we can treat them as processing infrastructure, closer to a MapReduce job than a chatbot. You give them structured data and a prompt template, and they process it all and hand back the results. What matters is that you can now run tens of thousands of these transformations reliably, at scale, without managing rate limits or connection failures.

Why orchestration?

If batch APIs are so useful, why can't you just write a for-loop that submits requests and collects results? You can, and for simple cases a quick script with a for-loop works fine. But once you start running larger workloads, the problems start to pile up. Solving those problems turned out to be one of the most important lessons in developing a structured approach to agentic engineering.

First, batch jobs are asynchronous. You submit a job, and results come back hours later, so your script needs to track what was submitted and poll for completion. If your script crashes in the middle, you lose that state. Second, batch jobs can partially fail. Maybe 97% of your requests succeeded and 3% didn't. Your code needs to figure out which 3% failed, extract them, and resubmit just those items. Third, if you're building a multistage pipeline where the output of one step feeds into the next, you need to track dependencies between stages. And fourth, you need cost accounting. When you're running tens of thousands of requests, you want to know how much you spent, and ideally, how much you're going to spend when you first start the batch. Every one of these has a direct parallel to what you're doing in agentic engineering: keeping track of the work multiple AI agents are doing at once, dealing with code failures and bugs, making sure the entire project stays coherent when AI coding tools are only looking at the one part currently in context, and stepping back to look at the broader project management picture.
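
To make those four problems concrete, here's a hedged sketch of the kind of persistent manifest a batch runner needs. The file layout and field names are hypothetical illustrations, not Octobatch's actual format:

```python
import json
from pathlib import Path

MANIFEST = Path("manifest.json")

def load_manifest() -> dict:
    """Reload state from disk so a crashed run can pick up where it left off."""
    if MANIFEST.exists():
        return json.loads(MANIFEST.read_text())
    return {"stages": {}, "spent_usd": 0.0}

def record(manifest: dict, stage: str, unit_id: str, status: str, cost: float = 0.0):
    """Track per-unit status ("submitted", "succeeded", "failed") and running cost."""
    manifest["stages"].setdefault(stage, {})[unit_id] = status
    manifest["spent_usd"] += cost
    MANIFEST.write_text(json.dumps(manifest, indent=2))  # persist every transition

def failed_units(manifest: dict, stage: str) -> list[str]:
    """The 3% that need resubmission, without rerunning the other 97%."""
    return [u for u, s in manifest["stages"].get(stage, {}).items() if s == "failed"]
```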

All of these problems are solvable, but they're not problems you want to solve over and over (in either situation—when you're orchestrating LLM batch jobs or orchestrating AI coding tools). Solving them in the code taught some interesting lessons about the overall approach to agentic engineering. Batch processing moves the complexity from connection management to state management. Real-time APIs are hard because of rate limits and retries. Batch APIs are hard because you have to track what's in flight, what succeeded, what failed, and what's next.

Before I started development, I went looking for existing tools that handled this combination of problems, because I didn't want to waste my time reinventing the wheel. I didn't find anything that did the job I needed. Workflow orchestrators like Apache Airflow and Dagster manage DAGs and task dependencies, but they assume tasks are deterministic and don't provide LLM-specific features like prompt template rendering, schema-based output validation, or retry logic triggered by semantic quality checks. LLM frameworks like LangChain and LlamaIndex are designed around real-time inference chains and agent loops—they don't manage asynchronous batch job lifecycles, persist state across process crashes, or handle partial failure recovery at the chunk level. And the batch API client libraries from the providers themselves handle submission and retrieval for a single batch, but not multistage pipelines, cross-step validation, or provider-agnostic execution.

Nothing I found covered the full lifecycle of multiphase LLM batch workflows, from submission and polling through validation, retry, cost tracking, and crash recovery, across all three major AI providers. That's what I built.

Lessons from the experiment

The goal of this article, as the first one in my series on agentic engineering and AI-driven development, is to lay out the theory and structure of the Octobatch experiment. The rest of the series goes deep on the lessons I learned from it: the validation architecture, multi-LLM coordination, the practices and values that emerged from the work, and the orchestration mindset that ties it all together. A few early lessons stand out, because they illustrate what AIDD looks like in practice and why developer expertise matters more than ever.

  • You have to run things and check the data. Remember the drunken sailor, the "Hello, world" of Monte Carlo simulations? At one point I noticed that when I ran the simulation through Octobatch, 77.5% of the sailors fell in the water. The results for a random walk should be 50/50, so clearly something was badly wrong. It turned out the random number generator was being re-seeded at every iteration with sequential seed values, which created correlation bias between runs. I didn't figure out the problem immediately; I ran a bunch of tests using Claude Code as a test runner to generate each test, run it, and log the results; Gemini looked at the results and found the root cause. Claude had trouble coming up with a fix that worked well, and proposed a workaround with a large list of preseeded random number values in the pipeline. Gemini, reviewing my conversations with Claude, proposed a hash-based fix, but it seemed overly complex. Once I understood the problem and rejected their proposed solutions, I decided the best fix was simpler than either of the AIs' suggestions: a persistent RNG per simulation unit that advanced naturally through its sequence (see the sketch after this list). I needed to understand both the statistics and the code to evaluate those three options. Plausible-looking output and correct output aren't the same thing, and you need enough expertise to tell the difference. (We'll talk more about this case in the next article in the series.)
  • LLMs often overestimate complexity. At one point I wanted to add support for custom mathematical expressions in the analysis pipeline. Both Claude and Gemini pushed back, telling me, "This is scope creep for v1.0" and "Save it for v1.1." Claude estimated three hours to implement. Because I knew the codebase, I knew we were already using asteval, a Python library that provides a safe, minimalistic evaluator for mathematical expressions and simple Python statements, elsewhere in the system, so this seemed like a simple reuse of a library we already depended on. Both LLMs thought the solution would be far more complex and time-consuming than it actually was; it took just two prompts to Claude Code (generated by Claude), and about five minutes total to implement. The feature shipped and made the tool significantly more powerful. The AIs were being conservative because they didn't have my context about the system's architecture. Experience told me the integration would be trivial. Without that experience, I would have listened to them and deferred a feature that took five minutes.
  • AI is often biased toward adding code, not deleting it. Generative AI is, unsurprisingly, biased toward generation. So when I asked the LLMs to fix problems, their first response was often to add more code, adding another layer or another special case. I can't think of a single time in the entire project when one of the AIs stepped back and said, "Tear this out and rethink the approach." The most productive sessions were the ones where I overrode that instinct and pushed for simplicity. This is something experienced developers learn over a career: The most successful changes often delete more than they add—the PRs we brag about are the ones that delete thousands of lines of code.
  • The architecture emerged from failure. The AI tools and I didn't design Octobatch's core architecture up front. Our first attempt was a Python script with in-memory state and a lot of hope. It worked for small batches but fell apart at scale: A network hiccup meant restarting from scratch, and a malformed response required manual triage. A lot of things fell into place after I added the constraint that the system must survive being killed at any moment. That single requirement led to the tick model (wake up, check state, do work, persist, exit), the manifest file as source of truth, and the entire crash-recovery architecture. We discovered the design by repeatedly failing to do something simpler.
  • Your development history is a dataset. I just told you several stories from the Octobatch project, and this series will be full of them. Every one of those stories came from going back through the chat logs between me, Claude, and Gemini. With AIDD, you have a complete transcript of every architectural decision, every wrong turn, every moment where you overruled the AI and every moment where it corrected you. Very few development teams have ever had that level of fidelity in their project history. Mining those logs for lessons learned turns out to be one of the most valuable practices I've found.
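
To make the RNG lesson from the first bullet concrete, here's a simplified sketch of the bug and the fix. The function names and seeding scheme are illustrative reconstructions, not Octobatch's actual code:

```python
import random

# The bug (simplified): re-seeding with sequential values at every
# iteration correlates the draws across runs, skewing the aggregate
# statistics (the 77.5% instead of ~50%).
def walk_biased(unit_id: int, n_steps: int) -> list[int]:
    steps = []
    for i in range(n_steps):
        random.seed(unit_id + i)  # sequential seeds -> correlated draws
        steps.append(random.choice((-1, 1)))
    return steps

# The fix: one persistent RNG per simulation unit, seeded once and
# advanced naturally through its own sequence.
def walk_correct(unit_id: int, n_steps: int) -> list[int]:
    rng = random.Random(unit_id)  # independent, reproducible stream per unit
    return [rng.choice((-1, 1)) for _ in range(n_steps)]
```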

Near the end of the project, I switched to Cursor to make sure none of this was specific to Claude Code. I created fresh conversations using the same context files I'd been maintaining throughout development, and was able to bootstrap productive sessions immediately; the context files worked exactly as designed. The practices I'd developed transferred cleanly to a different tool. The value of this approach comes from the habits, the context management, and the engineering judgment you bring to the conversation, not from any particular vendor.

These tools are moving the world in a direction that favors developers who understand the ways engineering can go wrong and know solid design and architecture patterns…and who are okay letting go of control of every line of code.

What's next

Agentic engineering needs structure, and structure needs a concrete example to make it real. The next article in this series goes into Octobatch itself, because the way it orchestrates AI is a remarkably close parallel to what AIDD asks developers to do. Octobatch assigns roles to different processing steps, manages handoffs between them, validates their outputs, and recovers when they fail. That's the same pattern I followed when building it: assigning roles to Claude and Gemini, managing handoffs between them, validating their outputs, and recovering when they went down the wrong path. Understanding how the system works turns out to be a good way to understand how to orchestrate AI-driven development. I'll walk through the architecture, show what a real pipeline looks like from prompt to results, present the data from a 300-hand blackjack Monte Carlo simulation that puts all of these ideas to the test, and use all of that to demonstrate ideas we can apply directly to agentic engineering and AI-driven development.

Later articles go deeper into the practices and ideas from this experiment that make AI-driven development work: how I coordinated multiple AI models without losing control of the architecture, what happened when I tested the code against what I actually meant to build, and what I learned about the gap between code that runs and code that does what you intended. Along the way, the experiment produced some findings about how different AI models see code that I didn't expect—and that turned out to matter more than I thought they would.
