Tuesday, March 24, 2026

Keep Deterministic Work Deterministic – O'Reilly


This is the second article in a series on agentic engineering and AI-driven development. Read part one here, and look for the next article on April 2 on O'Reilly Radar.

The first 90% of the code accounts for the first 90% of the development time. The remaining 10% of the code accounts for the other 90% of the development time.
Tom Cargill, Bell Labs

One of the experiments I've been running as part of my work on agentic engineering and AI-driven development is a blackjack simulation where an LLM plays hundreds of hands against blackjack strategies written in plain English. The AI uses those strategy descriptions to make hit/stand/double-down decisions for each hand, while deterministic code deals the cards, checks the math, and verifies that the rules were followed correctly.

Early runs of my simulation had a 37% pass rate. The LLM would add up card totals wrong, skip the dealer's turn entirely, or ignore the strategy it was supposed to follow. The big problem was that these errors compounded: If the model miscounted the player's total on the third card, every decision after that was based on a wrong number, so the whole hand was garbage even if the rest of the logic was fine.

There's a useful way to think about reliability problems like that: the March of Nines. Getting an LLM-based system to 90% reliability is the first nine, and it's the "easy" one. Getting from 90% to 99% takes roughly the same amount of engineering effort. So does getting from 99% to 99.9%. Each nine costs about as much as the last, and you never stop marching. Andrej Karpathy coined the term from his experience building self-driving systems at Tesla, where they spent years earning two or three nines and still had more to go.

Here's a small exercise that shows how that kind of failure compounding works. Open any AI chatbot running an early 2026 model (I used ChatGPT 5.3 Instant) and paste the following eight prompts one at a time, each in a separate message. Go ahead, I'll wait.

Prompt 1: Track a running "score" through a 7-step game. Don't use code, Python, or tools. Do this entirely in your head. For each step, I will give you a sentence and a rule.

CRITICAL INSTRUCTION: You must reply with ONLY the mathematical equation showing how you updated the score. Example format: 10 + 5 = 15 or 20 / 2 = 10. Don't list the words you counted, don't explain your reasoning, and don't write any other text. Just the equation.

Start with a score of 10. I'll give you the first step in the next prompt.

Prompt 2: "The sudden blizzard chilled the small village communities." Add the number of words containing double letters (two of the exact same letter back-to-back, like 'tt' or 'mm').

Prompt 3: "The clever engineer needed seven perfect pieces of cheese." If your score is ODD, add the number of words that contain EXACTLY two 'e's. If your score is EVEN, subtract the number of words that contain EXACTLY two 'e's. (Don't count words with one, three, or zero 'e's.)

Prompt 4: "The good sailor joined the eager crew aboard the wooden boat." If your score is greater than 10, subtract the number of words containing consecutive vowels (two different or identical vowels back-to-back, like 'ea', 'oo', or 'oi'). If your score is 10 or less, multiply your score by this number.

Prompt 5: "The quick brown fox jumps over the lazy dog." Add the number of words where the THIRD letter is a vowel (a, e, i, o, u).

Prompt 6: "Three brave kings stand under black skies." If your score is an ODD number, subtract the number of words that have exactly five letters. If your score is an EVEN number, multiply your score by the number of words that have exactly five letters.

Prompt 7: "Look down, you shy owl, go fly away." Subtract the number of words that contain NONE of these letters: a, e, or i.

Prompt 8: "Green apples fall from tall trees." If your score is greater than 15, subtract the number of words containing the letter 'a'. If your score is 15 or less, add the number of words containing the letter 'l'.

The exercise tracks a running score through seven steps. Each step gives the model a sentence and a counting rule, and the score carries forward. The correct final score is 60. Here's the answer key: start at 10, then 16 (10+6), 12 (16−4), 5 (12−7), 10 (5+5), 70 (10×7), 63 (70−7), 60 (63−3).

I ran this twice at the same time (using ChatGPT 5.3 Instant), and got two completely different wrong answers the first time I tried it. Neither run reached the correct score of 60:

Step                      Correct        Run 1 (transcript)    Run 2 (transcript)
1. Double letters         10 + 6 = 16    10 + 2 = 12 ❌        10 + 5 = 15 ❌
2. Exactly two 'e's       16 − 4 = 12    12 − 4 = 8 ❌         15 + 4 = 19 ❌
3. Consecutive vowels     12 − 7 = 5     8 × 7 = 56 ❌         19 − 5 = 14 ❌
4. Third letter vowel     5 + 5 = 10     56 + 5 = 61 ❌        14 + 3 = 17 ❌
5. Exactly five letters   10 × 7 = 70    61 − 7 = 54 ❌        17 − 4 = 13 ❌
6. No a, e, or i          70 − 7 = 63    54 − 7 = 47 ❌        13 − 3 = 10 ❌
7. Words with 'a'         63 − 3 = 60    47 − 3 = 44 ❌        10 + 4 = 14 ❌

The two runs tell very different stories. In Run 1, the model miscounted in Step 1 (found 2 double-letter words instead of 6) but actually got the later counts right. It didn't matter. The wrong score in Step 1 flipped a branch in Step 3, triggering a multiply instead of a subtract, and the score never recovered. One early mistake threw off the entire chain, even though the model was doing good work after that.

Run 2 was a disaster. The model miscounted at almost every step, compounding errors on top of errors. It ended at 14 instead of 60. That's closer to what Karpathy is describing with the March of Nines: Each step has its own reliability ceiling, and the longer the chain, the higher the chance that at least one step fails and corrupts everything downstream.

What makes this insidious: Both runs look the same from the outside. Each step produced a plausible answer, and both runs produced final results. Without the answer key (or some tedious manual checking), you'd have no way of knowing that Run 1 was a near-miss derailed by a single early error and Run 2 was wrong at nearly every step. That's typical of any process where the output of one LLM call becomes the input for the next one.

These failures don't demonstrate the March of Nines itself—that's specifically about the engineering effort to push reliability from 90% to 99% to 99.9%. (It's possible to reproduce the full compounding-reliability problem in a chat, but a prompt that did it reliably would be far too long to put in an article.) Instead, I opted for a shorter exercise, one you can easily try yourself, that demonstrates the underlying problem that makes the march so hard: cascading failures. Each step asks the model to count letters within words, which is deterministic work that a short Python script handles perfectly. LLMs, on the other hand, don't actually treat words as strings of characters; they see them as tokens. Spotting double letters means unpacking a token into its characters, and the model gets that wrong just often enough to reliably screw it up. I added branching logic where each step's result determines the next step's operation, so a single miscount in Step 1 cascades through the entire sequence.

I also want to be clear about exactly what a deterministic version of this simulation looks like. Fortunately, the AI can help us with that. Go to either run (or your own) and paste one more prompt into the chat:

Prompt 9: Now write a short Python script that does exactly what you just did: start with a score of 10, apply each of the seven rules to the seven sentences, and print the equation at each step.

Run the script. It should print the correct answer for every step, ending at 60. The same AI that just failed the exercise can write code that does it flawlessly, because now it's producing deterministic logic instead of trying to count characters through its tokenizer.
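For reference, here is a minimal sketch of what such a script can look like (the helper names and structure are mine; the model won't necessarily produce this exact code, only something equivalent):

```python
import re

VOWELS = "aeiou"

def words(sentence):
    """Lowercase words with punctuation stripped."""
    return re.findall(r"[a-z]+", sentence.lower())

def count(sentence, pred):
    return sum(1 for w in words(sentence) if pred(w))

def has_double(w):         return any(a == b for a, b in zip(w, w[1:]))
def exactly_two_es(w):     return w.count("e") == 2
def consecutive_vowels(w): return any(a in VOWELS and b in VOWELS for a, b in zip(w, w[1:]))
def third_letter_vowel(w): return len(w) >= 3 and w[2] in VOWELS
def five_letters(w):       return len(w) == 5
def no_a_e_i(w):           return not any(c in "aei" for c in w)

def run_game():
    equations = []
    score = 10

    def apply(op, n):
        nonlocal score
        new = {"+": score + n, "-": score - n, "*": score * n}[op]
        equations.append(f"{score} {op} {n} = {new}")
        score = new

    # Step 1: add words containing double letters.
    apply("+", count("The sudden blizzard chilled the small village communities.", has_double))
    # Step 2: odd score adds, even score subtracts words with exactly two 'e's.
    n = count("The clever engineer needed seven perfect pieces of cheese.", exactly_two_es)
    apply("+" if score % 2 else "-", n)
    # Step 3: over 10 subtracts consecutive-vowel words, otherwise multiplies by them.
    n = count("The good sailor joined the eager crew aboard the wooden boat.", consecutive_vowels)
    apply("-" if score > 10 else "*", n)
    # Step 4: add words whose third letter is a vowel.
    apply("+", count("The quick brown fox jumps over the lazy dog.", third_letter_vowel))
    # Step 5: odd score subtracts five-letter words, even score multiplies by them.
    n = count("Three brave kings stand under black skies.", five_letters)
    apply("-" if score % 2 else "*", n)
    # Step 6: subtract words containing none of a, e, or i.
    apply("-", count("Look down, you shy owl, go fly away.", no_a_e_i))
    # Step 7: over 15 subtracts words with 'a', otherwise adds words with 'l'.
    sentence7 = "Green apples fall from tall trees."
    if score > 15:
        apply("-", count(sentence7, lambda w: "a" in w))
    else:
        apply("+", count(sentence7, lambda w: "l" in w))

    return equations, score

if __name__ == "__main__":
    eqs, final = run_game()
    print("\n".join(eqs))
    print(f"Final score: {final}")  # Final score: 60
```

Every predicate here is a few characters of ordinary string logic, which is exactly the work the model kept fumbling through its tokenizer.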

Reproducing a cascading failure in a chat

I deliberately engineered the earlier exercise to give you a way to experience the cascading failure problem behind the March of Nines yourself. I took advantage of something current LLMs genuinely suck at: parsing characters within tokens. Future models might do a much better job with this specific kind of failure, but the cascading failure problem doesn't go away when the model gets smarter. As long as LLMs are nondeterministic, any step that relies on them has a reliability ceiling below 100%, and those ceilings still multiply. The specific weakness changes; the math doesn't.
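The multiplication is easy to see with a back-of-the-envelope calculation (the per-step reliability numbers below are illustrative, not measurements from my runs):

```python
# Per-step reliability compounds multiplicatively across a chain:
# a seven-step chain of 99%-reliable steps is only ~93% reliable end to end.
for per_step in (0.90, 0.99, 0.999):
    whole_chain = per_step ** 7  # seven steps, like the exercise above
    print(f"{per_step:.3f} per step -> {whole_chain:.3f} for the chain")
```

Lengthen the chain to twenty or fifty steps and even three-nines steps start to leak noticeable failure at the pipeline level.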

I also specifically asked the model to show only the equation and skip all intermediate reasoning to prevent it from using chain of thought (or CoT) to self-correct. Chain of thought is a technique where you require the model to show its work step-by-step (for example, listing the words it counted and explaining why each one qualifies), which helps it catch its own mistakes along the way. CoT is a common way to improve LLM accuracy, and it works. As you'll see later when I talk about the evolution of my blackjack simulation, CoT cut certain errors roughly in half. But "half as many errors" is still not zero. Plus, it's expensive: It costs more tokens and more time. A Python script that counts double letters gets the right answer on every run, instantly, for zero AI API costs (or, if you're running the AI locally, for orders of magnitude less CPU usage). That's the core tension: You can spend engineering effort making the LLM better at deterministic work, or you can just hand it to code.

Every step in this exercise is deterministic work that code handles flawlessly. But most interesting LLM tasks aren't like that. You can't write a deterministic script that plays a hand of blackjack using natural-language strategy rules, or decides how a character should respond in dialogue. Real work requires chaining multiple steps together into a pipeline, a reproducible sequence of steps (some deterministic, some requiring an LLM) that leads to a single result, where each step's output feeds the next. If that sounds like what you just saw in the exercise, it is. Except real pipelines are longer, more complex, and much harder to debug when something goes wrong in the middle.
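In code, that pipeline shape is just function composition over shared state. A minimal sketch (the step names and state keys are invented for illustration, and the LLM call is stubbed out with deterministic arithmetic):

```python
def deal(state):
    # Deterministic step: the real pipeline shuffles a deck;
    # fixed cards keep this sketch reproducible.
    state["cards"] = [10, 5, 8]
    return state

def llm_total(state):
    # Stand-in for an LLM call. A real model reports a total that is
    # *usually* right; this stub just computes it correctly.
    state["reported_total"] = sum(state["cards"])
    return state

def verify(state):
    # Deterministic step: recompute the total and compare.
    state["failed"] = state["reported_total"] != sum(state["cards"])
    return state

def run(pipeline, state):
    for step in pipeline:
        state = step(state)  # each step's output feeds the next step
        if state.get("failed"):
            break  # stop before corrupted state propagates downstream
    return state

result = run([deal, llm_total, verify], {})
```

The structure is trivial; the reliability problem lives entirely inside the steps that call a model.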

LLM pipelines are especially vulnerable to the March of Nines

I've been spending a lot of time thinking about LLM pipelines, and I suspect I'm in the minority. Most people using LLMs are working with single prompts or short conversations. But once you start building multistep workflows where the AI generates structured data that feeds into the next step—whether that's a content generation pipeline, a data processing chain, or a simulation—you run straight into the March of Nines. Each step has a reliability ceiling, and those ceilings multiply. The exercise you just tried had seven steps. The blackjack pipeline has more, and I've been running it hundreds of times per iteration.

The blackjack pipeline in Octobatch
The blackjack pipeline in Octobatch, an open source batch orchestrator for multistep LLM workflows that I introduced in "The Accidental Orchestrator."

That's a screenshot of the blackjack pipeline in Octobatch, the tool I built to run these pipelines at scale. That pipeline deals cards deterministically, asks the LLM to play each hand following a strategy described in plain English, then validates the results with deterministic code. Octobatch makes it easy to change the pipeline and rerun hundreds of hands, which is how I iterated through eight versions—and how I learned the hard way that the March of Nines wasn't just a theoretical problem but something I could watch happening in real time across hundreds of data points.

Running pipelines at scale made the failures obvious and fast, which, for me, really underscored an effective approach to minimizing the cascading failure problem: make deterministic work deterministic. That means asking whether every step in the pipeline actually needs to be an LLM call. Checking that a jack, a 5, and an eight add up to 23 doesn't require a language model. Neither does looking up whether standing on 15 against a dealer 10 follows basic strategy. That's arithmetic and a lookup table—work that ordinary code does perfectly every time. And as I learned over the course of improving the failure rate for the pipeline, every step you pull out of the LLM and make deterministic goes to 100% reliability, which stops it from contributing to the compound failure rate.
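As a sketch of what that kind of deterministic check can look like (the function names are mine and the strategy table is an illustrative fragment, not a full basic-strategy chart or the actual Octobatch pipeline code):

```python
def card_value(rank):
    """Face cards count 10; aces start at 11 and are demoted in hand_total."""
    if rank in ("J", "Q", "K"):
        return 10
    if rank == "A":
        return 11
    return int(rank)

def hand_total(ranks):
    """Best blackjack total, demoting aces from 11 to 1 to avoid busting."""
    total = sum(card_value(r) for r in ranks)
    aces = ranks.count("A")
    while total > 21 and aces:
        total -= 10
        aces -= 1
    return total

# Illustrative fragment of a hard-total basic-strategy table:
# (player_total, dealer_upcard_value) -> correct first action.
BASIC_STRATEGY = {
    (15, 10): "hit",
    (15, 6): "stand",
    (16, 10): "hit",
    (11, 6): "double",
}

def verify_hand(ranks, dealer_up, reported_total, action):
    """Deterministically check the LLM's arithmetic and its first action."""
    total = hand_total(ranks)
    if reported_total != total:
        return f"math error: reported {reported_total}, actual {total}"
    expected = BASIC_STRATEGY.get((total, dealer_up))
    if expected and action != expected:
        return f"strategy error: played {action}, table says {expected}"
    return "ok"

print(verify_hand(["J", "5", "8"], 10, 23, "stand"))  # prints "ok": a jack, a 5, and an 8 really do total 23
```

Nothing here ever hallucinates, costs a token, or changes its answer between runs.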

Relying on the AI for deterministic work is the computation side of a pattern I wrote about for data in "AI, MCP, and the Hidden Costs of Data Hoarding." Teams dump everything into the AI's context because the AI can handle it—until it can't. The same thing happens with computation: Teams let the AI do arithmetic, string matching, or rule evaluation because it mostly works. But "mostly works" is expensive and slow, and a short script does it perfectly. Better yet, the AI can write that script for you—which is exactly what Prompt 9 demonstrated.

Getting cascading failures out of the blackjack pipeline

I pushed the blackjack pipeline through eight iterations, and the results taught me more about earning nines than I expected. That's why I'm writing this article—the iteration arc turned out to be one of the clearest illustrations I've found of how the principle works in practice.

I addressed failures two ways, and the distinction matters.

Some failures called for making work deterministic. Card dealing runs as a local expression step, which doesn't require an API call, so it's free, instant, and 100% reproducible. There's a math verification step that uses code to recalculate totals from the actual cards dealt and compares them against what the LLM reported, and a strategy compliance step that checks the player's first action against a deterministic lookup table. Neither of those steps requires the AI to make a judgment call; when I originally ran them as LLM calls, they introduced errors that were hard to detect and expensive to debug.

Other failures called for structural constraints that made specific error patterns harder to produce. Chain of thought format forced the LLM to show its work instead of jumping to conclusions. The rigid dealer output structure made it mechanically difficult to skip the dealer's turn. Explicit warnings about counterintuitive rules gave the LLM a reason to override its training priors. These don't eliminate the LLM from the step—they make the LLM more reliable within it.

But before any of that mattered, I had to face the uncomfortable fact that measurements themselves can be wrong, especially when you rely on AI to take those measurements. For example, the first run reported a 57% pass rate, which was great! But when I looked at the data myself, several runs were clearly wrong. It turned out that the pipeline had a bug: Verification steps were running, but the AI step that was supposed to enforce them didn't have enough guardrails, so almost every hand passed regardless of the actual data. I asked three AI advisors to review the pipeline, and none of them caught it. The only thing that exposed it was checking the aggregate numbers, which didn't add up. If you let probabilistic behavior into a step that should be deterministic, the output will look plausible and the system will report success, but you have no way to know something's wrong until you go looking for it.

Once I fixed the bug, the real pass rate emerged: 31%. Here's how the next seven iterations played out:

  • Restructuring the data (31% → 37%). The LLM kept losing track of where it was in the deck, so I restructured the data it received to eliminate the bookkeeping. I also removed split hands entirely, because tracking two simultaneous hands is stateful bookkeeping that LLMs reliably botch. Each fix came from looking at what was actually failing and asking whether the LLM needed to be doing that work at all.
  • Chain of thought arithmetic (37% → 48%). Instead of letting the LLM jump to a final card total, I required it to show the running math at every step. Forcing the model to trace its own calculations cut multidraw errors roughly in half. CoT is a structural constraint, not a deterministic replacement; it makes the LLM more reliable within the step, but it's also more expensive because it uses more tokens and takes more time.
  • Replacing the LLM validator with deterministic code (48% → 79%). This was the single biggest improvement in the entire arc. The pipeline had a second LLM call that scored how accurately the player followed strategy, and it was wrong 73% of the time. It applied its own blackjack intuitions instead of the rules I'd given it. But there's a right answer for every situation in basic strategy, and the rules can be written as a lookup table. Replacing the LLM validator with a deterministic expression step recovered over 150 incorrectly rejected hands.
  • Rigid output format (79% → 81%). The LLM kept skipping the dealer's turn entirely, jumping straight to declaring a winner. Requiring a step-by-step dealer output format made it mechanically difficult to skip ahead.
  • Overriding the model's priors (81% → 84%). One strategy required hitting on 18 against a high dealer card, which any conventional blackjack wisdom says is terrible. The LLM refused to do it. Restating the rule didn't help. Explaining why the counterintuitive rule exists did: The prompt had to tell the model that the bad play was intentional.
  • Switching models (84% → 94%). I switched from Gemini Flash 2.0 to Haiku 4.6, which was easy to do because Octobatch lets you run the same pipeline with any model from Gemini, Anthropic, or OpenAI. I finally earned my first nine.
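To make the rigid-output-format idea concrete, here is a sketch of a deterministic gate that rejects any hand where the model skipped the dealer's turn (the JSON field names are hypothetical, not Octobatch's actual schema):

```python
import json

# Hypothetical required fields for the dealer step's structured output.
REQUIRED_KEYS = ("dealer_upcard", "dealer_draws", "dealer_final_total", "outcome")

def validate_dealer_output(raw):
    """Deterministically reject LLM output that omits the dealer's turn."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    missing = [k for k in REQUIRED_KEYS if k not in data]
    if missing:
        return False, f"missing fields: {missing}"
    if not isinstance(data["dealer_draws"], list):
        return False, "dealer_draws must be a list (empty if the dealer stands)"
    return True, "ok"

ok, why = validate_dealer_output(
    '{"dealer_upcard": "10", "dealer_draws": ["6"],'
    ' "dealer_final_total": 16, "outcome": "player_wins"}'
)
```

A model can still fill the fields in wrong, but it can no longer jump straight to declaring a winner without producing a dealer turn at all, and the math verification step catches wrong values downstream.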

Find the best ways to earn your nines

If you're building anything where LLM output feeds into the next step, the same question applies to every step in your chain: Does this actually require judgment, or is it deterministic work that ended up in the LLM because the LLM can do it? The strategy validator felt like a judgment call until I looked at what it was actually doing, which was checking a hand against a lookup table. That one realization was worth more than all the prompt engineering combined. And as Prompt 9 showed, the AI is often the best tool for writing its own deterministic replacement.

I learned this lesson through my own work on the blackjack pipeline. It went through eight iterations, and I think the numbers tell a story. The fixes fell into two categories: making work deterministic (pulling it out of the LLM entirely) and adding structural constraints (making the LLM more reliable within a step). Both earn nines, but pulling work out of the LLM entirely earns those nines faster. The biggest single jump in the whole arc—48% to 79%—came from replacing an LLM validator with a 10-line expression.

Here's the bottom line for me: If you can write a short function that does the job, don't give it to the LLM. I originally reached for the LLM for strategy validation because it felt like a judgment call, but once I looked at the data I realized it wasn't at all. There was a right answer for every hand, and a lookup table found it more reliably than a language model.

At the end of eight iterations, the pipeline passed 94% of hands. The 6% that still fail may be honest limits of what the model can do with multistep arithmetic and state tracking in a single prompt. But they may just be the next nine that I need to earn.

The next article looks at the other side of this problem: Once you know what to make deterministic, how do you make the whole system legible enough that an AI can help your users build with it? The answer looks like a kind of documentation you write for AI to read, not humans—and it changes the way you think about what a user manual is for.
