
The Toolkit Pattern – O'Reilly


This is the third article in a series on agentic engineering and AI-driven development. Read part one here, part two here, and look for the next article on April 15 on O'Reilly Radar.

The toolkit pattern is a way of documenting your project's configuration so that any AI can generate working inputs from a plain-English description. You and the AI create a single file that describes your tool's configuration format, its constraints, and enough worked examples to make that generation reliable. You build it iteratively, working with the AI (or, better, multiple AIs) to draft it. You test it by starting a fresh AI session and trying to use it, and every time that fails you grow the toolkit from those failures. When you build the toolkit well, your users never have to learn how your tool's configuration files work, because they describe what they want in conversation and the AI handles the translation. That means you don't have to compromise on the way your project is configured, because the config files can be more complex and more complete than they'd be if a human had to edit and understand them.

To understand why all of this matters, let me take you back to the mid-1980s.

I was 12 years old, and our family got an AT&T PC 6300, an IBM-compatible that came with a user's guide roughly 159 pages long. Chapter 4 of that manual was called "What Every User Should Know." It covered things like how to use the keyboard, how to care for your diskettes, and, memorably, how to label them, complete with hand-drawn illustrations and genuinely helpful advice, like how you should only use felt-tipped pens, never ballpoint, because the pressure could damage the magnetic surface.

A page from the AT&T PC 6300 User's Guide, Chapter 4: "Labeling Diskettes"

I remember being fascinated by this manual. It wasn't our first computer. I'd been writing BASIC programs and dialing into BBSs and CompuServe for a couple of years, so I knew there were all kinds of wonderful things you could do with a PC, especially one with a blazing fast 8MHz processor. But the manual barely mentioned any of that. It seemed really strange to me, even as a kid, that you'd give somebody a manual with a whole page on using the backspace key to correct typing errors (really!) that didn't actually tell them how to use the thing to do anything useful.

That's how most developer documentation works. We write the stuff that's easy to write (installation, setup, the getting-started guide) because it's a lot easier than writing the stuff that's actually hard: the deep explanation of how all the pieces fit together, the constraints you only discover by hitting them, the patterns that separate a configuration that works from one that almost works. This is yet another "looking for your keys under the streetlight" problem: We write the documentation we write because it's easiest to write, even when it's not really the documentation our users need.

Developers who came up through the Unix era know this well. Man pages were thorough, accurate, and often completely impenetrable if you didn't already know what you were doing. The tar man page is the canonical example: It documents every flag and option in exhaustive detail, but if you just want to know how to extract a .tar.gz file, it's almost useless. (The actual flag combination is -xzvf, in case you're curious.) Stack Overflow exists largely because man pages like tar's left a gap between what the documentation said and what developers actually needed to know.
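To make that gap concrete, here's the whole round trip most people actually want from the tar man page, end to end:

```shell
# Pack a directory into a .tar.gz, then extract it again.
mkdir -p demo && echo "hello" > demo/readme.txt
tar -czf demo.tar.gz demo    # c = create, z = gzip, f = archive filename
rm -r demo                   # remove the original so the extraction is visible
tar -xzvf demo.tar.gz        # x = extract, z = gunzip, v = verbose, f = filename
cat demo/readme.txt
```

Four single-letter flags, buried in a page that documents dozens of them.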

And now we have AI assistants. You can ask Claude or ChatGPT about, say, Kubernetes, Terraform, or React, and you'll actually get useful answers, because these are all established projects that have been written about extensively and the training data is everywhere.

But AI hits a hard wall at the boundary of its training data. If you've built something new (a framework, an internal platform, a tool your team created), no model has ever seen it. Your users can't ask their AI assistant for help, because the AI doesn't know your thing even exists.

There's been a lot of great work moving AI documentation in the right direction. AGENTS.md tells AI coding agents how to work in your codebase, treating the AI as a developer. llms.txt gives models a structured summary of your external documentation, treating the AI as a search engine. What's been missing is a practice for treating the AI as a support engineer. Every project needs configuration: input files, option schemas, workflow definitions, usually in the form of a whole bunch of JSON or YAML files with cryptic formats that users have to learn before they can do anything useful.

The toolkit pattern solves that problem of getting AIs to write configuration files for a project that isn't in their training data. It consists of a documentation file that teaches any AI enough about your project's configuration that it can generate working inputs from a plain-English description, without your users ever having to learn the format themselves. Developers have been arriving at this same pattern (or something very similar) independently from different directions, but as far as I can tell, nobody has named it or described a method for doing it well. This article distills what I learned from building the toolkit for Octobatch pipelines into a set of practices you can apply to your own projects.
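To make the idea concrete, a toolkit file for a hypothetical pipeline tool might open like this. The section names, rules, and fields below are purely illustrative, not taken from Octobatch's actual TOOLKIT.md:

```markdown
# TOOLKIT.md — how to write pipeline configs for this project

## Building blocks
A pipeline is one YAML file plus any templates it references.
Every pipeline needs a `name`, a list of `steps`, and an `output` block.

## Constraints
- Step names must be unique and snake_case.
- Expression steps may only reference fields produced by earlier steps.

## Worked example
"Run 1,000 trials of a coin flip and report the average" becomes:

    name: coin_flip
    trials: 1000
    steps:
      - id: flip
        type: expression
        expr: "1 if random() < 0.5 else 0"
    output:
      aggregate: mean

## When generating a config
Ask the user what they want in plain English, map it onto the building
blocks above, and check every field against the constraints section.
```

The point isn't this particular layout; it's that the file states the building blocks, the constraints, and at least one worked example, so a model with no prior knowledge of the tool can produce valid input.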

Build the AI its own manual

Traditionally, developers face a trade-off with configuration: keep it simple and easy to understand, or let it grow to handle real complexity and accept that it now requires a manual. The toolkit pattern emerged for me while I was building Octobatch, the batch-processing orchestrator I've been writing about in this series. As I described in the previous articles, "The Accidental Orchestrator" and "Keep Deterministic Work Deterministic," Octobatch runs complex multistep LLM pipelines that generate files or run Monte Carlo simulations. Each pipeline is defined by a complex configuration that consists of YAML, Jinja2 templates, JSON schemas, expression steps, and a set of rules tying it all together. The toolkit pattern let me sidestep that traditional trade-off.

As Octobatch grew more complex, I found myself relying on the AIs (Claude and Gemini) to build configuration files for me, which turned out to be genuinely valuable. When I developed a new feature, I'd work with the AIs to come up with the configuration structure to support it. At first I defined the configuration myself, but by the end of the project I relied on the AIs to produce the first cut, and I'd push back when something seemed off or not forward-looking enough. Once we all agreed, I'd have an AI produce the actual updated config for whatever pipeline we were working on. Having the AIs do the heavy lifting of writing the configuration let me create a very robust format very quickly, without spending hours updating existing configurations every time I changed the syntax or semantics.

At some point I realized that every time a new user wanted to build a pipeline, they faced the same learning curve and implementation challenges I'd already worked through with the AIs. The project already had a README.md file, and every time I changed the configuration I had an AI update it to keep the documentation current. But by this time, the README.md file was doing way too much work: It was genuinely comprehensive but a real headache to read. It had eight separate subdocuments showing the user how to do pretty much everything Octobatch supported, the bulk of it focused on configuration, and it was becoming exactly the kind of documentation nobody ever wants to read. That particularly bothered me as a writer; I'd produced documentation that was genuinely painful to read.

Looking back at my chats, I can trace how the toolkit pattern developed. My first instinct was to build an AI-assisted editor. About four weeks into the project, I described the idea to Gemini:

I'm thinking about how to provide some kind of AI-assisted tool to help people create their own pipeline. I was thinking about a feature we'd call "Octobatch Studio" where we make it easy to prompt for editing pipeline stages, possibly assisting in creating the prompts. But maybe instead we include a lot of documentation in Markdown files, expect them to use Claude Code, and give plenty of guidance for creating it.

I can actually see the pivot to the toolkit pattern happening in real time in this later message I sent to Claude. It had sunk in that my users could use Claude Code, Cursor, or another AI as interactive documentation to build their configs exactly the same way I'd been doing:

My plan is to use Claude Code as the IDE for creating new pipelines, so people who want to create them can just spin up Claude Code and start generating them. That means we need to give Claude Code specific context files to tell it everything it needs to know to create the pipeline YAML config with asteval expressions and Jinja2 template files.

The traditional trade-off between simplicity and flexibility comes from cognitive overhead: the cost of holding all of a system's rules, constraints, and interactions in your head while you work with it. It's why many developers opt for simpler config files, so they don't overload their users (or themselves). Once the AI was writing the configuration, that trade-off disappeared. The configs could get as complicated as they needed to be, because I wasn't the one who had to remember how all the pieces fit together. At some point I realized the toolkit pattern was worth standardizing.

That toolkit-based workflow, in which users describe what they want and the AI reads TOOLKIT.md and generates the config, is the core of the Octobatch user experience now. A user clones the repo and opens Claude Code, Cursor, or Copilot, the same way they would with any open source project. Every configuration prompt starts the same way: "Read pipelines/TOOLKIT.md and use it as your guide." The AI reads the file, understands the project structure, and guides them step by step.

To see what this looks like in practice, take the Drunken Sailor pipeline I described in "The Accidental Orchestrator." It's a Monte Carlo random walk simulation: A sailor leaves a bar and stumbles randomly toward the ship or the water. The pipeline configuration for that involves multiple YAML files, JSON schemas, Jinja2 templates, and expression steps with real mathematical logic, all wired together with specific rules.
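To give a flavor of what one of those files might contain, here's an illustrative sketch of a random-walk pipeline config. The field names here are hypothetical stand-ins, not Octobatch's actual schema:

```yaml
# Illustrative sketch of a Monte Carlo random-walk pipeline
# (hypothetical field names, not Octobatch's real format).
name: drunken_sailor
trials: 10000
steps:
  - id: init
    type: expression
    expr: "position = 0"            # the sailor starts at the bar
  - id: stumble
    type: expression
    repeat_until: "abs(position) >= 10"
    expr: "position += 1 if random() < 0.5 else -1"
  - id: outcome
    type: expression
    expr: "'ship' if position > 0 else 'water'"
output:
  aggregate: "fraction of trials ending in 'ship'"
```

Even in this simplified form, you can see why a user shouldn't have to learn the format by hand: the step ordering, the expression syntax, and the rules about which fields a step may reference are exactly the kind of constraints the toolkit file exists to teach the AI.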

Drunken Sailor is Octobatch’s simplest “Hello, World!” Monte Carlo pipeline, but it still has 148 lines of config spread across four files.

Here's the prompt that generated all of that. The user describes what they want in plain English, and the AI produces the entire configuration by reading TOOLKIT.md. This is the actual prompt I gave Claude Code to generate the Drunken Sailor pipeline; notice the first line of the prompt, telling it to read the toolkit file.

You don’t need to know Octobatch to understand the prompt I used to create the Drunken Sailor pipeline.

But configuration generation is only half of what the toolkit file does. Users can also upload TOOLKIT.md and PROJECT_CONTEXT.md (which has background information about the project) to any AI assistant (ChatGPT, Gemini, Claude, Copilot, whatever they like) and use it as interactive documentation. A pipeline run finished with validation failures? Upload the two files and ask what went wrong. Stuck on how retries work? Ask. You can even paste in a screenshot of the TUI and say, "What do I do?" and the AI will read the screen and give specific advice. The toolkit file turns any AI into an on-demand support engineer for your project.

The toolkit helps turn ChatGPT into an AI manual that helps with Octobatch.

What the Octobatch project taught me about the toolkit pattern

Building the generative toolkit for Octobatch produced more than just documentation that an AI could use to create working configuration files; it also yielded a set of practices, and those practices turn out to be fairly consistent regardless of what kind of project you're building. Here are the five that mattered most:

  • Start with the toolkit file and grow it from failures. Don't wait until the project is finished to write the documentation. Create the toolkit file first, then let each real failure add one principle at a time.
  • Let the AI write the config files. Your job is product vision: what the project should do and how it should feel. The AI's job is translating that into valid configuration.
  • Keep guidance lean. State the principle, give one concrete example, move on. Every guardrail costs tokens, and bloated guidance makes AI performance worse.
  • Treat every use as a test. There's no separate testing phase for documentation. Every time somebody uses the toolkit file to build something, that's a test of whether the documentation works.
  • Use more than one model. Different models catch different things. In a three-model audit of Octobatch, three-quarters of the defects were caught by only one model.

I'm not proposing a standard format for a toolkit file, and I think trying to create one would be counterproductive. Configuration formats differ wildly from tool to tool (that's the whole problem we're trying to solve), and a toolkit file that describes your project's building blocks is going to look completely different from one that describes somebody else's. What I found is that the AI is perfectly capable of reading whatever you give it, and is probably better at writing the file than you are anyway, because it's writing for another AI. These five practices should help you build an effective toolkit regardless of what your project looks like.

Start with the toolkit file and grow it from failures

You can start building a toolkit at any point in your project. The way it happened for me was organic: After weeks of working with Claude and Gemini on Octobatch configuration, the knowledge about what worked and what didn't was scattered across dozens of chat sessions and context files. I wrote a prompt asking Gemini to consolidate everything it knew about the config format (the structure, the rules, the constraints, the examples, everything we'd talked about) into a single TOOLKIT.md file. That first version wasn't great, but it was a starting point, and every failure after that made it better.

I didn't plan the toolkit from the beginning of the Octobatch project. It started because I wanted my users to be able to build pipelines the same way I had, by working with an AI, but everything they'd need to do that was spread across months of chat logs and the CONTEXT.md files I'd been maintaining to bootstrap new development sessions. Once I had Gemini consolidate everything into a single TOOLKIT.md file and had Claude review it, I treated it the way I treat any other code: Every time something broke, I found the root cause, worked with the AIs to update the toolkit to account for it, and verified that a fresh AI session could still use it to generate valid configuration.

That incremental approach worked well for me, and it let me test my toolkit the way I test any other code: try it out, find bugs, fix them, rinse, repeat.

You can do the same thing. If you're starting a new project, you could plan to create the toolkit at the end. But it's easier to start with a simple version early and let it emerge over the course of development. That way you're dogfooding it the whole time instead of guessing what users will need.

Let the AI write the config files (but stay in control!)

Early Octobatch pipelines had configuration simple enough that a human could read and understand it, but not because I was writing it by hand. One of the ground rules I set for the Octobatch experiment in AI-driven development was that the AIs would write all of the code, and that included writing all of the configuration files. The problem was that even though they were doing the writing, I was unconsciously constraining the AIs: pushing back on anything that felt too complex, steering toward structures I could still hold in my head.

At some point I realized my pushback was placing an artificial limit on the project. The whole point of having AIs write the config was that I didn't have to keep every single line in my head; it was okay to let the AIs handle that level of complexity. Once I stopped constraining them, the cognitive overhead limit I described earlier went away. I could have full pipelines defined in config, including expression steps with real mathematical logic, without needing to hold all the rules and relationships in my head.

Once the project really got rolling, I never wrote YAML by hand again. The cycle was always the same: need a feature, discuss it with Claude and Gemini, push back when something seemed off, and have one of them produce the updated config. My job was product vision. Their job was translating that into valid configuration. And every config file they wrote was another test of whether the toolkit actually worked.

This division of labor, however, meant inevitable disagreements between me and the AI, and it's not always easy to find yourself disagreeing with a machine, because they're surprisingly stubborn (and occasionally shockingly stupid). It required patience and vigilance to stay in control of the project, especially after I turned over large responsibilities to the AIs.

The AIs consistently optimized for technical correctness (separation of concerns, code organization, effort estimation), which was fine, because that's the job I asked them to do. I optimized for product value. I found that keeping that value as my north star, and always focusing on building useful features, consistently helped resolve these disagreements.

Keep guidance lean

Once you start growing the toolkit from failures, the natural progression is to overdocument everything. Generative AIs are biased toward generating, and it's easy to let them get carried away with it. Every bug feels like it deserves a warning, every edge case feels like it needs a caveat, and before long your toolkit file is bloated with guardrails that cost tokens without adding much value. And since the AI is the one writing your toolkit updates, you have to push back on it the same way you push back on architecture decisions. AIs love adding WARNING blocks and exhaustive caveats. The discipline you have to bring is telling them when not to add something.

The right level is to state the principle, give one concrete example, and trust the AI to apply it to new situations. When Claude Code made a choice about JSON schema constraints that I might have second-guessed, I had to decide whether to add more guardrails to TOOLKIT.md. The answer was no: the guidance was already there, and the choice it made was actually correct. If you keep tightening guardrails every time an AI makes a judgment call, the signal gets lost in the noise and performance gets worse, not better. When something goes wrong, the impulse, for both you and the AI, is to add a WARNING block. Resist it. One principle, one example, move on.

Treat every use as a test

There was no separate "testing phase" for Octobatch's TOOLKIT.md. Every pipeline I created with it was a new test. After the very first version, I opened a fresh Claude Code session that had never seen any of my development conversations, pointed it at the newly minted TOOLKIT.md, and asked it to build a pipeline. The first time I tried it, I was surprised at how well it worked! So I kept using it, and as the project rolled along, I updated it with every new feature and tested those updates. When something failed, I traced it back to a missing or unclear rule in the toolkit and fixed it there.

That's the practical test for any toolkit: open a fresh AI session with no context beyond the file, describe what you want in plain English, and see if the output works. If it doesn't, the toolkit has a bug.

Use more than one model

When you're building and testing your toolkit, don't just use one AI. Run the same task through a second model. A good pattern that worked for me was consistently having Claude generate the toolkit and Gemini check its work.

Different models catch different things, and this matters for both developing and testing the toolkit. I used Claude and Gemini together throughout Octobatch development, and I overruled both when they were wrong about product intent. You can do the same thing: If you work with multiple AIs throughout your project, you'll start to get a feel for the different kinds of questions they're good at answering.

When you have multiple models generate config from the same toolkit independently, you find out fast where your documentation is ambiguous. If two models interpret the same rule differently, the rule needs rewriting. That's a signal you can't get from using only one model.
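One lightweight way to act on that signal is to diff the two models' outputs field by field. A minimal sketch, where the two configs are stand-ins for whatever your models actually generated:

```python
# Compare two independently generated configs and flag disagreements,
# which usually point at an ambiguous rule in the toolkit file.
def config_disagreements(a: dict, b: dict, path: str = "") -> list[str]:
    diffs = []
    for key in sorted(set(a) | set(b)):
        here = f"{path}.{key}" if path else key
        if key not in a or key not in b:
            diffs.append(f"{here}: present in only one config")
        elif isinstance(a[key], dict) and isinstance(b[key], dict):
            diffs.extend(config_disagreements(a[key], b[key], here))
        elif a[key] != b[key]:
            diffs.append(f"{here}: {a[key]!r} vs {b[key]!r}")
    return diffs

# Example: two models read the same toolkit rule differently.
claude_cfg = {"retry": {"max_attempts": 3, "backoff": "exponential"}}
gemini_cfg = {"retry": {"max_attempts": 5, "backoff": "exponential"}}
print(config_disagreements(claude_cfg, gemini_cfg))
# → ["retry.max_attempts: 3 vs 5"]
```

Every line it prints marks a rule worth rewriting; structural disagreements are the cheapest ambiguity detector you'll find.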

The manual, revisited

That AT&T PC 6300 manual devoted a full page to labeling diskettes, which may have been overkill, but it got one thing right: It described the building blocks and trusted the reader to figure out the rest. It just had the wrong reader in mind.

The toolkit pattern is the same idea, pointed at a different audience. You write a file that describes your project's configuration format, its constraints, and enough worked examples that any AI can generate working inputs from a plain-English description. Your users never have to learn YAML or memorize your schema, because they have a conversation with the AI and it handles the translation.

If you're building a project and you want AI to be able to support your users, start here: Write the toolkit file before you write the README, grow it from real failures instead of trying to plan it all upfront, keep it lean, test it by using it, and use more than one model, because no single AI catches everything.

The AT&T manual's Chapter 4 was called "What Every User Should Know." Your toolkit file is "What Every AI Should Know." The difference is that this time, the reader will actually use it.

In the next article, I'll start with a statistic about developer trust in AI-generated code that turned out to be fabricated by the AI itself, and use it to explain why I built a quality playbook that revives the traditional quality practices most teams cut decades ago. The playbook explores an unfamiliar codebase, generates a complete quality infrastructure (tests, review protocols, validation rules), and finds real bugs in the process. It works across Java, C#, Python, and Scala, and it's available as an open source Claude Code skill.
