
There’s an open secret in the world of DevOps: no one trusts the CMDB. The Configuration Management Database (CMDB) is meant to be the “source of truth”: the central map of every server, service, and application in your enterprise. In theory, it’s the foundation for security audits, cost analysis, and incident response. In practice, it’s a work of fiction. The moment you populate a CMDB, it begins to rot. Engineers deploy a new microservice but forget to register it. An autoscaling group spins up 20 new nodes, but the database only records the original three…
We call this configuration drift, and for decades, our industry’s solution has been to throw more scripts at the problem. We write massive, brittle ETL (Extract-Transform-Load) pipelines that attempt to scrape the world and shove it into a relational database. It never works. The “world”, especially the modern cloud native world, moves too fast.
We realized we couldn’t solve this problem by writing better scripts. We had to change the fundamental architecture of how we sync data. We stopped trying to boil the ocean and fix the entire enterprise at once. Instead, we focused on one notoriously difficult environment: Kubernetes. If we could build an autonomous agent capable of reasoning about the complex, ephemeral state of a Kubernetes cluster, we could prove a pattern that works everywhere else. This article explores how we used the newly open-sourced Codex CLI and the Model Context Protocol (MCP) to build that agent. In the process, we moved from passive code generation to active infrastructure operation, transforming the “stale CMDB” problem from a data entry task into a logic puzzle.
The Shift: From Code Generation to Infrastructure Operation with Codex CLI and MCP
The reason most CMDB initiatives fail is ambition. They try to track every switch port, virtual machine, and SaaS license simultaneously. The result is a data swamp: too much noise, not enough signal. We took a different approach. We drew a small circle around a specific domain: Kubernetes workloads. Kubernetes is the perfect testing ground for AI agents because it’s high-velocity and declarative. Things change constantly. Pods die; deployments roll over; services change selectors. A static script struggles to distinguish between a CrashLoopBackOff (a temporary error state) and a purposeful scale-down. We hypothesized that a large language model (LLM), acting as an operator, could understand this nuance. It wouldn’t just copy data; it would interpret it.
The Codex CLI turned this hypothesis into a tangible architecture by enabling a shift from “code generation” to “infrastructure operation.” Instead of treating the LLM as a junior programmer that writes scripts for humans to review and run, Codex empowers the model to execute code itself. We provide it with tools, executable functions that act as its hands and eyes, via the Model Context Protocol. MCP defines a clear interface between the AI model and the outside world, allowing us to expose high-level capabilities like cmdb_stage_transaction without teaching the model the complex internal API of our CMDB. The model learns to use the tool, not the underlying API.
The architecture of agency
Our system, which we call k8s-agent, consists of three distinct layers. This isn’t a single script running top to bottom; it’s a cognitive architecture.
The cognitive layer (Codex + contextual instructions): This is the Codex CLI running a specific system prompt. We don’t fine-tune the model weights. Infrastructure moves too fast for fine-tuning: a model trained on Kubernetes v1.25 would be hallucinating by v1.30. Instead, we use context engineering, the art of designing the environment in which the AI operates. This involves tool design (creating atomic, deterministic functions), prompt architecture (structuring the system prompt), and information architecture (deciding what information to hide or expose). We feed the model a persistent context file (AGENTS.md) that defines its persona: “You are a meticulous infrastructure auditor. Your goal is to ensure the CMDB accurately reflects the state of the Kubernetes cluster. You must prioritize safety: do not delete records unless you have positive confirmation that they are orphans.”
The tool layer: Using MCP, we expose deterministic Python functions to the agent.
- Sensors: k8s_list_workloads, cmdb_query_service, k8s_get_deployment_spec
- Actuators: cmdb_stage_create, cmdb_stage_update, cmdb_stage_delete
Note that we track workloads (Deployments, StatefulSets), not Pods. Pods are ephemeral; tracking them in a CMDB is an antipattern that creates noise. The agent understands this distinction, a semantic rule that’s hard to enforce in a rigid script.
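To make the tool layer concrete, here is a minimal Python sketch. The registry and `tool` decorator are stand-ins for a real MCP server’s tool registration, and the canned return values stand in for live Kubernetes and CMDB calls, so treat this as an illustration of the sensor/actuator split rather than our actual implementation:

```python
# Hypothetical in-memory registry standing in for an MCP server's tool table.
TOOLS = {}

def tool(fn):
    """Register a function as an agent-callable tool (stand-in for MCP registration)."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def k8s_list_workloads(namespace: str) -> list:
    """Sensor: list Deployments/StatefulSets (not Pods) in a namespace."""
    # In production this would query the Kubernetes API; here we return canned data.
    return [{"kind": "Deployment", "name": "payment-processor-v1",
             "namespace": namespace}]

@tool
def cmdb_stage_delete(record_id: str, comment: str) -> dict:
    """Actuator: stage a deletion for review; never touches production directly."""
    return {"op": "delete", "record_id": record_id,
            "comment": comment, "status": "staged"}
```

The key design property is that sensors are read-only and actuators only ever stage changes, which keeps every tool safe for the model to call freely.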
The state layer (the safety net): LLMs are probabilistic; infrastructure must be deterministic. We bridge this gap with a staging pattern. The agent never writes directly to the production database. It writes to a staged diff. This allows a human (or a policy engine) to review the proposed changes before they’re committed.
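A minimal sketch of the staging pattern follows. The class name and record shapes are illustrative, not our production schema; the point is that agent writes accumulate in a pending diff and only reach the “production” store through an explicit, approved commit:

```python
class StagedCMDB:
    """Safety net: agent writes go to a pending diff, never straight to production."""

    def __init__(self):
        self.production = {}  # record_id -> record (the "real" CMDB)
        self.staged = []      # pending operations awaiting review

    def stage(self, op, record_id, payload=None, comment=""):
        """Called by the agent's actuator tools; side-effect-free for production."""
        self.staged.append({"op": op, "record_id": record_id,
                            "payload": payload, "comment": comment})

    def commit(self, approved_by):
        """Apply staged ops only after a human or policy engine approves them."""
        for change in self.staged:
            if change["op"] in ("create", "update"):
                self.production[change["record_id"]] = change["payload"]
            elif change["op"] == "delete":
                self.production.pop(change["record_id"], None)
        applied, self.staged = self.staged, []
        return {"applied": len(applied), "approved_by": approved_by}
```

Because `commit` is the only path that mutates production, a bad agent decision can at worst produce a rejected diff, not corrupted data.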
The OODA Loop in Action
How does this differ from a typical sync script? A script follows a linear path: Connect → Fetch → Write. If any step fails or returns unexpected data, the script crashes or corrupts data. Our agent follows the Observe-Orient-Decide-Act (OODA) loop, popularized by military strategists. Unlike a linear script that executes blindly, the OODA loop forces the agent to pause and synthesize information before taking action. This cycle allows it to handle incomplete data, verify assumptions, and adapt to changing conditions: traits essential for operating in a distributed system.
Let’s walk through a real scenario we encountered during our pilot, the Ghost Deployment, to explore the benefits of using an OODA loop. A developer had deleted a deployment named payment-processor-v1 from the cluster but forgot to remove the record from the CMDB. A standard script might pull the list of deployments, see payment-processor-v1 is missing, and immediately issue a DELETE to the database. The risk is obvious: what if the API server was just timing out? What if the script had a bug in its pagination logic? The script blindly destroys data based on the absence of evidence.
The agent approach is fundamentally different. First, it observes: calling k8s_list_workloads and cmdb_query_service and noticing the discrepancy. Second, it orients: checking its context instructions to “verify orphans before deletion” and deciding to call k8s_get_event_history. Third, it decides: seeing a “delete” event in the history, it reasons that the resource is missing because it was deliberately deleted. Finally, it acts: calling cmdb_stage_delete with a comment confirming the deletion. The agent didn’t just sync data; it investigated. It handled the ambiguity that usually breaks automation.
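In practice the model chooses these steps itself, but the decision structure it follows can be sketched as a plain function. The helper callables here (`get_event_history`, `stage_delete`) are hypothetical stand-ins for the tools described above:

```python
def reconcile_missing(record_id, live_names, get_event_history, stage_delete):
    """Observe -> Orient -> Decide -> Act for a CMDB record with no live counterpart."""
    # Observe: is the record actually absent from the live cluster listing?
    if record_id in live_names:
        return "in-sync"
    # Orient: our instructions say to verify orphans before deletion,
    # so consult the event history instead of trusting absence alone.
    events = get_event_history(record_id)
    # Decide: only a positive deletion signal justifies removal.
    if any("delete" in event.lower() for event in events):
        # Act: stage (never execute) the delete, with evidence attached.
        stage_delete(record_id, comment=f"Confirmed deletion event: {events[-1]}")
        return "staged-delete"
    # Absence of evidence is not evidence of absence: escalate instead.
    return "needs-human-review"
```

The crucial branch is the last one: when the agent cannot positively confirm a deletion, it escalates rather than destroys, which is exactly the behavior a linear sync script lacks.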
Closing the “Semantic Gap”
This specific Kubernetes use case highlights a broader problem in IT operations: the “semantic gap.” The data in our infrastructure (JSON, YAML, logs) is full of implicit meaning. A label env: production changes the criticality of a resource. A status of CrashLoopBackOff means “broken,” but Completed means “finished successfully.” Traditional scripts require us to hardcode every permutation of this logic, resulting in thousands of lines of unmaintainable if/else statements. With the Codex CLI, we replace those thousands of lines of code with a few sentences of English in the system prompt: “Ignore Jobs that have completed successfully. Sync failing Jobs so we can track instability.” The LLM bridges the semantic gap. It understands what “instability” implies in the context of a job status. We’re describing our intent, and the agent is handling the implementation.
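For contrast, here is a toy version of the hardcoded status logic that one prompt sentence replaces. The status strings are illustrative; the point is that every condition the cluster can produce would need its own branch:

```python
def should_sync_job(status: str) -> bool:
    """Hardcoded version of 'ignore completed Jobs, sync failing ones'.

    Every new status string means another branch here; the prompt version
    states the intent once and lets the model interpret statuses it has
    never been explicitly programmed for.
    """
    if status == "Complete":
        return False  # finished successfully: noise, skip it
    if status in ("Failed", "BackoffLimitExceeded", "DeadlineExceeded"):
        return True   # instability worth tracking
    return True       # unknown status: sync defensively
```

Multiply this by every resource kind, label convention, and edge case, and the maintenance burden of the script-based approach becomes clear.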
Scaling Beyond Kubernetes
We started with Kubernetes because it’s the “hard mode” of configuration management. In a production environment with thousands of workloads, things change constantly. A standard script sees a snapshot and often gets it wrong. An agent, however, can work through the complexity. It might run its OODA loop several times to resolve a single issue: checking logs, verifying dependencies, and confirming rules before it ever makes a change. This ability to chain reasoning steps allows it to handle the scale and uncertainty that break traditional automation.
But the pattern we established, agentic OODA loops via MCP, is universal. Once we proved the model worked for Pods and Services, we realized we could extend it. For legacy infrastructure, we could give the agent tools to SSH into Linux VMs. For SaaS management, we could give it access to Salesforce or GitHub APIs. For cloud governance, we can ask it to audit AWS Security Groups. The beauty of this architecture is that the “brain” (the Codex CLI) stays the same. To support a new environment, we don’t need to rewrite the engine; we just hand it a new set of tools. However, moving to an agentic model forces us to confront new trade-offs. The most immediate is cost versus context. We learned the hard way that you shouldn’t give the AI the raw YAML of a Kubernetes deployment: it consumes too many tokens and distracts the model with irrelevant details. Instead, you create a tool that returns a digest, a simplified JSON object with only the fields that matter. This is context optimization, and it’s the key to running agents cost-effectively.
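A digest tool can be as simple as a projection over the raw manifest. The field selection below is our illustration, not a prescribed schema; the idea is that the agent sees a handful of decision-relevant fields instead of hundreds of lines of YAML:

```python
def deployment_digest(raw: dict) -> dict:
    """Context optimization: reduce a raw Deployment manifest to the few
    fields the agent needs, cutting token cost and irrelevant detail."""
    meta = raw.get("metadata", {})
    spec = raw.get("spec", {})
    status = raw.get("status", {})
    return {
        "name": meta.get("name"),
        "namespace": meta.get("namespace"),
        "env": meta.get("labels", {}).get("env"),  # criticality signal
        "replicas": spec.get("replicas"),
        "ready": status.get("readyReplicas", 0),
        "image": (spec.get("template", {}).get("spec", {})
                      .get("containers", [{}])[0].get("image")),
    }
```

A full Deployment manifest can run to hundreds of fields; a digest like this is a few dozen tokens, which compounds quickly when the agent inspects thousands of workloads per sync cycle.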
Conclusion: The Human in the Cockpit
There’s a fear that AI will replace the DevOps engineer. Our experience with the Codex CLI suggests the opposite. This technology doesn’t remove the human; it elevates them. It promotes the engineer from a “script writer” to a “mission commander.” The stale CMDB was never really a data problem; it was a labor problem. It was simply too much work for humans to track manually and too complex for simple scripts to automate. By introducing an agent that can reason, we finally have a mechanism capable of keeping up with the cloud.
We started with a small Kubernetes cluster. But the destination is an infrastructure that’s self-documenting, self-healing, and fundamentally intelligible. The era of the brittle sync script is over. The era of infrastructure as intent has begun.
