A Pinocchio puppet — the agent said it was fixed, but it wasn't


Series: What Actually Happened — Real incidents from building a governed AI-native platform


This is the second incident in this series. The first covered an agent that modified 700 files without being asked — and what structural controls stopped it from reaching production.


The Failure That Kept Returning

It started with an assumption I hadn't examined.

A persistent cluster was running. Platform installed, applications deployed, ingress configured, DNS resolving through Route 53, Backstage accessible. The persistent cluster was the proof of concept. We knew the setup worked. The next task was making it reproducible: spin up an ephemeral cluster on demand, run the full stack from cold, tear it down when done.

The persistent cluster proved it was possible. The ephemeral cluster proved that possible and reproducible are two different things.

The first failure was expected. Ephemeral clusters are hard. We were working with an AI agent to configure the ingress controller, DNS routing through Route 53, and Backstage. Problems emerged. The agent identified them, applied fixes to the running cluster, confirmed they were working, and committed the changes to the repo. CI passed. PRs merged. At the end of the session, with every indicator green, the agent confirmed: everything is fixed. It should work on the next deploy.

The next deploy failed.

The same symptoms. The same configuration errors. The same ingress failures. We went back to the agent. It fixed them again, on the running cluster. Confirmed working. Committed. CI passed. Same result on the next cold start.

By the third cycle the pattern was clear, and it was costing real time. The changes were in the repo. The PRs had merged. The tests had passed. But every cold start produced the same failures.

The system was non-deterministic. Non-deterministic infrastructure, at any speed, is infrastructure you cannot trust.


Why "Fixed" Meant Two Different Things

The agent was producing fixes that only worked once.

Each fix was applied to a cluster with accumulated state: existing ConfigMaps, running controllers, mounted secrets, DNS entries that had already propagated. In that context, every fix worked. The ingress controller was already running. The DNS was already resolving. Adding a configuration change to a system that had already bootstrapped itself made the change look permanent.

The ephemeral cluster starts from zero. No accumulated state. No bootstrapped components. No prior DNS propagation. The fix that corrected a running system hit a dependency on a cold start that didn't yet exist: a component that hadn't initialised, a resource that wasn't present, a sequence that hadn't run. The error returned because the fix assumed a world that a cold cluster doesn't provide.
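The mechanism reduces to a few lines of shell. This is a sketch, not the real incident: a scratch directory stands in for cluster state, and the component name and the "fix" are invented for illustration.

```shell
#!/usr/bin/env bash
# A "fix" that assumes accumulated state: it patches a config that bootstrap
# created earlier. On the warm cluster the file exists; on a cold start it
# doesn't. All names here are illustrative.
set -euo pipefail

STATE="$(mktemp -d)"

apply_fix() {
  # Assumes ingress-config already exists -- true only after bootstrap has run.
  [ -f "$STATE/ingress-config" ] || { echo "FAIL: no ingress-config" >&2; return 1; }
  echo "patched" > "$STATE/ingress-config"
}

touch "$STATE/ingress-config"            # warm cluster: bootstrap ran long ago
apply_fix && echo "warm: fix applied"

rm -rf "$STATE" && STATE="$(mktemp -d)"  # cold cluster: zero accumulated state
apply_fix || echo "cold: the same fix fails"
```

The same function succeeds and fails depending only on what ran before it. That is the whole incident in miniature.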

An AI agent optimises for the task in front of it. If the task is "fix the ingress error," the task ends when the ingress error is gone. Whether that fix holds under every future condition is outside the scope of the task as stated. Determinism is a property the platform has to require.


What We Tried First: Prompt Templates

The first instinct was to solve the problem with better instructions. We introduced structured prompt templates: a TDD-oriented sequence that the agent would follow at the start of each session. Write the test first. Then write the fix. Commit both together. The template was designed to make the agent's process deterministic by making it explicit.

It improved average behaviour. Outcomes remained inconsistent.

The agent read the prompt template and interpreted it. On some sessions it followed the sequence precisely. On others it decided the test was straightforward and moved directly to the fix, writing the test afterward or skipping it when the fix appeared obvious. The template was a declaration. The agent treated it as guidance, applied it where it seemed relevant, and made judgement calls about where it did not.

A prompt template in a markdown file is advisory. It tells the agent what it should do. It has no mechanism to enforce what it must do. The agent closes the session with the fix committed and the indicators green. Whether the process was followed is invisible to anything downstream.

Optimising the declaration improved the average. Optimising enforcement changed the outcome.


What Actually Delivered Determinism

The shift came when we moved the requirement from the instructions into the pipeline.

Bootstrap sequence tests for every component from cold start replaced the assumption that fixes applied to a running cluster would hold. Each test exercises the full initialisation path in the correct dependency order: spin up, configure, verify. Dependencies are declared explicitly. The test suite proves the system reaches a known state from a known starting point, every time.
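A minimal sketch of the shape of such a test, with a temp directory standing in for the empty cluster and invented component names; the real harness drives actual cluster tooling, but the property being proved is the same.

```shell
#!/usr/bin/env bash
# Cold-start bootstrap test sketch: every run begins from nothing, each step
# declares its dependencies, and a missing dependency fails fast.
set -euo pipefail

WORKDIR="$(mktemp -d)"   # fresh on every run: no accumulated state

# bootstrap_step NAME [DEP...] -- verify declared deps, then mark NAME done.
bootstrap_step() {
  local step="$1"; shift
  local dep
  for dep in "$@"; do
    [ -f "$WORKDIR/$dep" ] || { echo "FAIL: $step requires $dep" >&2; return 1; }
  done
  touch "$WORKDIR/$step"
  echo "ok: $step"
}

# Declared order: ingress first, then DNS, then Backstage (illustrative).
bootstrap_step ingress-controller
bootstrap_step external-dns ingress-controller
bootstrap_step backstage external-dns ingress-controller
echo "cold start reached known state"
```

Because the starting point is recreated on every run, a fix that depends on leftover state cannot pass this test by accident.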

Idempotent bootstrap scripts with explicitly sequenced dependencies replaced accumulated-state assumptions. An idempotent script produces the same result whether it runs on a fresh cluster or one that is already partially configured. That property is what makes outcomes deterministic across environments.
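The check-then-act shape of an idempotent step, sketched with a directory standing in for a namespace. The names are illustrative; in a real script the same guard wraps the cluster call, or you reach for a declaratively idempotent tool like `kubectl apply`.

```shell
#!/usr/bin/env bash
# Idempotent bootstrap step: check before acting, so the script converges to
# the same end state whether the resource already exists or not.
set -euo pipefail

STATE="$(mktemp -d)"

ensure_namespace() {
  local ns="$1"
  if [ -d "$STATE/$ns" ]; then
    echo "$ns: already present, nothing to do"
  else
    mkdir "$STATE/$ns"   # in a real script: create the namespace here
    echo "$ns: created"
  fi
}

ensure_namespace platform   # fresh cluster: creates the resource
ensure_namespace platform   # re-run: converges, no error, no duplicate
```

Run it once or run it ten times: the end state is identical, which is exactly the property a cold start needs.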

The session closing protocol changed. Agents now commit a test alongside every fix, in the same PR as the change. The test must pass on a cold cluster. A fix without a paired cold-start test is a hypothesis.

Ephemeral clusters now spin up, run the full bootstrap sequence, pass integration tests, and tear down on every PR that touches infrastructure. The CI gate passes only on a demonstrated cold start: determinism is verified at the merge gate.
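The gate itself can be a small script in CI. This is a hedged sketch with invented paths: in a real pipeline the changed-file list would come from something like `git diff --name-only` against the target branch.

```shell
#!/usr/bin/env bash
# Merge-gate sketch: a change under infra/ must ship with a cold-start test
# under tests/bootstrap/ in the same PR. Paths are illustrative.
set -euo pipefail

check_gate() {
  local changed="$1"   # newline-separated list of files touched by the PR
  if echo "$changed" | grep -q '^infra/'; then
    echo "$changed" | grep -q '^tests/bootstrap/' || {
      echo "gate: infra change without a paired cold-start test" >&2
      return 1
    }
  fi
  echo "gate: ok"
}

check_gate $'infra/ingress.yaml\ntests/bootstrap/ingress_test.sh'
```

A PR that touches only docs passes untouched; a PR that touches infrastructure without its paired test never reaches the merge button.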

The prompt template told the agent what it should do. The CI gate defines what the pipeline requires before a merge proceeds. One is a process the agent can interpret. The other is a structural requirement enforced by the pipeline.

The principle that came out of this: a change is complete when there is a mechanism to prove it is permanent.


From Declared to Structural

This distinction — between declaring a process and enforcing it structurally — is the thread connecting this incident to the broader platform direction.

Prompt templates set expectations. They improve average behaviour. An agent that follows the template produces deterministic results. An agent that interprets the template produces results that depend on how it interprets it.

The CI gate was the first structural enforcement point for this class of problem. The merge gate requires a cold-start test. The agent must provide proof that the fix holds before a merge proceeds. That requirement lives in the pipeline, not in a document the agent reads.

The next step is moving enforcement earlier: from the merge gate to the execution path itself. When the platform can require, at the point of agent action, that every fix is accompanied by proof that it holds under every starting condition, the feedback loop collapses entirely. The platform verifies determinism at execution time. The agent produces fixes with their proof attached.

That is the governed runtime. This incident is where the need for it became concrete.


What Changed on This Platform

Bootstrap sequence tests are now part of the standard test run. Ephemeral cluster spin-up is part of CI. The session closing protocol is written into the agent's operating contract.

The failure mode in the 700-file incident was scope: an agent acting beyond its sanctioned boundary. The failure mode here was determinism: an agent producing fixes that worked in one context and failed in another. In both cases, the resolution followed the same path — from process documentation to structural enforcement.

The agent optimises for the immediate task. The platform has to extend that definition. A fix is complete when it is proven to be deterministic. That proof lives in the test suite. The test suite lives in the merge gate. And the merge gate is where declarations become requirements.


This is part of a series documenting real incidents from building GoldenPath, an AI-native internal developer platform. Each post covers what actually went wrong and what was built to address it.

Previously: An AI Agent Changed 700 Files in My Repository Without Being Asked.


Building multi-agent workflows and thinking about governance? Get in touch — we'd love to compare notes.