
Series: Governed AI Delivery with GoldenPath — Part 1
The World Has Already Changed
AI coding agents are already inside software delivery. Copilot writes code. Cursor refactors modules. Claude Code opens pull requests. Amazon mandated that 80 percent of its engineers use its AI coding tool weekly. The output is real, the velocity is measurable, and the governance structures at most organisations have not kept pace.
The assumption baked into most current deployments is that agents behave within acceptable bounds because they have been prompted to do so, because a human is watching, or because the CI pipeline will catch problems after the fact. The evidence from 2025 and 2026 suggests otherwise.
According to Financial Times reporting later summarised by Reuters, AWS suffered a 13-hour interruption in December 2025 after engineers allowed Amazon's Kiro tool to carry out certain changes, and the tool reportedly chose to delete and recreate the environment. Amazon disputed that characterisation and said the event stemmed from user error and misconfigured access controls, not from the AI system acting beyond its sanctioned scope.
The dispute itself is instructive. Whether Kiro acted autonomously or inherited misconfigured permissions, the outcome was the same: a production environment was destroyed and 13 hours of service were lost. By March 2026, Amazon had convened a company-wide engineering meeting to examine a pattern of outages tied to AI-assisted code changes. Additional human review controls were added to the deployment pipeline. Across the period, the reported number of affected orders reached 6.3 million.
The pattern appears beyond software delivery. Google's AI Overviews is a search product, not an engineering runtime, but the failure structure is the same. Rolled out in May 2024, the feature generated advice that users eat rocks for digestive health and add glue to pizza sauce. The outputs were coherent, confident, and sourced from the open web. Google confirmed the failures and added triggering restrictions, reducing the feature's coverage from 27 percent to 11 percent of queries within weeks. The common shape: a capable AI system operating at scale in a high-impact context, with no structural boundary between what it could produce and what it had been sanctioned to produce.
We experienced our own version of this before either of those incidents became public. In one of our early multi-model sessions, three AI models ran simultaneously across separate branches. One of them read an exploratory design conversation as a task list and systematically inserted metadata headers into 700 files across the repository. We documented the full incident in The 700-File Incident. The model was doing exactly what a capable agent does when given ambiguous scope and no structural stop condition.
That incident is what made the governed runtime a priority.
Why Traditional Controls Break Down
The first instinct when something like this happens is to add more policy. Write clearer prompts. Add a code review step. Add a gate in the CI pipeline.
These controls address the output of agent action. Code review evaluates the diff. CI tests evaluate the behaviour. By the time a reviewer looks at a 700-file changeset, the agent has already acted. The cost of reversing that action scales with the scope of what the agent did.
The second instinct is to write governance documentation: acceptable use policies, agent behaviour standards, prompt templates that encode constraints. Documentation works as guidance for humans who choose to apply it. An agent operating on a clear objective with capable tools will satisfy its objective by whatever path is available. If that path includes actions the documentation discourages, the documentation has no mechanism to stop it.
One common response to reliability failures in AI-assisted delivery is to reinsert human approval and procedural friction into the pipeline. That is a reasonable short-term response to a pattern of outages. It is not a sustainable operating model for organisations that adopted AI tools to accelerate delivery in the first place.
CI gates, code review, and human approval layers are output controls applied after the agent has acted. None of them address the source of the problem: agents acting in high-permission contexts without structural boundaries on what they are permitted to do.
The Real Gap: Declaration vs Enforcement
The missing layer is admission control, applied before any mutating action executes.
Governance frameworks that live in documents, wikis, or policy PDFs declare what agents should do. Structural enforcement determines what agents are permitted to do. The gap between those two things is where the production deletions and the 700-file incidents happen.
In traditional software delivery, humans make decisions at the point of action. A developer decides to open a pull request. A release engineer decides to deploy. The human is the enforcement point. When agents replace humans in parts of that flow, the enforcement point disappears unless the platform provides a structural substitute.
Think of the enforcement plane as an execution control layer sitting between the agent and any system it can mutate. Every agent action passes through it before reaching a real system. That layer cannot live inside the agent, because agents optimise for objectives. It cannot live only in CI, because CI runs after the agent has acted. It has to live in the execution path itself.
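The shape of that layer can be sketched in Go, the language the runtime core uses. Everything here is illustrative, not the GoldenPath API: a single choke point every mutating operation must pass through, with deny-by-default as the only fallback.

```go
package main

import (
	"errors"
	"fmt"
)

// Request describes a proposed mutating operation. Illustrative shape only.
type Request struct {
	Agent  string
	Action string // e.g. "file.write"
	Target string
}

// Gateway sits in the execution path: nothing mutates without its approval.
type Gateway struct {
	allowed map[string]bool // crude policy table: "agent|action|target" -> permitted
}

var ErrDenied = errors.New("denied: no policy grants this operation")

// Authorize is the single choke point. Deny-by-default: the absence of a
// matching policy entry means rejection, not a warning.
func (g *Gateway) Authorize(r Request) error {
	if g.allowed[r.Agent+"|"+r.Action+"|"+r.Target] {
		return nil
	}
	return ErrDenied
}

// Execute runs the operation only if Authorize passed first.
func (g *Gateway) Execute(r Request, op func() error) error {
	if err := g.Authorize(r); err != nil {
		return err
	}
	return op()
}

func main() {
	gw := &Gateway{allowed: map[string]bool{
		"agent-1|file.write|docs/readme.md": true,
	}}
	okErr := gw.Execute(Request{"agent-1", "file.write", "docs/readme.md"},
		func() error { fmt.Println("write executed"); return nil })
	deniedErr := gw.Execute(Request{"agent-1", "file.delete", "prod/db"},
		func() error { return nil })
	fmt.Println(okErr, deniedErr)
}
```

The essential property is that the `op` callback never runs unless the gateway said yes; the agent proposes, the layer disposes.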
What We Decided to Build
GoldenPath implements a goal-based agent runtime centred on a transport-neutral enforcement gateway. Every mutating agent operation routes through this gateway. The gateway evaluates each request against active policy, issues a scoped capability token for the specific operation requested, mediates execution, and emits a signed receipt when the operation completes.
The principle governing the entire system: an agent is trusted to propose. It is not trusted to execute directly.
To make that concrete, here is what happens when an agent attempts to edit a file:
- The agent submits the operation to the enforcement gateway.
- The gateway evaluates the request against active policy: is this agent authorised to write to this file, in this scope, in this session?
- If the policy check passes, the gateway issues a scoped capability token for that specific file and operation.
- The operation executes under that token. The agent cannot widen its scope mid-execution.
- The gateway emits a signed receipt recording what was authorised, what executed, and when.
- At merge, the CI gate verifies that the receipt exists and is valid. Changes without valid attestation are rejected.
Deny-by-default means the absence of a valid token results in rejection. The agent cannot act first and justify later.
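The steps above can be sketched end to end. This is a minimal illustration in Go, not the production protocol: the policy function, token shape, and signing scheme are all stand-ins, but the ordering is the point, with authorisation before execution and a signed receipt after.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

// Token scopes one operation to one target for a bounded window.
type Token struct {
	Agent, Action, Target string
	Expires               time.Time
}

// Receipt records what was authorised and executed, signed so a later
// gate can verify it without trusting the agent.
type Receipt struct {
	Token     Token
	Completed time.Time
	Signature string
}

var signingKey = []byte("demo-key") // stand-in for a real key store

func sign(t Token, done time.Time) string {
	mac := hmac.New(sha256.New, signingKey)
	fmt.Fprintf(mac, "%s|%s|%s|%d", t.Agent, t.Action, t.Target, done.Unix())
	return hex.EncodeToString(mac.Sum(nil))
}

// RunGoverned walks the flow: policy check, token issuance,
// scope-bound execution, receipt emission.
func RunGoverned(agent, action, target string,
	policy func(agent, action, target string) bool,
	op func(Token) error) (*Receipt, error) {
	if !policy(agent, action, target) {
		return nil, fmt.Errorf("policy denied %s on %s", action, target)
	}
	tok := Token{agent, action, target, time.Now().Add(time.Minute)}
	if err := op(tok); err != nil {
		return nil, err
	}
	done := time.Now()
	return &Receipt{tok, done, sign(tok, done)}, nil
}

func main() {
	allowDocs := func(agent, action, target string) bool {
		return action == "file.write" && target == "docs/plan.md"
	}
	r, err := RunGoverned("agent-1", "file.write", "docs/plan.md", allowDocs,
		func(t Token) error {
			// The operation sees only the scope in the token; touching any
			// other path would require a fresh authorisation round-trip.
			fmt.Println("editing", t.Target)
			return nil
		})
	fmt.Println(r != nil, err)
}
```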
The enforcement plane integrates with MCP and API/CLI adapters as ingress paths, with a GitHub App for delegated identity and authority, and with GoldenPath's existing policy certification workflow and session evidence patterns. The result is an attestation bundle for every governed agent session: a verifiable record of what was authorised, what was executed, and what evidence was produced, in a form that CI, Git admission controls, and audit systems can verify.
The Trade-offs We Had to Work Through
Building the enforcement plane involved decisions with real costs on each side.
We chose Go for the runtime core. The enforcement plane needs predictable concurrency, a production-grade standard library, and deployment as a single compiled binary. An enforcement runtime that requires complex dependency management introduces its own governance risk.
On transport scope: an MCP-only gateway would have been faster to build and would have covered the initial use case. A transport-neutral gateway covers any agent using any tool via any protocol through the same enforcement plane. The longer build was the right trade. An MCP-only approach would have required rebuilding enforcement logic for each new agent integration as the ecosystem evolves.
On enforcement placement: runtime enforcement and CI verification are layers, not alternatives. The runtime intercepts before the agent acts. CI verifies the attestation bundle the runtime produced. A change arriving at the merge gate without a valid attestation bundle is rejected. Enforcement happens twice, at different points in the delivery flow, using different mechanisms.
What We Have Done So Far
The work is complete through Stage 4. We have documented the core runtime design, policy evaluation model, token issuance protocol, attestation contract, and governed workspace orchestration in our internal architecture record (ADRs 0202 through 0206 and PRD-0014, available to qualified partners on request). The agent enforcement plane specification is fully implemented. The core packages are live: the gateway, the policy evaluator, the token issuer, the receipt emitter, the workspace manager. Branch protection is active on main.
Governed workspace orchestration was added in direct response to the parallelism problem. AI can increase parallel output, but unmanaged parallelism produces rework, cleanup cost, and governance debt. Agents operating on different branches that share the same local checkout, filesystem state, and git index are not isolated in any meaningful sense. GoldenPath now provisions bounded, governed workspaces automatically. Operators use workspace create, list, and close. The worktree complexity sits inside the platform.
The governing policy document makes all of this normative. Direct mutation outside the enforcement plane is non-compliant. CI and Git admission controls reject governed changes when attestation is missing. The policy is encoded in the systems that control access.

Why This Matters Commercially
The business case for governed AI delivery rests on a compound value argument.
The first layer is velocity. Teams using AI coding agents produce output faster. The governance layer makes that velocity safe to use at scale. A team running three agents in parallel on isolated governed workspaces, each producing attested output, moves faster than a team running one agent and manually reviewing the result before proceeding.
The second layer is auditability. Regulated industries, enterprise customers, and organisations operating under compliance frameworks require evidence that delivery processes are controlled. The attestation bundle is that evidence. Every session produces a verifiable record that integrates with existing CI and audit infrastructure. The compliance question has a structural answer.
The third layer is platform reuse. Governance decisions encoded in the enforcement plane accumulate value over time. A policy that prevents a class of agent error in one project prevents it across all projects using the platform. The enforcement plane becomes an asset that compounds as the organisation scales its use of AI-assisted delivery.
The Shift in Operating Model
The operating model before a governed runtime is trust-based. Teams set expectations through prompts, policies, and review processes. Agents generally behave within those expectations. Exceptions require human review, rework, and often significant cleanup cost.
The operating model with a governed runtime is proof-based. Agents operate within a structural boundary. Every mutating action is authorised before it executes. Every session produces verifiable evidence. The CI gate verifies that evidence before any change reaches the main branch.
Governance as documentation sets expectations. Governance as architecture enforces them.
Teams building on AI-assisted delivery now are establishing the foundation for how their engineering organisations will operate at scale. The governance model they put in place will compound. The question is whether that compounding works in their favour.
Next in this series: The Agent Said It Was Fixed. The Cluster Disagreed.
Building multi-agent workflows and thinking about governance? Get in touch — we'd love to compare notes.