Giving agents hands

Guardrails for agents in production

A catalogue of the guards I actually ship: typed confirmations, blast-radius escalation, pay-to-play gating, ordered workflows, and read-only by default.

18 Jun 2026 · 3 min

The scary moment with an agent is not when it fails. A failure throws an error and stops. The scary moment is when it succeeds at the wrong thing, confidently, against production, because nothing was in the way. Every guard below exists because I pictured that moment and built something to stand in front of it.

This is a catalogue, not a manifesto. Every guard below is running in a server I operate.

Typed confirmation literals

A boolean confirm flag is no guard at all. The model fills booleans in by pattern. So the destructive tools do not take confirm: boolean. They take a literal. Dropping a database requires confirm: z.literal("yes") in the zod schema. The model cannot satisfy that by guessing a shape; it has to produce the exact string, which in practice means it has read the description that told it the string. The type is the speed bump.
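A minimal sketch of the literal check, written without zod so it runs with no dependencies. The tool name, the database field, and the error messages are assumptions for illustration; in the real server the same rule is expressed as z.literal("yes").

```typescript
// Hypothetical input shape for a destructive tool. The field names are
// illustrative; only the literal-typed confirm comes from the post.
type DropDatabaseInput = {
  database: string;
  confirm: "yes"; // a literal type, not boolean
};

// Tool arguments arrive as untyped JSON, so the literal has to be checked
// at runtime as well. This stands in for what z.literal("yes") would do.
function parseDropDatabaseInput(raw: unknown): DropDatabaseInput {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("input must be an object");
  }
  const { database, confirm } = raw as Record<string, unknown>;
  if (typeof database !== "string" || database.length === 0) {
    throw new Error("database must be a non-empty string");
  }
  if (confirm !== "yes") {
    // true, "true", "YES" and a missing field all land here.
    throw new Error('confirm must be the exact string "yes"');
  }
  return { database, confirm: "yes" };
}
```

A call with confirm: true is rejected before any destructive code runs, which is the whole point of preferring the literal over the boolean.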

Escalate confirmation by blast radius

Not every destructive act is equally destructive, so the confirmation string gets harder as the damage grows. In the S3 server, deleting a single object needs confirm: "yes". Deleting a whole prefix, which could be thousands of objects, needs confirm: "yes-delete-all". The longer, uglier literal is deliberate friction sized to the consequence. You cannot fat-finger your way through a prefix delete the way you might through one object. The string you have to type grows with the hole you are about to dig.
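One way to sketch the escalation is a lookup from scope to required literal. The two strings are the ones quoted above; the DeleteScope type and the assertConfirmed helper are hypothetical names, not taken from the real server.

```typescript
// Hypothetical scopes for a delete tool.
type DeleteScope = "object" | "prefix";

// The required literal per scope. The wider the blast radius, the longer
// the string the caller has to produce.
const REQUIRED_CONFIRMATION: Record<DeleteScope, string> = {
  object: "yes",
  prefix: "yes-delete-all",
};

// Throws unless the caller supplied the exact literal for this scope.
function assertConfirmed(scope: DeleteScope, confirm: unknown): void {
  const required = REQUIRED_CONFIRMATION[scope];
  if (confirm !== required) {
    throw new Error(`${scope} delete requires confirm: "${required}"`);
  }
}
```

Note that the single-object literal does not unlock the prefix delete: assertConfirmed("prefix", "yes") throws, so a confirmation learned for the small operation cannot be reused for the large one.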

Pay-to-play gating

Some calls cost money, and an agent in a loop can spend it faster than you can watch. The image tool wraps that with a gate: the billable, high-quality path only fires when high_fidelity: true is set explicitly. The default path is the free or cheap one. There is no way to reach the expensive call by accident; the agent has to opt into spending, and that opt-in is visible in the call it made. When the bill shows up, the log shows exactly which request chose to pay.
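A rough sketch of such a gate, assuming a request object with an optional high_fidelity flag. The flag name is from the post; ImageRequest, RenderPath, gate, and the audit string format are made up for illustration.

```typescript
// Hypothetical request shape; only high_fidelity is named in the post.
type ImageRequest = { prompt: string; high_fidelity?: boolean };
type RenderPath = "cheap" | "billable";

interface GateDecision {
  path: RenderPath;
  audit: string; // what a log line for this request might record
}

// Only an explicit `true` opts in to spending. A missing or false flag
// stays on the cheap path.
function gate(req: ImageRequest): GateDecision {
  const optedIn = req.high_fidelity === true;
  const path: RenderPath = optedIn ? "billable" : "cheap";
  return { path, audit: `image request path=${path} high_fidelity=${String(optedIn)}` };
}
```

Because the decision and the audit string are built from the same check, the log cannot disagree with the path that actually ran.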

Encode the workflow order in the server

Some operations are only safe in a sequence, and the model does not know the sequence unless you tell it. The dokku server carries its workflow in the server instructions themselves: create the app, add the domain, set the letsencrypt email, set the ports, deploy, and only then enable SSL, because SSL can only succeed after DNS is actually pointing at the box and the first deploy has landed. Enabling letsencrypt as step two looks reasonable to a model reading the tool list cold. It fails every time. So the ordering is not left to inference. It is written down where the agent reads it, as a required sequence, so the obvious-but-wrong path is closed before it is tried.
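As a sketch, the sequence can live in one ordered constant that is rendered into the instruction text. The step wording follows the list above; the function names are hypothetical, and the prerequisite check is an optional extra that the post does not describe.

```typescript
// The required order, taken from the sequence described above.
const WORKFLOW = [
  "create the app",
  "add the domain",
  "set the letsencrypt email",
  "set the ports",
  "deploy",
  "enable SSL",
] as const;

type Step = (typeof WORKFLOW)[number];

// Render the sequence as numbered text for the server instructions.
function renderInstructions(): string {
  const lines = WORKFLOW.map((step, i) => `${i + 1}. ${step}`);
  return ["Required order for a new app:", ...lines].join("\n");
}

// Optional extra (an assumption, not from the post): list the earlier
// steps that have not been completed before the requested one.
function missingPrerequisites(step: Step, done: readonly Step[]): Step[] {
  const index = WORKFLOW.indexOf(step);
  return WORKFLOW.slice(0, index).filter((s) => done.indexOf(s) === -1);
}
```

Keeping the order in a single constant means the text the agent reads and any check the server performs cannot drift apart.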

Read-only by default

This one underwrites all the others. Every server ships read-only. List, get, report, describe: those are on from day one. Create, set, delete: those are off until I have watched the server drive in read mode long enough to trust it. Widening is a decision I make per verb, never a default the server ships with.

Read-only first is what makes the rest cheap. By the time a write verb is enabled, I have already seen how the model reaches for this system, what it tends to get wrong, where it needs a literal and where it needs an ordering. The guard is informed by the watching. You do not design the speed bumps in the abstract. You design them after you have seen the car take the corner too fast.
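The per-verb switch described in this section could be sketched as an allowlist that starts with the read verbs and only grows when a write verb is named explicitly. The verb names come from the lists above; the types and helper functions are hypothetical.

```typescript
type ReadVerb = "list" | "get" | "report" | "describe";
type WriteVerb = "create" | "set" | "delete";
type Verb = ReadVerb | WriteVerb;

// On from day one.
const READ_VERBS: readonly ReadVerb[] = ["list", "get", "report", "describe"];

// Write verbs are enabled only when named; the default is none.
function enabledVerbs(widened: readonly WriteVerb[] = []): Verb[] {
  return [...READ_VERBS, ...widened];
}

function isAllowed(verb: Verb, widened: readonly WriteVerb[] = []): boolean {
  return enabledVerbs(widened).indexOf(verb) !== -1;
}
```

With no arguments, isAllowed("delete") is false. Widening to ["set"] enables set and nothing else, which keeps each write verb a separate decision.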

The shape of all of it

None of these are clever. A literal string, a longer literal string, an explicit cost flag, an ordered list in the instructions, a write switch that stays off. The cleverness is in deciding which act gets which guard, and that judgment comes from running the thing, not from a framework.

An agent with hands is useful precisely because it can act without asking. Guards are how you decide, in advance, the small set of acts where it has to ask anyway. Pick those carefully. Everything else, let it run.
