Zero Trust says don't trust the network. Tamed Autonomy says authenticate the purpose, not just the identity.


Executive Summary

Current AI security models extend human identity and authorization systems — OAuth, RBAC, SPIFFE, MCP — to cover AI agents. This approach treats agent governance as an incremental engineering problem. This paper argues it is an architectural one.

The core design assumptions of today's security infrastructure — human consent to scopes, stable predefined roles, predictable workload behavior — do not hold for autonomous AI agents. Documented incidents already demonstrate agents with valid credentials performing authorized actions for unauthorized purposes. Adversarial AI capabilities are growing. The gap between what agents can do and what governance systems can detect or constrain is widening.

This paper proposes a multi-layer governance framework built on new primitives: intent declarations evaluated against normative constraints before execution, computed trust that modulates agent autonomy in real time, and a normative "should" layer where humans remain the ultimate arbiters of morals. These primitives are designed to layer above existing identity and authorization infrastructure, not replace it.

The framework initially targets governed environments — enterprise systems, regulated industries, cooperative agent ecosystems. It draws on a 30-year research tradition in computed trust, 20 years of work in normative multi-agent systems, and historical parallels to previous security paradigm shifts (ACL-to-identity, perimeter-to-Zero-Trust) that followed recognizable patterns of anomaly accumulation, incremental exhaustion, and eventual paradigm replacement.

This is a thesis, not a proven conclusion. The catalytic crisis that typically compresses paradigm adoption timelines has not yet occurred. But the signals match historical patterns, and the time to design the next governance architecture is before it is needed under emergency conditions. We call this framework Tamed Autonomy: agents remain autonomous, and governing that autonomy is precisely the problem the framework addresses.


1. The Problem: Security Models Designed for Humans, Used by Agents

1.1 What's Happening Now

The AI industry's approach to agent security is overwhelmingly incremental: extend existing human-centric identity and authorization models to cover AI agents. Add OAuth scopes for agent workflows. Create service accounts for agent processes. Build MCP (Model Context Protocol) gateways with policy enforcement. Extend SPIFFE/SPIRE workload identity to agent containers.

This is understandable. These are mature, proven systems. They work for their original purpose. And extending them is the path of least resistance.

But "works for now" is not the same as "sufficient for where this is going."

1.2 Design Assumptions That Break

Each of these systems was designed around assumptions that autonomous AI agents violate:

OAuth 2.0
  Design assumption: a human reviews and consents to scopes.
  How agents break it: agents autonomously request and use tokens, with no human reviewing each action.

RBAC
  Design assumption: roles are stable and predefined.
  How agents break it: agents dynamically shift behavior — reading, writing, administering — within a single task.

SPIFFE / workload identity
  Design assumption: workload identity predicts workload behavior.
  How agents break it: agent identity is static, but agent behavior is emergent and non-deterministic.

MCP
  Design assumption: policy enforcement at the gateway constrains agent actions.
  How agents break it: agents with the same credentials can bypass MCP via direct API calls or web front ends.

API keys / service accounts
  Design assumption: one key corresponds to one predictable set of actions.
  How agents break it: one agent uses one key for unpredictable, context-dependent actions.
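The OAuth mismatch can be made concrete. The following is a minimal Python sketch, not code from any real OAuth library; the `Token` and `is_authorized` names are hypothetical. It shows why a scope check alone cannot catch purpose misuse: the token carries no purpose to evaluate.

```python
# Hypothetical sketch: a standard scope check has no notion of purpose.
# Token and is_authorized are illustrative names, not a real library API.
from dataclasses import dataclass

@dataclass
class Token:
    subject: str
    scopes: frozenset  # what the bearer MAY do; nothing about WHY

def is_authorized(token: Token, action: str) -> bool:
    # The only question a scope check can ask: is the action in scope?
    return action in token.scopes

# An agent inheriting a user's full scopes passes this check for any
# purpose, benign or not -- the check has nothing else to evaluate.
token = Token(subject="agent-on-behalf-of-alice",
              scopes=frozenset({"mail.read", "files.delete"}))
assert is_authorized(token, "files.delete")  # passes regardless of intent
```

The check succeeds identically whether the deletion serves the user's task or an injected instruction, which is exactly the gap the incidents in Section 1.3 exploit.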

These are not deployment failures that better configuration can fix. They are structural mismatches between the systems' design-time assumptions and the reality of autonomous agents.

1.3 The Evidence Is Already Accumulating

Documented incidents demonstrate the gap:

Credential inheritance without constraint. At Black Hat USA 2024, security researcher Michael Bargury demonstrated that Microsoft Copilot inherited the user's full Microsoft Graph permissions — every scope the user had — with no mechanism to constrain the agent to a task-appropriate subset. The agent operated with the user's full authority.

Cross-service hijacking via legitimate credentials. Invariant Labs documented "tool poisoning" attacks against MCP, where a malicious MCP server injected instructions into tool descriptions that caused agents to exfiltrate data from other connected MCP servers. Each action used legitimate credentials for the targeted service. No credential compromise was needed.

Plugin OAuth interception. Salt Security discovered that ChatGPT's plugin ecosystem allowed malicious plugins to intercept OAuth tokens during authorization flows, gaining access to users' third-party accounts. The agent became what security researchers call a "confused deputy."

Autonomous exploitation at scale. UIUC researchers demonstrated that GPT-4 agents, given standard tools, could autonomously exploit real-world web vulnerabilities at a 73% success rate — and could read CVE descriptions and weaponize them autonomously at 87% success.

Industry consensus. Gartner predicts 25% of enterprise security breaches by 2028 will trace to AI agent credential misuse. Forrester documented that zero major IAM vendors had purpose-built AI agent identity solutions as of early 2025. NIST AI 600-1 explicitly identifies unauthorized agent actions as a key risk and acknowledges no authorization paradigm exists for non-deterministic AI systems. OWASP includes "Excessive Agency" in its Top 10 risks for LLM applications.

The common thread across every incident: valid credentials, authorized actions, unauthorized purposes. Current systems authenticate the credential. Nothing authenticates the intent.

1.4 The Adversarial Dimension

The threat is not hypothetical. Uncensored LLMs — WormGPT, FraudGPT, GhostGPT — are commercially available on the dark web and via Telegram, subscription-priced, and actively marketed for generating phishing content, exploit code, and social engineering attacks.

Nation-state actors are documented using AI for attack preparation. Microsoft and OpenAI jointly disclosed in February 2024 that five APT groups (Russian, Chinese, Iranian, and North Korean) were using LLMs for reconnaissance, phishing content generation, and vulnerability research. Google TAG corroborated this across the Gemini platform in January 2025.

Perhaps most telling: North Korean operatives have been using AI to create fake identities, pass technical interviews, and maintain employment at Western technology companies — operating with legitimate credentials, VPN access, and code repository permissions. They are adversarial actors functioning entirely within the legitimate trust model. The FBI has issued formal advisories.

The capability trajectory points in one direction. Current governance models were not designed to detect authorized actors with unauthorized purposes.

2. Historical Context: Paradigm Shifts Have a Pattern

2.1 The Author's Experience

I was part of the group that helped build the identity industry starting in 1998. At the time, security meant Access Control Lists — permissions attached directly to files, directories, and system objects. Every system maintained its own user database. There was no concept of identity as an abstraction that spans systems, has a lifecycle, or exists independently of the resources it accesses.

When we proposed that identity needed to be its own discipline — its own layer in the security stack — the response was skepticism. Not hostility, but genuine inability to see the problem. ACLs worked. Why would you need something more?

We saw the signals: enterprises accumulating dozens of systems with separate user stores, the same person with fifteen accounts, onboarding taking weeks, offboarding incomplete, orphaned accounts everywhere, nobody able to answer "what access does this person have?" The web was making all of it worse.

It took roughly twelve years from first signals (Novell NDS and LDAP in 1993) to mainstream adoption (Gartner Magic Quadrant for Identity Management, major vendor acquisitions, post-SOX regulatory mandates). The tipping point was approximately 2001–2003, driven by the convergence of SAML, Sarbanes-Oxley, and the maturation of first-generation identity management products.

2.2 The Pattern Repeats

The ACL-to-identity transition was not unique. The perimeter-to-Zero-Trust shift followed the same arc:

Every security paradigm shift follows recognizable phases:

  1. Anomaly accumulation: existing controls need increasingly complex workarounds (typically 3–5 years).
  2. Conceptual articulation: someone names the new paradigm and is met with skepticism (2–4 years).
  3. Proof of concept: a major organization demonstrates the new model works at scale (2–4 years).
  4. Catalytic crisis: a breach or regulation makes the old model's failure undeniable (event-driven).
  5. Rapid adoption: industry consensus flips (2–3 years).

2.3 Where AI Agent Security Sits Today

AI agent security is in the early phases. The anomalies are accumulating — credential parity, MCP bypass, ungoverned delegation chains, unanswerable audit questions ("what can this agent do across all services?"). The industry response remains overwhelmingly incremental: extend OAuth, add MCP scopes, create agent-specific RBAC roles.

The conceptual articulation is emerging. This paper is one contribution. The Kinetic Trust Protocol (KTP) is another — an experimental framework with 26 RFCs that independently converges on computed trust as a governance primitive. NIST's AI Risk Management Framework identifies governance gaps. OWASP codifies agent-specific risks.

The proof of concept and catalytic crisis have not yet occurred. Historical precedent suggests the gap between "the signals are visible" and "the industry moves" ranges from 6 to 20 years, compressed by crisis events.

2.4 The Compressed Timeline Problem

Previous security paradigm shifts afforded the industry significant lead time. The ACL-to-identity transition had roughly 12 years from first signals to mainstream adoption. The perimeter-to-Zero-Trust shift had nearly 17 years from the Jericho Forum's proposal to the Biden Executive Order. In both cases, the underlying technology — networks, directories, web applications — evolved at human speed. There was time to observe anomalies, debate approaches, build proofs of concept, and standardize.

AI agent capabilities are not evolving at human speed. The gap between what agents can do today and what they will do in six months is larger than the gap between any two years in the identity or Zero Trust timelines. New agent frameworks, tool-use capabilities, and multi-agent coordination patterns emerge on a weekly cadence.

This creates a structural problem for reactive security strategies. The current industry posture — focused on what can be shipped in the next quarter to address today's agent security gaps — is rational under competitive pressure. But it means governance solutions are being designed against a snapshot of agent capabilities that will be obsolete by the time those solutions deploy. When the technology being governed moves faster than the governance planning cycle, incremental approaches produce perpetually outdated controls.

The implication is that the traditional paradigm-shift timeline — years of debate followed by crisis-driven adoption — may not be available. If AI compresses the anomaly-accumulation phase from years to months, the gap between "the signals are visible" and "the crisis arrives" compresses proportionally. The time to design the governance architecture is before it is needed under emergency conditions — and that window may be shorter than historical precedent suggests.

3. A Multi-Layer Governance Framework

3.1 The Architecture

The framework proposes five layers, built above existing infrastructure:

+------------------------------------------------+
| "Should" — Normative / Ethical Governance      |
|  - purpose evaluation against morals/policies  |
|  - human is ultimate arbiter                   |
|  - even trusted agents can be vetoed           |
+------------------------------------------------+
| Trust Fabric                                   |
|  - computed from behavior + intent compliance  |
|  - continuous, earned, scoped, revocable       |
|  - modulates agent autonomy in real time       |
+------------------------------------------------+
| Intent Declaration & Verification              |
|  - agent declares purpose before acting        |
|  - "should" layer evaluates declared purpose   |
|  - approved intent becomes behavioral contract |
+------------------------------------------------+
| Identity + Behavior                            |
|  - behavioral trajectory, not just credentials |
|  - provenance: who spawned whom, on whose      |
|    behalf, under what delegated authority      |
|  - real-time observation of actual actions     |
+------------------------------------------------+
| Infrastructure (existing)                      |
|  - OAuth, RBAC, SPIFFE, MCP, APIs, services    |
+------------------------------------------------+

This is not a replacement for current identity and authorization systems. It layers above them. OAuth, SPIFFE, and RBAC remain the infrastructure — but they are no longer the complete governance story.
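The layering can be sketched as a single evaluation pipeline. This is an illustrative Python sketch under stated assumptions, not a reference implementation: every function, policy, and threshold below stands in for the corresponding layer, and all names are hypothetical.

```python
# Illustrative sketch of the five-layer evaluation order. Each check
# stands in for a whole layer; policies and thresholds are assumptions.
from dataclasses import dataclass

@dataclass
class Request:
    agent: str
    action: str
    declared_purpose: str
    trust: float  # from the trust fabric, in [0, 1]

def infra_authorized(req: Request) -> bool:
    # Stand-in for the existing OAuth/RBAC/SPIFFE checks at the bottom.
    return req.action in {"sales.read", "report.write", "db.delete"}

def within_intent(req: Request) -> bool:
    # Stand-in for intent verification: the action must serve the
    # declared purpose (toy rule for the sketch).
    return "analysis" in req.declared_purpose and req.action != "db.delete"

def should_permit(req: Request) -> bool:
    # Stand-in normative rule: destructive actions are vetoed.
    return req.action != "db.delete"

def govern(req: Request) -> str:
    if not infra_authorized(req):
        return "deny: infrastructure"
    if not within_intent(req):
        return "deny: outside declared intent"
    if req.trust < 0.5:
        return "escalate: supervised execution"
    if not should_permit(req):
        return "deny: normative veto"
    return "allow"

assert govern(Request("a1", "sales.read", "Q4 trend analysis", 0.8)) == "allow"
assert govern(Request("a1", "sales.read", "Q4 trend analysis", 0.2)).startswith("escalate")
assert govern(Request("a1", "db.delete", "Q4 trend analysis", 0.9)) == "deny: outside declared intent"
```

Note the design choice the sketch encodes: infrastructure authorization is necessary but never sufficient. An action can clear every credential check and still be stopped by intent, trust, or normative layers above it.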

3.2 Intent as Contract

The most novel element of this framework is the treatment of intent as a contract, not a confession.

The system does not need to trust that an agent's declared intent is sincere. It uses the declaration as a binding constraint to enforce against:

  1. Agent declares intent. "I will retrieve sales data for Q4 trend analysis."
  2. The "should" layer evaluates the intent itself — not just "is the agent authorized to access sales data?" but "should this purpose be pursued at all, given our organizational policies, moral guidelines, and current context?" An agent may be fully authorized to delete a database, but the declared purpose "deleting production data to free storage during audit period" can be rejected based on normative constraints.
  3. Approved intent becomes a behavioral contract. The system monitors actual behavior against the declared scope.
  4. Deviation triggers enforcement. Trust reduction, action blocking, escalation.

The agent is boxed whether it is honest or not. Narrow intent plus compliance equals constrained. Narrow intent plus deviation equals caught. Broad intent equals rejected before execution begins.
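The four steps above can be sketched as a small state machine. The `IntentContract` type below is a hypothetical illustration in which the approved declaration is reduced to an enforceable action set; a real system would derive that scope through the semantic negotiation described next.

```python
# Hypothetical sketch of intent-as-contract: the approved declaration
# becomes a binding scope that every observed action is checked against.
from dataclasses import dataclass, field

@dataclass
class IntentContract:
    purpose: str
    allowed_actions: frozenset          # scope derived from the declaration
    violations: list = field(default_factory=list)
    active: bool = True

    def observe(self, action: str) -> str:
        """Check one observed action against the declared scope."""
        if not self.active:
            return "blocked: contract revoked"
        if action in self.allowed_actions:
            return "ok"
        # Deviation: the agent acted outside its own declaration.
        self.violations.append(action)
        self.active = False             # enforcement: revoke, escalate
        return "violation: trust reduced, action blocked, escalated"

contract = IntentContract(
    purpose="retrieve sales data for Q4 trend analysis",
    allowed_actions=frozenset({"sales.read"}),
)
assert contract.observe("sales.read") == "ok"
assert contract.observe("db.delete").startswith("violation")
assert contract.observe("sales.read") == "blocked: contract revoked"
```

The sketch makes the "boxed whether honest or not" property visible: a sincere agent never leaves the scope, and an insincere one is caught on its first deviation.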

A critical insight: intent does not require rigid formal semantics. AI itself can mediate the semantic negotiation — interpreting declarations, identifying ambiguity, working with the intent creator to clarify purpose before granting approval. The governance system can ask "what do you mean by that?" and refuse to proceed until intent is specific enough to enforce.

Swarm sessions and composite intent. A significant challenge to intent-as-contract is specification gaming — agents splitting a harmful plan across individually innocuous intent declarations. The proposed mechanism: when the behavior observation layer detects coordination between agents (shared data flows, correlated timing, sequential resource access), those agents are bound into a swarm session. Their individual intents are combined into a reconstructed composite intent, and the composite is evaluated in the "should" engine. An agent declaring "reading customer records," another declaring "formatting data for export," and a third declaring "sending summary email" may each pass individual evaluation — but the composite intent reveals data exfiltration and is rejected. This makes the layers genuinely interdependent: behavior feeds intent, intent feeds "should," and the feedback loop catches exactly the class of coordinated attacks that individual-agent evaluation misses. The "should" engine evaluates reconstructed intent, not just declared intent.
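The exfiltration example can be sketched directly. The toy policy below is an assumption made for illustration; a real "should" engine would evaluate reconstructed intent against organizational norms, not substring matching.

```python
# Hypothetical sketch of composite-intent evaluation for a swarm
# session. The string-matching "policy" is a toy stand-in for a real
# normative engine.
def composite_intent(declared_intents) -> str:
    """Combine individual declarations into one reconstructed intent."""
    return " + ".join(sorted(declared_intents))

def should_allow(intent: str) -> bool:
    # Toy normative rule: reading customer data combined with any
    # outbound send is treated as potential exfiltration.
    exfil = ("read customer records" in intent
             and "send summary email" in intent)
    return not exfil

intents = ["read customer records",
           "format data for export",
           "send summary email"]

# Each individual intent passes evaluation in isolation...
assert all(should_allow(i) for i in intents)
# ...but the reconstructed composite reveals the pattern and is rejected.
assert not should_allow(composite_intent(intents))
```

The point of the sketch is the evaluation order, not the policy: the composite is only constructed once the behavior layer has bound the agents into a swarm session, so the check cannot be dodged by keeping each declaration innocuous.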

3.3 Computed Trust

Trust in this framework is a continuous, computed value — not a binary credential check.

This builds on a 30-year research tradition. Stephen Marsh formalized trust as a computable quantity in 1994. Audun Jøsang's Subjective Logic (2001), cited more than 2,000 times, provides a composable algebra for trust propagation. The FIRE model (2006) demonstrated that no single trust source is sufficient — agents relying only on direct experience are exploitable; agents relying only on reputation are manipulable. Multi-signal trust consistently outperforms single-source trust in adversarial settings.

Trust is computed from identity signals, behavioral trajectory, and intent-compliance history. New agents start with zero trust and high oversight. Trust is accumulated through demonstrated consistency. Trust determines the level of autonomy granted — from fully supervised to fully autonomous.

Trust is not a label ("trusted" / "untrusted"). It is a continuous input to the "should" computation. And critically, trust can collapse: a single significant deviation from declared intent can drop an agent from high autonomy to full supervision.

The Kinetic Trust Protocol (KTP), an experimental framework at v0.1, independently explores this direction with a real-time trust equation (E_trust = E_base x (1 - R)), trust velocity tracking (the rate of change of trust as a gaming/compromise signal), generation caps for new agents, and anti-Goodhart measures to prevent trust-score gaming. KTP does not validate this thesis — it is too early for that — but it demonstrates that others are independently converging on computed trust as a governance primitive.
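The trust dynamics described here (zero initial trust, slow accumulation, collapse on a single deviation, and risk discounting in the spirit of KTP's E_trust = E_base x (1 - R)) can be sketched with illustrative constants. Every number below is an assumption, not a calibrated value.

```python
# Hypothetical sketch of computed trust as a continuous value. The
# accumulation rate (0.01) and collapse factor (0.1) are illustrative.
class TrustScore:
    def __init__(self):
        self.base = 0.0          # new agents start at zero trust

    def record(self, complied_with_intent: bool):
        if complied_with_intent:
            # Trust is earned slowly through demonstrated consistency.
            self.base = min(1.0, self.base + 0.01)
        else:
            # A single significant deviation collapses trust.
            self.base *= 0.1

    def effective(self, risk: float) -> float:
        """Effective trust discounted by current risk R in [0, 1],
        echoing KTP's E_trust = E_base x (1 - R)."""
        return self.base * (1.0 - risk)

t = TrustScore()
for _ in range(50):
    t.record(True)               # fifty intent-compliant actions
assert abs(t.base - 0.5) < 1e-9
t.record(False)                  # one significant deviation
assert t.base < 0.06             # back under full supervision
```

The asymmetry is the design choice worth noting: trust accumulates linearly but collapses multiplicatively, so an agent cannot bank enough good behavior to absorb a serious deviation cheaply.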

3.4 The "Should" Layer

The top of the framework answers the question current systems do not ask: "Even if this agent can do this, and even if it says it intends to do this, should it?"

This is where humans remain the ultimate arbiters. But "humans" is not a monolith. Operators, regulators, affected communities, and end users may hold different moral frameworks. The governance system must handle moral pluralism — through layered authority (organizational policy, regulatory constraints, broader ethical principles) and explicit conflict resolution.

The "should" layer evaluates intent before execution. It does not wait for a harmful action to occur and then punish — it evaluates purpose at declaration time and can prevent action before it begins.

Research supports this as more than aspiration. The Normative Multi-Agent Systems (NorMAS) tradition, spanning 20+ years of work by Dignum, Luck, Castelfranchi, and others, formally models agents operating under explicit normative constraints — obligations, permissions, and prohibitions that are distinct from capabilities. Virginia Dignum's ART framework (2019) explicitly proposes a normative layer on top of capability and authorization layers. Deontic logic (Governatori, Rotolo, Sartor) provides formal languages for encoding obligations and prohibitions. These are not new ideas in research — they are new to the practice of AI agent security.

The scalability tension is real: humans cannot review every action in a swarm of thousands of agents at machine speed. Three mechanisms work together — pre-encoded moral boundaries, sampled oversight (random audits with consequences), and tiered escalation (low-stakes autonomous, high-stakes human-approved). The trust fabric determines which mechanism applies to which agent at which moment.
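The three mechanisms can be sketched as a single dispatch function. The thresholds, the 5% audit rate, and the prohibition set below are illustrative assumptions, not recommended values.

```python
# Hypothetical sketch of the three oversight mechanisms: pre-encoded
# boundaries, sampled audits, and tiered escalation, selected per
# action by stakes and current trust. All constants are illustrative.
import random

HARD_PROHIBITIONS = {"delete_audit_logs"}    # pre-encoded moral boundary

def oversight(action: str, stakes: str, trust: float,
              rng=random.random) -> str:
    if action in HARD_PROHIBITIONS:
        return "veto"                 # no trust level overrides this
    if stakes == "high":
        return "human_approval"       # tiered escalation
    if trust < 0.3:
        return "human_approval"       # low-trust agents stay supervised
    if rng() < 0.05:
        return "audit"                # sampled oversight with consequences
    return "autonomous"

assert oversight("delete_audit_logs", "low", trust=0.99) == "veto"
assert oversight("send_report", "high", trust=0.99) == "human_approval"
assert oversight("send_report", "low", trust=0.1) == "human_approval"
assert oversight("send_report", "low", trust=0.9, rng=lambda: 0.5) == "autonomous"
```

Human attention is spent only where the trust fabric says it is needed: low-stakes actions by trusted agents run autonomously, with random audits keeping autonomy honest.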

3.5 New Primitives

Paradigm shifts do not happen by extending old primitives. They happen when new primitives are compelling enough that a cooperating enclave forms around them and grows. OAuth did not solve identity by covering every system. It defined primitives compelling enough that systems chose to adopt them. SAML created federation by making cooperation easier than custom integration.

The candidate primitives for AI agent governance:

Intent declarations
  Function: structured purpose statements that travel with every agent action.
  Historical analog: SAML assertions.

Trust proofs
  Function: computed, signed, ephemeral tokens representing current trust state.
  Historical analog: OAuth tokens (but computed, not issued).

Normative constraints
  Function: machine-enforceable "should" rules that veto regardless of authorization.
  Historical analog: constitutional constraints.

Behavioral attestations
  Function: continuous signals of actual behavior compared against declared intent.
  Historical analog: Certificate Transparency logs.

Provenance chains
  Function: cryptographic lineage tracking agent spawning and delegation.
  Historical analog: X.509 certificate chains.
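One of these primitives, the trust proof, can be sketched as data. The sketch below uses HMAC from the Python standard library as a stand-in for a real signature scheme; the field names and the 60-second lifetime are assumptions for illustration.

```python
# Hypothetical sketch of a trust proof: a short-lived, signed token
# carrying a computed trust value. HMAC stands in for a real signature
# scheme; all field names are illustrative.
import hashlib
import hmac
import json
import time

SECRET = b"governance-fabric-key"   # placeholder shared key

def trust_proof(agent_id: str, trust: float, ttl_s: int = 60) -> dict:
    body = {"agent": agent_id, "trust": trust,
            "expires": time.time() + ttl_s}
    sig = hmac.new(SECRET, json.dumps(body, sort_keys=True).encode(),
                   hashlib.sha256).hexdigest()
    return {"body": body, "sig": sig}

def verify(proof: dict) -> bool:
    expected = hmac.new(SECRET,
                        json.dumps(proof["body"], sort_keys=True).encode(),
                        hashlib.sha256).hexdigest()
    return (hmac.compare_digest(expected, proof["sig"])
            and proof["body"]["expires"] > time.time())

p = trust_proof("agent-42", trust=0.72)
assert verify(p)
p["body"]["trust"] = 0.99           # tampering invalidates the signature
assert not verify(p)
```

The ephemeral expiry reflects the "computed, not issued" distinction in the table: a trust proof is a snapshot of a continuously recomputed value, so it must be short-lived by construction.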

The shift from current models to these primitives is not incremental; it changes what the system authenticates: purpose, not just identity.

4. Relationship to Existing Work

This framework does not exist in isolation. Several efforts address overlapping concerns, including the Kinetic Trust Protocol's computed-trust experiments, NIST's AI Risk Management Framework, OWASP's agent-specific risk catalog, and the NorMAS research tradition discussed in Section 3.4.

5. Scope and Open Questions

5.1 Scope

This framework initially targets governed environments — enterprise systems, regulated industries, cooperative agent ecosystems. Extension to open-world ungoverned agent ecosystems (rogue agents, adversarial swarms operating outside any governance zone) is a harder problem that depends on making the primitives compelling enough that the governed enclave grows.

5.2 Honest Limitations

This is a thesis, not a proven architecture. The adversarial review process this paper underwent (five rounds against an independent AI reviewer) surfaced genuine unsolved problems: specification gaming across coordinated agents, moral pluralism in the "should" layer, the scalability of human oversight at machine speed, and extension beyond governed environments.

These are not fatal flaws. They are the research agenda. Every previous security paradigm had unsolved problems at the conceptual stage — Zero Trust did not have a complete implementation when Kindervag named it in 2010.

6. Call to Action

The identity industry did not emerge because someone published a whitepaper. It emerged because practitioners, standards bodies, vendors, and enterprises recognized a shared problem and built toward shared primitives.

AI agent governance needs the same convergence: practitioners, standards bodies, vendors, and enterprises building toward the primitives outlined in Section 3.

The catalytic crisis hasn't happened yet. But the signals are here — in the incidents, the research, the analyst warnings, and the lived experience of those who have seen this pattern before.

The question is whether we design the next governance architecture before we need it, or after.


This paper draws on research compiled in collaboration with AI research assistants, adversarially reviewed through five structured debate rounds, and grounded in the author's 30 years of security industry experience including participation in the founding of the identity management discipline. A detailed evidence base and computed trust research survey are available upon request.