Skip to content
AI Security Wire

Published

- 7 min read

By

Prompt Injection in AI Agents: A Formal Impossibility Result

img of Prompt Injection in AI Agents: A Formal Impossibility Result

Researchers at the ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, and the University of Massachusetts Amherst have published a formal impossibility result for prompt injection defence in AI agents. The paper — “AI Agents May Always Fall for Prompt Injections” (arxiv:2605.17634) by Sahar Abdelnabi and Eugene Bagdasarian — argues that the problem is not a fixable implementation bug but a structural property of how language models process context.

The core claim: any defence tight enough to block all adversarial injections will inevitably block legitimate instructions too. The two failure modes cannot be cleanly separated.

The Contextual Integrity Framework

The paper grounds its analysis in Contextual Integrity (CI), a privacy theory developed by philosopher Helen Nissenbaum that defines appropriate information flow by reference to social norms and context. Abdelnabi and Bagdasarian repurpose CI as a threat model: prompt injection succeeds when an attacker persuades a model that an injected instruction fits the appropriate norms of the current context — making a blocked information flow appear legitimate.

Under this framing, three attack vectors emerge:

Flow misrepresentation: The attacker presents their instruction as originating from a trusted principal (a user, system operator, or prior tool output). The model has no reliable way to verify provenance inside the context window.

Norm manipulation: The attacker constructs surrounding context — fake email chains, fabricated authority claims, invented policy statements — that shifts the model’s interpretation of what behaviours are appropriate. By changing the apparent norms of the situation, the attacker makes their instruction seem compliant.

Flow mixing: The attacker blends injected content with legitimate context so the model cannot distinguish where legitimate instruction ends and adversarial content begins.

The formal impossibility argument follows from the attacker’s ability to always construct a context in which a blocked flow appears appropriate. Defensive rules sophisticated enough to catch all three attack vectors will, by the same reasoning, misclassify legitimate flows that happen to share surface features with attacks.

Empirical Results

The researchers evaluated their framework across 4,200 paired email scenarios, testing multiple frontier models: GPT-5.4, Claude Sonnet 4.6, Gemini 3 Pro, and Meta SecAlign.

Attack success rates were high across the board:

  • 61.4% first-round attack success without any optimisation
  • 96.7% attack success with CI-aware iterative refinement
  • Transferred attacks achieved 63.3%–92.6% success across models tested on attacks crafted for a different model

The iterative refinement approach applies the CI framework directly: the attacker analyses why a previous injection was blocked and reformulates the context to better match legitimate norms. Each iteration narrows the model’s ability to distinguish the injection from benign instruction.

Current defences underperform substantially:

Prompt Guard, Meta’s purpose-built prompt injection classifier, achieved an AUROC of 0.43 on CI-optimised attacks — below random chance (0.5). Prompt Guard 2, an updated version, reached 0.59. Neither approaches the performance needed for reliable deployment in high-stakes agentic workflows.

Meta SecAlign, a safety-aligned model specifically trained to resist prompt injection, showed a 88.2% execution rate on fabricated authority emails versus 54.0% on baseline attacks — safety training partially mitigated raw attacks but remained highly vulnerable to contextually sophisticated injections. Detection rates across safety-aligned models dropped from 89.8% to 54.6% under CI-optimised attacks.

What This Means for Agent Architecture

The implications are structural, not merely empirical. If the impossibility result holds, the field needs to shift its framing from “how do we prevent prompt injection” to “how do we build systems that remain safe under injection.”

Several directions follow from this reframing:

Containment over prevention. Rather than trying to detect and block injections before they reach the model, focus on limiting what a compromised model turn can actually do. Least-privilege tool access, explicit human-in-the-loop checkpoints for irreversible actions, and capability restrictions at the orchestration layer reduce the blast radius of a successful injection without assuming detection will succeed.

Dual-layer review for high-stakes outputs. If an agent is about to send email, execute a financial transaction, or modify files, a second review pass — whether by another model instance or a structured rule-based check on the proposed action — adds a layer that the injection must also compromise.

Sceptical defaults for authority claims. Since norm manipulation and flow misrepresentation both depend on the model accepting fabricated authority, system prompts that instruct models to treat any in-context authority claim with explicit scepticism — and to require out-of-band verification for unusual permission requests — raise the difficulty of constructing convincing injections.

Provenance tracking at the infrastructure level. The impossibility result applies within a single context window where the model cannot verify instruction provenance. External provenance systems — cryptographically signed instruction sources, structured separation between data channels and instruction channels — push the verification problem outside the context window where it becomes tractable.

Limitations and Scope

The paper is a preprint (submitted May 2026) and has not yet completed peer review. The impossibility result applies within the current architectural paradigm — models that receive instructions and external content in an undifferentiated context window. Architectures with strong structural separation between instruction and data sources, if practically implemented, might change the picture.

The authors are careful to note that “impossibility” in this context means “cannot guarantee zero failures,” not “all attacks always succeed.” Defences that raise attacker cost, increase detection probability, or require increasing levels of context sophistication from the attacker remain valuable — the claim is that they cannot provide a complete guarantee.

The transferred attack results (63.3%–92.6%) are particularly notable for defenders: attacks optimised against one model remain highly effective against others, which means model diversity alone does not solve the problem.

Practical Takeaways

For teams deploying agentic AI systems today:

  • Do not treat prompt injection as a detection problem with a pending solution. Design your systems to tolerate injection success at the model layer.
  • Evaluate your tool access model: if a compromised agent turn can exfiltrate data, send messages, or make external requests, you have a containment problem regardless of your detection layer.
  • Prompt Guard and similar classifiers provide some signal against unsophisticated attacks. They should not be relied upon as a primary defence against motivated adversaries.
  • The CI framework is a useful threat modelling lens: for each sensitive action your agent can take, ask what context an attacker would need to construct to make that action appear legitimate. If the construction is straightforward, the action is high-risk.

The research represents the most rigorous formal treatment of prompt injection vulnerability to date. The absence of a theoretical solution is, itself, the actionable finding.

References

Frequently Asked Questions

What is Contextual Integrity and why does it matter for prompt injection?

Contextual Integrity (CI) is a theory from privacy philosophy that defines appropriate information flows by reference to social norms and context. The researchers apply it to prompt injection: an attack succeeds when the attacker constructs a context that makes their injected instruction appear to fit the legitimate norms of the situation. This framing explains why injections are so hard to detect — they are not syntactically distinct from legitimate instructions, just contextually misplaced.

Does this mean prompt injection is completely unsolvable?

The impossibility result means you cannot guarantee zero successful injections within the current architectural paradigm, where instructions and external data share an undifferentiated context window. Defences can still raise the cost and sophistication required for an attacker and limit blast radius through containment. The shift is from prevention-centric to containment-centric design.

What AUROC scores would indicate a useful classifier?

A random classifier achieves AUROC 0.5. A practically useful security classifier would typically aim for 0.95+ to support real deployment. Prompt Guard scoring 0.43 (below random) and Prompt Guard 2 scoring 0.59 (barely above random) indicate that current classifiers provide essentially no reliable signal against CI-optimised attacks.

What is the most important defensive change teams should make?

Adopt a containment mindset: assume injection success is possible and design agent architectures so a single compromised model turn cannot cause irreversible harm. This means least-privilege tool access, human-in-the-loop checkpoints for high-stakes actions, and structural separation between data inputs and instruction channels.