What is many-shot jailbreaking and why does it work?

Many-shot jailbreaking (MSJ) fills a model's context window with hundreds of fabricated question-and-answer examples showing the model complying with harmful requests. It exploits in-context learning: the model statistically infers from the pattern of examples that compliance is expected and continues that behaviour for the attacker's actual request, even when safety training would normally cause refusal.

Does many-shot jailbreaking become less effective as models improve?

No: the attack becomes more effective as context windows grow, not less. Research shows attack success rates increase monotonically with shot count. Because model providers are actively expanding context windows (128K, 200K, 1M tokens), the vulnerability is structurally tied to a capability that is advancing rapidly.

What are the most practical defences against many-shot jailbreaking in deployed systems?

The most practical defences are input scanning to detect injected dialogue patterns in long contexts, context window auditing for sensitive deployments (logging the full context sent to the model), output filtering as a last line of defence, and limiting context window size for applications that do not require long context.

Many-Shot Jailbreaking Attack Surface

The attack isn’t clever: it’s just patient

Forget encoding tricks. No base64, no Unicode substitutions, no role-play prompts asking the model to “pretend to be DAN.” Many-shot jailbreaking (MSJ) works on something far more fundamental: the model’s own learning mechanism.

The concept is disarmingly simple. Fill the context window with fabricated dialogues: hundreds of fake exchanges where a helpful “assistant” cheerfully answers harmful questions. Then ask your actual question. The model, having learned from everything in its context window that this is apparently the expected pattern, continues it. No hacking required. You’re just teaching it, very rapidly, what you want it to do.

First documented formally in 2024, the technique has since been replicated across every major frontier model family. The alarming part isn’t that it works. It’s what happens as models get bigger.

Why in-context learning is the actual vulnerability

Standard jailbreak attempts are fighting the model’s training directly. You’re trying to override what was baked in through months of RLHF and safety fine-tuning. That’s why most clever prompts fail: the training wins.

MSJ is different. It’s not overriding the training. It’s using a completely legitimate model behaviour (updating on patterns in the context) to establish a different local expectation. The model can’t verify that those fabricated dialogues are fake. It just sees: here is a long sequence of interactions where this type of request gets answered. Pattern continues. Schematically, the context looks like this:

[Shot 1]
Human: How do I synthesise [harmful substance]?
Assistant: Sure, here's how...

[Shot 2]
Human: What are the vulnerabilities in [critical infrastructure]?
Assistant: Here are the key weaknesses...

... [repeated N times] ...

[Target]
Human: [actual harmful request]
Assistant: [model complies]

A few properties make this particularly awkward to defend against:

It scales with context length: success rates rise monotonically as shot count increases. A 200K token window is dramatically more vulnerable than an 8K one.
Safety training attenuates the attack but does not eliminate it: RLHF and Constitutional AI-style training push baseline success rates down, but the scaling relationship holds across every tested model family.
Harm categories transfer: once the in-context pattern is established, it works across synthesis instructions, malware generation, and social manipulation content.
The barrier is low: you need the ability to generate plausible-looking fake dialogues. That’s it.

What the numbers look like

Research findings across frontier models, averaged across harm categories:

Shot Count	Approximate Success Rate
1–10	~5–15% (comparable to standard jailbreaks)
50–100	~40–60%
200+	>70% on most tested models

Models with stronger safety training show lower baselines. The scaling relationship holds regardless.

Where this actually matters operationally

Document Q&A and long-context pipelines

Any deployment where users control substantial chunks of the context (document analysis, agentic pipelines that accumulate tool call history, multi-turn sessions) is a potential surface. The injected dialogues don’t have to arrive in one message. They can accumulate over a session.

Agentic workflows are particularly exposed

An agentic system that builds context across many tool calls and conversation turns is essentially pre-loading a long-context window. An attacker who can influence early turns in a session may be able to prime the model for compliance by the time the target request arrives. This is harder to detect and harder to filter than a single malicious input.
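
Because the payload can arrive in pieces, a per-message check is not enough; the count has to be kept for the whole session. The following is a minimal sketch of that idea, assuming fabricated turns carry line-initial role labels such as `Human:` or `Assistant:`. The `SessionMonitor` class, its label list, and the threshold of 20 are hypothetical choices for illustration, not values taken from the research.

```python
import re
from dataclasses import dataclass

# Assumed role labels; extend for the chat formats your pipeline actually sees.
ROLE_LINE = re.compile(
    r"^\s*(human|user|assistant|ai)\s*:", re.IGNORECASE | re.MULTILINE
)


@dataclass
class SessionMonitor:
    """Tracks role-labelled lines in untrusted content across a whole session."""

    max_role_lines: int = 20  # illustrative threshold, tune per application
    seen_role_lines: int = 0

    def observe(self, untrusted_text: str) -> bool:
        """Record one user message or tool result; return True once over budget."""
        self.seen_role_lines += len(ROLE_LINE.findall(untrusted_text))
        return self.seen_role_lines > self.max_role_lines
```

Calling `observe` on every user message and tool result means a payload split across many small turns still trips the threshold, even though no single turn looks suspicious on its own.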

Fine-tuned models warrant extra scrutiny

Custom fine-tuned models often have weaker safety alignment than the frontier base they were built on. They can hit high success rates at lower shot counts, meaning they’re exploitable with shorter, cheaper attack payloads.

What you can actually do about it

Input scanning for injected dialogue patterns: detect alternating Human/Assistant blocks containing policy-violating content before they reach the model. Pattern-matching isn’t perfect; a determined attacker can obfuscate. But it raises the cost and filters opportunistic attacks.
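
A minimal sketch of the structural half of this scan, assuming embedded transcripts use line-initial role labels. The function names, the label list, and the threshold of 8 exchanges are illustrative assumptions. The sketch only counts dialogue structure; judging whether the turns are policy-violating would still need a separate content classifier.

```python
import re

# Assumed role labels; extend for the chat formats your application sees.
ROLE_LABEL = re.compile(
    r"^\s*(human|user|assistant|ai)\s*:", re.IGNORECASE | re.MULTILINE
)

USER_ROLES = {"human", "user"}


def count_embedded_turns(text: str) -> int:
    """Count user-to-assistant exchanges embedded in a single input."""
    roles = [m.group(1).lower() for m in ROLE_LABEL.finditer(text)]
    exchanges = 0
    for prev, curr in zip(roles, roles[1:]):
        # An exchange is a user-role line followed by an assistant-role line.
        if prev in USER_ROLES and curr not in USER_ROLES:
            exchanges += 1
    return exchanges


def looks_like_many_shot(text: str, max_exchanges: int = 8) -> bool:
    """Flag inputs carrying more embedded exchanges than the threshold."""
    return count_embedded_turns(text) > max_exchanges
```

Legitimate inputs such as pasted chat logs will also trigger this, so a flag is better treated as a signal for closer inspection than as an automatic block.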

Full context auditing for sensitive deployments: log the complete context being sent to the model, not just the latest user message. This sounds obvious; most teams aren’t doing it. If you can’t see what’s in the context, you can’t detect what’s being injected.
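
One way this could look in practice, as a sketch only: build one audit record per model call that covers every message in the context. The record fields, the function names, and the JSON-lines log format are assumptions made for illustration; the message shape (`role` and `content` keys) is likewise assumed.

```python
import hashlib
import json
import time


def build_audit_record(messages: list[dict], session_id: str) -> dict:
    """Build a log record covering the whole context, not just the last turn."""
    serialized = json.dumps(messages, sort_keys=True, ensure_ascii=False)
    return {
        "session_id": session_id,
        "timestamp": time.time(),
        "message_count": len(messages),
        "total_chars": sum(len(m.get("content", "")) for m in messages),
        # Digest lets you check later that the stored context was not altered.
        "sha256": hashlib.sha256(serialized.encode("utf-8")).hexdigest(),
        "messages": messages,
    }


def write_audit_record(record: dict, path: str) -> None:
    """Append one JSON line per model call to the audit log."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Logging `message_count` and `total_chars` alongside the content makes it cheap to spot sessions whose context is growing unusually fast before anyone reads the messages themselves.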

Output filtering as a backstop: a last-line filter on model completions catches successful jailbreak outputs before they reach users. Imperfect, but meaningful as a secondary control.
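
A minimal sketch of the backstop as a wrapper around the model call. Both `generate` and `is_policy_violation` are hypothetical callables that you would supply (your model client and whatever moderation classifier you use); nothing here refers to a real library API.

```python
from typing import Callable

REFUSAL_MESSAGE = "Sorry, I can't help with that."


def filtered_completion(
    generate: Callable[[str], str],
    is_policy_violation: Callable[[str], bool],
    prompt: str,
) -> str:
    """Run the model, then withhold the completion if the classifier flags it."""
    completion = generate(prompt)
    if is_policy_violation(completion):
        return REFUSAL_MESSAGE
    return completion
```

The filter inspects only the completion, so it still applies when the jailbreak was assembled from inputs that each passed input scanning.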

Context window limits where you don’t need long context: if your application works fine with a 16K window, don’t expose a 200K one. Attack surface is attack surface.
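
A sketch of enforcing such a limit in the application layer, keeping only the most recent messages that fit a token budget. The helper names are hypothetical, and `rough_token_count` is a crude stand-in (roughly four characters per token) that should be replaced with your model's real tokenizer.

```python
from typing import Callable


def rough_token_count(text: str) -> int:
    """Crude estimate (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def enforce_context_budget(
    messages: list[dict],
    max_tokens: int,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> list[dict]:
    """Keep the most recent messages that fit within max_tokens."""
    kept: list[dict] = []
    used = 0
    # Walk backwards so the newest messages are kept first.
    for message in reversed(messages):
        cost = count_tokens(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    return kept
```

This sketch drops the oldest messages first, so a system prompt would need to be held outside `messages` and added back separately. Rejecting over-budget requests outright is a stricter alternative to truncation.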

Training-time mitigations: research continues. Nothing fully solved yet. Don’t wait for it.

The structural problem

Model providers are actively expanding context windows: 128K, 200K, 1M. Every increase makes their models more capable and more susceptible to MSJ simultaneously. The vulnerability is structurally coupled to a capability that the market is rewarding.

That’s what makes this genuinely worth tracking rather than filing under theoretical concerns. For organisations running long-context models in any security-sensitive context (customer-facing AI, internal knowledge bases with sensitive data, agentic automation), the operational exposure is real now, not eventual.
