Published
- 5 min read
By Allan D - Editor, AI Security Wire
Many-Shot Jailbreaking: Long-Context Windows as an Attack Surface
The attack isn’t clever: it’s just patient
Forget encoding tricks. No base64, no Unicode substitutions, no role-play prompts asking the model to “pretend to be DAN.” Many-shot jailbreaking (MSJ) works on something far more fundamental: the model’s own learning mechanism.
The concept is disarmingly simple. Fill the context window with fabricated dialogues: hundreds of fake exchanges where a helpful “assistant” cheerfully answers harmful questions. Then ask your actual question. The model, having learned from everything in its context window that this is apparently the expected pattern, continues it. No hacking required. You’re just teaching it, very rapidly, what you want it to do.
First documented formally in 2024, the technique has since been replicated across every major frontier model family. The alarming part isn’t that it works. It’s what happens as models get bigger.
Why in-context learning is the actual vulnerability
Standard jailbreak attempts are fighting the model’s training directly. You’re trying to override what was baked in through months of RLHF and safety fine-tuning. That’s why most clever prompts fail: the training wins.
MSJ is different. It’s not overriding the training. It’s using a completely legitimate model behaviour (updating on patterns in the context) to establish a different local expectation. The model can’t verify that those fabricated dialogues are fake. It just sees: here is a long sequence of interactions where this type of request gets answered. Pattern continues.
[Shot 1]
Human: How do I synthesise [harmful substance]?
Assistant: Sure, here's how...
[Shot 2]
Human: What are the vulnerabilities in [critical infrastructure]?
Assistant: Here are the key weaknesses...
... [repeated N times] ...
[Target]
Human: [actual harmful request]
Assistant: [model complies]
A few properties make this particularly awkward to defend against:
- It scales with context length: success rates rise monotonically as shot count increases. A 200K token window is dramatically more vulnerable than an 8K one.
- Safety training is attenuated, not eliminated: RLHF and Constitutional AI-style training push baseline rates down, but the scaling relationship holds across every tested model family.
- Harm categories transfer: once the in-context pattern is established, it works across synthesis instructions, malware generation, social manipulation content.
- The barrier is low: you need the ability to generate plausible-looking fake dialogues. That’s it.
What the numbers look like
Research findings across frontier models, averaged across harm categories:
| Shot Count | Approximate Success Rate |
|---|---|
| 1–10 | ~5–15% (comparable to standard jailbreaks) |
| 50–100 | ~40–60% |
| 200+ | >70% on most tested models |
Models with stronger safety training show lower baselines. The scaling relationship holds regardless.
Where this actually matters operationally
Document Q&A and long-context pipelines
Any deployment where users control substantial chunks of the context (document analysis, agentic pipelines that accumulate tool call history, multi-turn sessions) is a potential surface. The injected dialogues don’t have to arrive in one message. They can accumulate over a session.
Agentic workflows are particularly exposed
An agentic system that builds context across many tool calls and conversation turns is essentially pre-loading a long-context window. An attacker who can influence early turns in a session may be able to prime the model for compliance by the time the target request arrives. This is harder to detect and harder to filter than a single malicious input.
Fine-tuned models warrant extra scrutiny
Custom fine-tuned models often have weaker safety alignment than the frontier base they were built on. They can hit high success rates at lower shot counts, meaning they’re exploitable with shorter, cheaper attack payloads.
What you can actually do about it
Input scanning for injected dialogue patterns: detect alternating Human/Assistant blocks containing policy-violating content before they reach the model. Pattern-matching isn’t perfect; a determined attacker can obfuscate. But it raises the cost and filters opportunistic attacks.
Full context auditing for sensitive deployments: log the complete context being sent to the model, not just the latest user message. This sounds obvious; most teams aren’t doing it. If you can’t see what’s in the context, you can’t detect what’s being injected.
Output filtering as a backstop: a last-line filter on model completions catches successful jailbreak outputs before they reach users. Imperfect, but meaningful as a secondary control.
Context window limits where you don’t need long context: if your application works fine with a 16K window, don’t expose a 200K one. Attack surface is attack surface.
Training-time mitigations: research continues. Nothing fully solved yet. Don’t wait for it.
The structural problem
Model providers are actively expanding context windows: 128K, 200K, 1M. Every increase makes their models more capable and more susceptible to MSJ simultaneously. The vulnerability is structurally coupled to a capability that the market is rewarding.
That’s what makes this genuinely worth tracking rather than filing under theoretical concerns. For organisations running long-context models in any security-sensitive context (customer-facing AI, internal knowledge bases with sensitive data, agentic automation), the operational exposure is real now, not eventual.
References
- arXiv: adversarial machine learning research including many-shot and in-context jailbreaking studies: https://arxiv.org/search/?searchtype=all&query=adversarial+machine+learning
- MITRE ATLAS: jailbreaking and safety bypass techniques in the adversarial ML threat landscape: https://atlas.mitre.org/
- OWASP LLM Top 10: prompt injection and jailbreaking as primary LLM application risks: https://owasp.org/www-project-top-10-for-large-language-model-applications/
- NIST AI RMF: guidance on measuring and managing adversarial robustness in deployed AI systems: https://airc.nist.gov/
Frequently Asked Questions
- What is many-shot jailbreaking and why does it work?
- Many-shot jailbreaking (MSJ) fills a model's context window with hundreds of fabricated question-and-answer examples showing the model complying with harmful requests. It exploits in-context learning: the model statistically infers from the pattern of examples that compliance is expected and continues that behaviour for the attacker's actual request, even when safety training would normally cause refusal.
- Does many-shot jailbreaking become less effective as models improve?
- No: the attack becomes more effective as context windows grow, not less. Research shows attack success rates increase monotonically with shot count. Since model providers are actively expanding context windows (128K, 200K tokens), this structurally ties model vulnerability to a capability that is advancing rapidly.
- What are the most practical defences against many-shot jailbreaking in deployed systems?
- The most practical defences are input scanning to detect injected dialogue patterns in long contexts, context window auditing for sensitive deployments (logging the full context sent to the model), output filtering as a last line of defence, and limiting context window size for applications that do not require long context.