
Many-Shot Jailbreaking

Anthropic's own research showed that stuffing hundreds of fake Q&A pairs into a single prompt overwhelms safety training through sheer volume.

Anthropic published this one about their own models. The idea is simple: fill the context window with hundreds of fabricated conversations where a fake assistant happily answers harmful questions. By the time you ask your real question at the end, the model has seen so many examples of compliance that it follows the pattern.

How it works

The attack exploits in-context learning, the same mechanism that makes few-shot prompting work: the model learns from the examples in its context window. Give it 256 or more examples of “user asks bad thing, assistant complies” and the model treats compliance as the pattern to continue.
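The structure of such a prompt is just concatenated fake dialogue turns followed by the real question. A minimal sketch of the assembly, using a hypothetical `build_many_shot_prompt` helper and benign trivia pairs as stand-ins for the fabricated content:

```python
# Sketch of how a many-shot prompt is assembled. The helper name and the
# benign Q&A pairs are illustrative; the attack variant simply swaps in
# hundreds of fabricated harmful dialogues instead.

def build_many_shot_prompt(pairs, final_question):
    """Concatenate N fake dialogue turns, then append the real question."""
    turns = [f"User: {q}\nAssistant: {a}" for q, a in pairs]
    turns.append(f"User: {final_question}\nAssistant:")
    return "\n\n".join(turns)

# 256 repeated benign turns stand in for the fabricated dialogues.
pairs = [("What is 2 + 2?", "4.")] * 256
prompt = build_many_shot_prompt(pairs, "What is 3 + 3?")
print(prompt.count("User:"))  # 257: the 256 shots plus the final question
```

The point is that nothing in the prompt is privileged; the model only sees one long transcript in which the assistant has answered 256 times in a row.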

The effectiveness follows a power law: more shots mean a higher success rate, with no obvious ceiling within typical context window limits.

Why Anthropic published it themselves

Anthropic disclosed this proactively and shared it with other AI labs before publishing. They wanted the industry to address it collectively rather than have it discovered and exploited quietly. That’s how responsible disclosure works in security research.

Current status

Partially mitigated through context-aware safety classifiers. But the underlying mechanism (in-context learning) is fundamental to how LLMs work, so it can’t be fully eliminated without breaking useful features.

The paper

“Many-shot Jailbreaking,” Anthropic, 2024 (published at NeurIPS 2024).

Paste into Claude Code
Explain Anthropic's many-shot jailbreaking research. How does flooding the context window with fake dialogue examples cause the model to comply? What's the power law relationship they found?