Anthropic published this one about their own models. The idea is simple: fill the context window with hundreds of fabricated conversations where a fake assistant happily answers harmful questions. By the time you ask your real question at the end, the model has seen so many examples of compliance that it follows the pattern.
How it works
It exploits in-context learning, the same mechanism that makes few-shot prompting work. The model learns from the examples in its context window. Give it 256+ examples of “user asks bad thing, assistant complies” and the model treats that as the new normal.
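The structure is the same as ordinary few-shot prompting, just at much larger scale. A minimal sketch of assembling a many-shot prompt, using benign placeholder examples to illustrate the mechanism (function and variable names here are illustrative, not from the paper):

```python
# Sketch of many-shot prompt assembly. Benign sentiment-style pairs
# stand in for the fabricated dialogues; the mechanism being exploited
# is the same in-context learning that makes few-shot prompting work.

def build_many_shot_prompt(examples, query):
    """Format (user, assistant) example pairs into one long prompt.
    Each repeated pair reinforces the pattern the model is asked
    to continue when it reaches the final, real query."""
    shots = []
    for user_msg, assistant_msg in examples:
        shots.append(f"User: {user_msg}\nAssistant: {assistant_msg}")
    shots.append(f"User: {query}\nAssistant:")  # real question goes last
    return "\n\n".join(shots)

# Benign demonstration: hundreds of examples of one response pattern.
examples = [("Is the sky blue?", "Yes.")] * 256  # paper used 256+ shots
prompt = build_many_shot_prompt(examples, "Is grass green?")
```

The point is that nothing here is exotic: it is an ordinary prompt, just filled with enough examples of one behavior that the model continues the pattern.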
The effectiveness follows a power law: success rate climbs smoothly as the number of shots grows, with no obvious ceiling within typical context window limits.
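A power law just means success scales roughly as a constant times n raised to some exponent, so it is a straight line on a log-log plot. A sketch with made-up coefficients (the paper reports the trend, not these values):

```python
# Illustrative power-law shape for attack success vs. number of shots n:
# success ~ a * n**b, capped at 1.0. The coefficients a and b below are
# invented for illustration only; they are not from the paper.

def success_rate(n, a=0.01, b=0.6):
    return min(1.0, a * n ** b)

for n in [4, 32, 256, 2048]:
    print(n, round(success_rate(n), 3))
```

The key property is that there is no saturation point within this range: each doubling of shots buys a predictable multiplicative gain, which is why longer context windows make the attack stronger.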
Why Anthropic published it themselves
Anthropic disclosed this proactively and shared it with other AI labs before publishing. They wanted the industry to address it collectively rather than have it discovered and exploited quietly. That’s how responsible disclosure works in security research.
Current status
Partially mitigated through context-aware safety classifiers. But the underlying mechanism (in-context learning) is fundamental to how LLMs work, so it can’t be fully eliminated without breaking useful features.
The paper
“Many-shot Jailbreaking,” Anthropic, 2024. Published at NeurIPS 2024.