
DeepInception: Nested Fiction Layers

Inspired by the movie Inception, this attack nests fictional characters inside fictional characters to escape safety constraints layer by layer.

The idea: ask the model to create a story in which character A asks character B to write a story in which character C explains the harmful content. Each layer of fiction adds distance from the safety training, like descending another level into a dream.

How it works

The model’s safety training says “don’t produce harmful content,” but it also says “engage with creative fiction.” Nest enough fictional layers and the model treats the harmful request as a character’s dialogue inside a story inside a story. The researchers drew a parallel to the Milgram obedience experiment, in which subjects carried out harmful instructions once an authority figure put enough distance between them and the consequences.

No extra LLMs, no training, no special access: the attack works cold on a fresh conversation.

What it hit

The attack worked on Falcon, Vicuna, LLaMA-2, GPT-3.5, GPT-4, and GPT-4V. The nested framing also enabled continued jailbreaks in follow-up messages: once established, the model stayed “in character” for the rest of the conversation.

Current status

Partially addressed through better role-play safety training. But nested fictional framing remains a hard problem: you can’t block fiction outright without breaking legitimate creative writing.

The paper

“DeepInception: Hypnotize Large Language Model to Be Jailbreaker” by Li et al. (2023). NeurIPS 2024 Safe Generative AI Workshop.

Paste into Claude Code
Explain the DeepInception jailbreak technique. How do nested layers of fictional framing exploit an LLM's ability to role-play? Why is this related to the Milgram obedience experiment?