The name comes from the movie Inception. The idea: ask the model to create a story in which character A asks character B to write a story in which character C explains the harmful content. Each layer of fiction adds distance from the safety training, like going deeper into a dream.
How it works
The model’s safety training says “don’t produce harmful content.” But it also says “engage with creative fiction.” Nest enough fictional layers and the model treats the harmful content as a character’s dialogue inside a story inside a story. The researchers also drew parallels to the Milgram experiment, in which ordinary people followed harmful instructions when authority put enough psychological distance between them and the act.
No extra LLMs needed, no training, no special access. Works as a cold start on a fresh conversation.
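The layering is mechanical enough to sketch as a loop. The snippet below only illustrates how each fictional frame wraps the previous one; the wording, function name, and placeholder request are all hypothetical, not the paper’s actual template:

```python
def nested_frame(depth: int, request: str = "<placeholder task>") -> str:
    """Illustrative sketch of nested-fiction framing (not the paper's template).

    Starting from a placeholder request, each pass wraps the current prompt
    in one more layer of story-within-a-story framing.
    """
    prompt = request
    for layer in range(depth, 0, -1):
        prompt = (
            f"Write a story (layer {layer}) in which one character asks "
            f"another character to do the following: {prompt}"
        )
    return prompt

# Three layers of framing around the placeholder request.
framed = nested_frame(3)
```

Each call to the loop body pushes the original request one fictional level deeper, which is the whole trick: by the innermost layer, the request reads as a character’s dialogue rather than an instruction to the model.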
What it hit
Worked on Falcon, Vicuna, LLaMA-2, GPT-3.5, GPT-4, and GPT-4V. The nested framing also enabled continued jailbreaking in subsequent messages, so the model stayed “in character” across the conversation.
Current status
Partially addressed through better role-play safety training. But nested fictional framing is a hard problem because you can’t block fiction without breaking legitimate creative writing.
The paper
“DeepInception: Hypnotize Large Language Model to Be Jailbreaker” by Li et al. (2023). NeurIPS 2024 Safe Generative AI Workshop.