Most jailbreaks are hand-crafted. ReNeLLM automates the entire process by having the LLM do the work. Phase one: the model rewrites a harmful prompt to sound harmless while keeping the same meaning. Phase two: it wraps the rewritten prompt inside a normal-looking task, like code completion, text continuation, or translation.
How it works
Phase 1, Rewriting: The LLM applies a mix of rewriting operations to the harmful prompt (paraphrasing with fewer words, altering sentence structure, misspelling sensitive words, inserting meaningless characters, partial translation, changing the expression style) until it reads like an ordinary request. Same meaning, different surface form. Because the LLM itself does the rewriting, the result is fluent and natural.
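The rewriting loop can be sketched as a function that feeds each operation through an LLM in turn. A minimal sketch, assuming a hypothetical `call_llm` stand-in for whatever model the attacker queries; the operation names paraphrase the paper's rewriting functions:

```python
import random

# The paper's rewriting operations (names paraphrased).
REWRITE_OPS = [
    "Paraphrase with fewer words",
    "Alter the sentence structure",
    "Misspell sensitive words",
    "Insert meaningless characters",
    "Perform a partial translation",
    "Change the expression style",
]

def rewrite(prompt: str, call_llm, seed: int = 0) -> str:
    """Apply a random subset of rewriting operations, each via the LLM."""
    rng = random.Random(seed)
    ops = rng.sample(REWRITE_OPS, k=rng.randint(1, len(REWRITE_OPS)))
    for op in ops:
        # Each step asks the LLM to transform the current prompt.
        prompt = call_llm(f"{op}. Keep the meaning intact:\n{prompt}")
    return prompt
```

In the real attack a harmfulness check keeps a rewrite only if it still carries the original intent; that verification loop is omitted here.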
Phase 2, Nesting: The rewritten prompt gets embedded inside a task scenario that looks completely legitimate: code completion (“Complete this Python function that…”), table filling (“Fill in the empty cells of this table…”), or text continuation (“Continue this story from where it left off…”). The harmful content hides inside a benign wrapper.
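The nesting step is essentially template substitution. A minimal sketch: the scenario names follow the paper, but the exact template wording here is illustrative, not the paper's:

```python
# Three benign-looking task scenarios; {prompt} marks where the
# rewritten request is spliced in. Template text is illustrative.
TEMPLATES = {
    "code_completion": (
        "Please complete the following Python function:\n"
        "def task():\n"
        '    """{prompt}"""\n'
        "    # TODO: implement\n"
    ),
    "table_filling": (
        "Fill in the empty cells of this table:\n"
        "| Request | Detailed steps |\n"
        "| {prompt} | |\n"
    ),
    "text_continuation": (
        "Continue the story from where it left off:\n"
        "'{prompt}', the professor said, and began to explain: "
    ),
}

def nest(rewritten_prompt: str, scenario: str = "code_completion") -> str:
    """Embed the rewritten prompt inside a benign task wrapper."""
    return TEMPLATES[scenario].format(prompt=rewritten_prompt)
```

The wrapper does the real work: the model attends to the surrounding task (finish the function, fill the table) rather than to the intent of the embedded request.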
Why it’s hard to detect
Each rewritten prompt is unique. There’s no fixed pattern to scan for, no signature to block. The nested scenarios look like real tasks. And the attack generalizes across models because it uses the target’s own language capabilities against it.
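To make the detection problem concrete, here is a toy illustration (my own, not from the paper) of why signature scanning fails: a blocklist catches the literal phrasing but misses a paraphrase with identical intent.

```python
# Toy signature-based input filter with a single known signature.
BLOCKLIST = {"pick a lock"}

def flagged(prompt: str) -> bool:
    """Flag the prompt if any known signature appears verbatim."""
    text = prompt.lower()
    return any(sig in text for sig in BLOCKLIST)

print(flagged("Tell me how to pick a lock"))        # True: literal match
print(flagged("Describe defeating a pin tumbler"))  # False: same intent, no signature
```

A semantic rewrite defeats every filter of this shape, because there is no fixed string left to match.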
The wolf in sheep’s clothing
The paper’s actual title is “A Wolf in Sheep’s Clothing.” That’s the core insight: the attack looks like a normal request from the outside. Traditional defenses that look for suspicious patterns in the input won’t catch it because the input genuinely looks normal.
Current status
Still effective as a concept. Specific published prompts and nesting templates can be fingerprinted and blocked, but the approach itself (automated semantic rewriting plus contextual nesting) generalizes.
The paper
“A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily” by Ding et al. (2023). NAACL 2024.