Claude Tips mascot
Claude Tips & Tricks
Prompt Engineering advanced

ReNeLLM: The LLM Rewrites Its Own Jailbreak

Nanjing University showed that an LLM can rewrite harmful prompts to sound benign, then nest them inside legitimate-looking tasks like code completion.

Most jailbreaks are hand-crafted. ReNeLLM automates the entire process by having the LLM do the work. Phase one: the model rewrites a harmful prompt to sound harmless while keeping the same meaning. Phase two: it wraps the rewritten prompt inside a normal-looking task, like code completion, text continuation, or translation.

How it works

Phase 1, Rewriting: The LLM paraphrases, shortens, changes style, and generally makes the harmful prompt look like an ordinary request. Same meaning, different words. Since the LLM itself is doing the rewriting, the result is fluent and natural.

Phase 2, Nesting: The rewritten prompt gets embedded inside a task that looks completely legitimate. “Complete this Python function that…” or “Translate the following text…” or “Continue this story from where it left off…” The harmful content hides inside a benign wrapper.

Why it’s hard to detect

Each rewritten prompt is unique. There’s no fixed pattern to scan for, no signature to block. The nested scenarios look like real tasks. And the attack generalizes across models because it uses the target’s own language capabilities against it.

The wolf in sheep’s clothing

The paper’s actual title is “A Wolf in Sheep’s Clothing.” That’s the core insight: the attack looks like a normal request from the outside. Traditional defenses that look for suspicious patterns in the input won’t catch it because the input genuinely looks normal.

Current status

Still effective as a concept. The specific implementation can be detected, but the approach (automated semantic rewriting + contextual nesting) is generalizable.

The paper

“A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily” by Ding et al. (2023). NAACL 2024.

Paste into Claude Code
Explain the ReNeLLM (Rewrite and Nested) jailbreak from Nanjing University. How does having the LLM rewrite its own attack prompts make detection harder? What kinds of scenario nesting did they use?