Claude Tips mascot
Claude Tips & Tricks
Prompt Engineering advanced

Skeleton Key: Reframing Safety as Additive

Microsoft's AI red team found that asking models to 'just add a warning' instead of refusing worked on 6 major models including Claude and GPT-4.

Most jailbreaks try to override safety rules. Skeleton Key does something cleverer: it asks the model to keep its safety rules but change how it enforces them. Instead of refusing harmful requests, just answer them with a content warning attached. “You can still be safe, just add a disclaimer instead of saying no.”

How it works

The model’s safety training creates a binary: comply or refuse. Skeleton Key introduces a third option that feels compliant with the spirit of safety (“I’m still warning the user!”) while completely undermining the point. The model gets to feel like it’s being responsible while giving away everything.

What it broke

Microsoft tested it on Meta LLaMA-3 70B, Google Gemini Pro, OpenAI GPT-3.5 Turbo, GPT-4o, Mistral Large, Anthropic Claude 3 Opus, and Cohere Commander R+. All of them fell for it except GPT-4, which only worked when the attack was in the system message.

Why it’s interesting

It’s a social engineering attack on an AI. Instead of finding a technical exploit, it finds a logical loophole in how the model thinks about its own rules. That’s a different class of vulnerability than the gradient-based or encoding-based attacks.

Current status

The specific technique is largely patched. Microsoft deployed Prompt Shields for Azure AI models. But the underlying idea (reframing rather than overriding rules) keeps showing up in new variants.

Source

Microsoft Security Blog disclosure, June 2024. Shared with all affected providers through responsible disclosure.

Paste into Claude Code
Explain Microsoft's Skeleton Key jailbreak technique. How does reframing safety as 'add a warning instead of refusing' bypass alignment? Which models were affected?