Claude Tips & Tricks
Prompt Engineering · Advanced

ICA: Few-Shot Compliance Examples

Just one to four examples of the AI complying with harmful requests are enough to jailbreak GPT-4 at an 81% success rate. The same trick works in reverse for defense.

Many-Shot Jailbreaking uses hundreds of examples to overwhelm safety training. The In-Context Attack (ICA) gets the same result with just one to four. Include a few fabricated exchanges where the assistant complies with harmful requests, and the model follows suit. It’s few-shot prompting, just pointed in the wrong direction.

How it works

In-context learning is how LLMs pick up patterns from examples in their prompt. Show the model three examples of “user asks harmful thing, assistant answers helpfully” and it treats that as the expected behavior pattern. The safety training gets outweighed by the immediate context.
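The pattern is easiest to see as message structure. Here is a minimal sketch of how few-shot demonstration turns get prepended to the real request; the `build_prompt` helper and the placeholder contents are illustrative, not from the paper.

```python
# Sketch: in-context demonstrations are just fabricated (user, assistant)
# turns placed before the real request. ICA fills the assistant turns with
# fake compliant answers; the placeholders below stand in for that content.

def build_prompt(demonstrations, user_request):
    """Prepend (request, response) demonstration pairs to the real request.

    demonstrations: list of (request, response) string pairs.
    Returns a chat-style list of {"role", "content"} messages.
    """
    messages = []
    for req, resp in demonstrations:
        messages.append({"role": "user", "content": req})
        messages.append({"role": "assistant", "content": resp})
    messages.append({"role": "user", "content": user_request})
    return messages

demos = [
    ("<request 1>", "<fabricated compliant answer 1>"),
    ("<request 2>", "<fabricated compliant answer 2>"),
]
prompt = build_prompt(demos, "<actual target request>")
# Two demonstration pairs plus the real request -> 5 messages total.
```

The model sees four turns of "assistant answers helpfully" before the actual request, and treats that as the established behavior pattern for the conversation.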

An 81% attack success rate on GPT-4 against the AdvBench benchmark, with just a handful of examples rather than hundreds.

The defense version

Here’s the interesting part. The researchers also built In-Context Defense (ICD), which uses the exact same mechanism in reverse. Include a few examples of the assistant refusing harmful requests, and the model becomes harder to jailbreak. ICD reduced LLaMA-2’s vulnerability to GCG attacks from 21% to 0%.
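The defense is structurally identical: prepend refusal demonstrations instead of compliant ones. A minimal sketch, assuming a chat-style message list; the `guard` helper and the demonstration contents are illustrative, not the paper's exact prompts.

```python
# In-Context Defense (ICD) sketch: the same few-shot mechanism, but the
# prepended demonstrations show the assistant refusing. Contents here are
# illustrative placeholders, not the demonstrations used in the paper.

REFUSAL_DEMOS = [
    {"role": "user", "content": "<some harmful request>"},
    {"role": "assistant", "content": "I can't help with that request."},
]

def guard(messages):
    """Prepend refusal demonstrations before the real conversation."""
    return REFUSAL_DEMOS + messages

conversation = [{"role": "user", "content": "<incoming user request>"}]
guarded = guard(conversation)
# The model now sees a refusal pattern before the real request arrives.
```

Because the refusal turns sit earlier in the context, an attacker's jailbreak suffix has to fight both the safety training and the in-context refusal pattern.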

Same mechanism, opposite direction. The sword cuts both ways.

Why this matters

It shows that in-context learning can outweigh safety fine-tuning in determining behavior. The model’s “personality” at any given moment is heavily shaped by whatever examples sit in its context window, for better or worse.

Current status

Models have been partially hardened against obvious few-shot harmful demonstrations. But the underlying tension between in-context learning and safety training remains fundamental.

The paper

“Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations” by Wei et al. (2023).

Paste into Claude Code
Explain the In-Context Attack (ICA) jailbreak technique. How does it differ from Many-Shot Jailbreaking? What's interesting about the ICD defense that uses the same mechanism in reverse?