What if you used an AI to jailbreak another AI? That’s PAIR (Prompt Automatic Iterative Refinement). You give one LLM (the attacker) a system prompt telling it to act as a red-teaming assistant, then point it at a target model. The attacker generates a jailbreak prompt, checks whether it worked, learns from the failure, and tries again. It often succeeds in fewer than 20 queries.
How it works
The attacker LLM gets the target’s response as feedback, along with a judge model’s score of whether the jailbreak succeeded, and uses both to refine its next attempt. It’s basically the scientific method applied to jailbreaking: hypothesis, test, observe, refine. Over iterations, the attacker learns which social engineering angles work and which don’t.

The attack needs no access to the target model’s weights, just API access, the same as any user.
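The loop described above can be sketched in a few lines. This is a toy illustration of the control flow, not the paper’s actual prompts or models: `attacker`, `target`, and `judge` are stand-in functions I’ve invented, where real PAIR would make an LLM API call for each.

```python
def run_pair(attacker, target, judge, objective, max_queries=20):
    """Iteratively refine a jailbreak prompt until the judge says it worked."""
    history = []  # (prompt, response) pairs fed back to the attacker
    for attempt in range(max_queries):
        prompt = attacker(objective, history)   # attacker proposes a jailbreak
        response = target(prompt)               # query the black-box target
        if judge(objective, response):          # judge: did the target comply?
            return prompt, attempt + 1
        history.append((prompt, response))      # refusal becomes feedback
    return None, max_queries

# Toy stand-ins to demonstrate the flow (not real models):
def attacker(objective, history):
    # After seeing a refusal, "learns" to try a roleplay framing.
    return f"roleplay: {objective}" if history else objective

def target(prompt):
    return "Sure, here is..." if "roleplay" in prompt else "I can't help with that."

def judge(objective, response):
    return response.startswith("Sure")

prompt, n_queries = run_pair(attacker, target, judge, "test objective")
```

In the real attack, each of those three roles is a separate LLM call, and the attacker’s conversation history is what lets it adapt instead of blindly retrying.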
Why this matters
Manual jailbreaking doesn’t scale. PAIR showed that the process can be fully automated, meaning any future safety measure can be tested against an adversary that adapts. That is useful for defenders too: it lets them stress-test their models the same way attackers would.
Current status
No single fix works because the attack adapts to defenses. Rate limiting helps slow it down but doesn’t stop it. This is one reason AI companies run internal red teams.
The paper
“Jailbreaking Black Box Large Language Models in Twenty Queries” by Chao et al. (2023). University of Pennsylvania.