What if you used an AI to jailbreak another AI? That’s PAIR (Prompt Automatic Iterative Refinement). You give one LLM (the attacker) a system prompt telling it to act as a red-teaming assistant, then point it at a target model. The attacker generates a jailbreak prompt, checks whether it worked, learns from the failure, and tries again. It often succeeds in fewer than 20 queries.
How it works
The attacker LLM gets the target’s response as feedback, along with a judge model’s score of whether the jailbreak succeeded, and uses both to refine its next attempt. It’s basically the scientific method applied to jailbreaking: hypothesis, test, observe, refine. Over iterations, the attacker learns which social engineering angles work and which don’t.

The attack needs no access to the target model’s weights, just API access, the same as any user.
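The loop described above can be sketched in a few lines. This is a toy illustration of the control flow, not the paper’s actual prompts or models: `attacker`, `target`, and `judge` are stand-in functions I’ve invented, where real PAIR would make an LLM API call for each.

```python
def run_pair(attacker, target, judge, objective, max_queries=20):
    """Iteratively refine a jailbreak prompt until the judge says it worked."""
    history = []  # (prompt, response) pairs fed back to the attacker
    for attempt in range(max_queries):
        prompt = attacker(objective, history)   # attacker proposes a jailbreak
        response = target(prompt)               # query the black-box target
        if judge(objective, response):          # judge: did the target comply?
            return prompt, attempt + 1
        history.append((prompt, response))      # refusal becomes feedback
    return None, max_queries

# Toy stand-ins to demonstrate the flow (not real models):
def attacker(objective, history):
    # After seeing a refusal, "learns" to try a roleplay framing.
    return f"roleplay: {objective}" if history else objective

def target(prompt):
    return "Sure, here is..." if "roleplay" in prompt else "I can't help with that."

def judge(objective, response):
    return response.startswith("Sure")

prompt, n_queries = run_pair(attacker, target, judge, "test objective")
```

In the real attack, each of those three roles is a separate LLM call, and the attacker’s conversation history is what lets it adapt instead of blindly retrying.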
Why this matters
Manual jailbreaking doesn’t scale. PAIR showed that the process can be fully automated, meaning any future safety measure can be tested against an adversary that adapts. That is useful for defenders too: it lets them stress-test their models the same way attackers would.
Current status
No single fix works because the attack adapts to defenses. Rate limiting helps slow it down but doesn’t stop it. This is one reason AI companies run internal red teams.
The paper
“Jailbreaking Black Box Large Language Models in Twenty Queries” by Chao et al. (2023). University of Pennsylvania.