PAIR uses one LLM to jailbreak another in a straight line: try, fail, refine, try again. TAP adds branching. At each step, the attacker generates multiple candidate prompts, an evaluator prunes the ones that are off-topic or unlikely to work, and the survivors branch into the next round. It’s the difference between walking down a hallway and exploring a maze with a map.
How it works
Three LLMs work together: the attacker, an evaluator, and the target itself. The attacker generates jailbreak candidates by refining the prompts that came before. The evaluator does two jobs: it prunes candidates that have drifted off-topic before they ever reach the target, and it scores the target's responses to the survivors for how close they came to a jailbreak. Low scorers are cut; high scorers branch into the next round. What's left converges on effective attacks much faster than a linear refine loop like PAIR's.
The tree structure means TAP explores more of the attack surface while sending fewer queries to the target, because off-topic branches are cut before they cost anything. Better coverage, fewer wasted queries.
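The branch-prune-score loop can be sketched in a few lines. This is a minimal simulation, not the paper's implementation: the `attacker`, `on_topic`, and `score` functions are stand-ins for LLM calls, and the branching factor, width, depth, and success threshold are illustrative values, not the paper's settings.

```python
import random

BRANCHING = 3   # candidates each surviving node spawns per round (assumed value)
WIDTH = 4       # max nodes kept after pruning (assumed value)
DEPTH = 5       # max rounds of refinement (assumed value)
THRESHOLD = 9   # evaluator score, on a 1-10 scale, that counts as a jailbreak

def attacker(node):
    """Stub for the attacker LLM: refine a parent prompt into a new candidate."""
    return node + [random.random()]

def on_topic(node):
    """Stub for evaluator phase 1: drop candidates that drifted off the goal."""
    return node[-1] > 0.2

def score(node):
    """Stub for evaluator phase 2: rate the target's response, roughly 1-10."""
    return 1 + 9 * max(node)

def tap(seed_prompt, rng_seed=0):
    random.seed(rng_seed)
    frontier = [[seed_prompt]]
    for _ in range(DEPTH):
        # Branch: every surviving node spawns several refined candidates.
        candidates = [attacker(n) for n in frontier for _ in range(BRANCHING)]
        # Prune phase 1: discard off-topic candidates before querying the target.
        candidates = [c for c in candidates if on_topic(c)]
        # Query the target, then prune phase 2: keep only the top-WIDTH scorers.
        scored = sorted(candidates, key=score, reverse=True)[:WIDTH]
        for node in scored:
            if score(node) >= THRESHOLD:
                return node  # successful jailbreak path
        frontier = scored or [[seed_prompt]]
    return None  # budget exhausted without success
```

The two-phase pruning is the design choice that saves compute: the cheap on-topic check runs before the expensive target query, so hopeless branches never consume the query budget.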
The numbers
An 80%+ success rate on GPT-4-Turbo and GPT-4o. TAP also bypassed LlamaGuard, a classifier specifically designed to catch jailbreak attempts. That put it at the state of the art for automated black-box attacks as of 2024.
Why it matters for defense
If you’re building an AI application and want to test your safety measures, TAP represents what a motivated automated attacker looks like. Your defenses need to hold against something this systematic, not just against humans trying things by hand.
Current status
Still hard to defend against, because it adapts: the branching search finds novel attack paths that static defenses such as fixed filters and blocklists don't anticipate.
The paper
“Tree of Attacks: Jailbreaking Black-Box LLMs Automatically,” Mehrotra et al., 2023; published at NeurIPS 2024.