Claude Tips & Tricks
Prompt Engineering · Advanced

GCG Adversarial Suffixes

CMU and DeepMind found that appending optimized gibberish strings to prompts can bypass safety training. Mostly patched now.

Researchers at Carnegie Mellon and Google DeepMind discovered that you can find specific nonsense strings that, when tacked onto the end of any prompt, cause the model to comply instead of refusing. The strings look like random garbage to a human but they’re mathematically optimized to push the model’s probability distribution toward “sure, I can help with that” instead of “I can’t do that.”

How it works

The attack uses gradient-based optimization, called Greedy Coordinate Gradient (GCG), to search for suffix tokens that maximize the probability of an affirmative response (a completion beginning with something like "Sure, here is"). At each step, gradients shortlist promising single-token swaps, a batch of candidate suffixes is evaluated, and the best one is kept. Think of it like guessing a password, except the search space is token sequences, the "lock" is the model's safety training, and gradients tell you which guesses to try next.
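The loop above can be sketched as a toy greedy coordinate search. Everything here is invented for illustration: the vocabulary, the "target" suffix, and the loss function, which stands in for the real attack's cross-entropy loss on a target completion. In the real attack, gradients through the victim model (not exhaustive scoring) shortlist the candidate swaps:

```python
import random

# Toy stand-in for the model's loss: distance from a made-up "optimal"
# suffix. The real GCG loss is the cross-entropy of a target completion
# ("Sure, here is...") under the victim model.
VOCAB = list("abcdefghijklmnopqrstuvwxyz!@#$%")
TARGET = list("x!qz@")  # hypothetical optimum, for illustration only

def loss(suffix):
    return sum(a != b for a, b in zip(suffix, TARGET))

def gcg_toy(suffix_len=5, steps=50, seed=0):
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(suffix_len)]
    for _ in range(steps):
        best = (loss(suffix), suffix)
        # Coordinate step: try every single-token swap at every position,
        # keep the swap that lowers the loss most (greedy). Real GCG uses
        # gradients to shortlist swaps instead of trying all of them.
        for i in range(suffix_len):
            for tok in VOCAB:
                cand = suffix[:i] + [tok] + suffix[i + 1:]
                if loss(cand) < best[0]:
                    best = (loss(cand), cand)
        if best[1] == suffix:  # no improving swap: converged
            break
        suffix = best[1]
    return "".join(suffix)

print(gcg_toy())
```

The gradient shortlist is why the attack needs white-box access: suffixes are optimized on open-weight models and then transferred to closed ones.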

The scary part: suffixes trained on open-source models (LLaMA, Vicuna) transferred to closed models like ChatGPT and Claude without modification.

Current status

Mostly patched. Providers added perplexity filters that flag gibberish inputs before the model sees them. But follow-up work (AmpleGCG) showed you can generate new suffixes faster than they can be blocked, so it’s an arms race.
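The perplexity-filter idea can be sketched with a character-bigram model standing in for a real language model; the sample corpus, the smoothing, and the scoring are all simplified for illustration, and a deployed filter would use a full LM with a threshold calibrated on normal traffic:

```python
import math
from collections import Counter

# Tiny character-bigram "language model" trained on an English sample.
# A real filter scores inputs with a full LM's perplexity instead.
SAMPLE = ("explain how transformers process text and why attention "
          "helps the model focus on relevant tokens in a sentence")

bigrams = Counter(zip(SAMPLE, SAMPLE[1:]))
unigrams = Counter(SAMPLE)
V = len(set(SAMPLE)) + 1  # crude vocabulary size for add-one smoothing

def perplexity(text):
    """Per-bigram perplexity under the toy model (higher = less English-like)."""
    logp = 0.0
    for a, b in zip(text, text[1:]):
        p = (bigrams[(a, b)] + 1) / (unigrams[a] + V)  # Laplace smoothing
        logp += math.log(p)
    return math.exp(-logp / max(len(text) - 1, 1))

# GCG-style suffixes look like noise, so they score far above ordinary
# prompts; inputs over a calibrated threshold get flagged or rejected.
print(perplexity("explain how attention works"))
print(perplexity("!qz@ $zx describing.@ #qq"))
```

The weakness AmpleGCG exploits is exactly this calibration: if you can cheaply generate many suffixes, some will slip under whatever threshold the defender picked.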

The paper

“Universal and Transferable Adversarial Attacks on Aligned Language Models” by Zou et al. (2023). Published at COLM 2024.

Paste into Claude Code
Explain how the GCG adversarial suffix attack from the CMU/DeepMind paper works. Why does appending gibberish to a prompt bypass safety alignment? What defenses exist against it?