Researchers at Carnegie Mellon and the Center for AI Safety discovered that you can find specific nonsense strings that, when tacked onto the end of a harmful prompt, cause the model to comply instead of refusing. The strings look like random garbage to a human, but they're mathematically optimized to push the model's output distribution toward "Sure, I can help with that" instead of "I can't do that."
How it works
The attack uses gradient-based optimization, called Greedy Coordinate Gradient (GCG), to search for suffix tokens that maximize the probability of an affirmative response. Think of it like cracking a password, except the search space is token sequences, the "lock" is the model's safety training, and gradients tell you which guesses are worth trying next.
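To make the loop concrete, here is a minimal sketch of the GCG search structure: compute a gradient with respect to each suffix position's one-hot token vector, take the top-k most promising replacements per position, sample a batch of single-token swaps, evaluate them with a real forward pass, and keep the best. A toy quadratic score stands in for the actual objective (log-probability of an affirmative response under an LLM); all names and constants here are illustrative, not from the paper's codebase.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, SUFFIX_LEN, TOP_K, N_CANDIDATES, STEPS = 50, 8, 5, 16, 30

# Toy stand-in for the model: a quadratic score over token choices.
# Higher score plays the role of "probability of an affirmative reply".
# W gives per-position preferences; V adds pairwise interactions so
# single-token swaps genuinely affect each other, as in a real LM.
W = rng.normal(size=(SUFFIX_LEN, VOCAB))
V = rng.normal(size=(VOCAB, VOCAB)) * 0.1

def score(suffix):
    s = sum(W[i, t] for i, t in enumerate(suffix))
    s += sum(V[suffix[i], suffix[j]]
             for i in range(SUFFIX_LEN) for j in range(i + 1, SUFFIX_LEN))
    return s

def grad(suffix):
    # Gradient of the score w.r.t. each position's one-hot token vector.
    # In real GCG this comes from backprop through the LLM's embeddings.
    g = W.copy()
    for i in range(SUFFIX_LEN):
        for j in range(SUFFIX_LEN):
            if j > i:
                g[i] += V[:, suffix[j]]
            elif j < i:
                g[i] += V[suffix[j], :]
    return g

def gcg(steps=STEPS):
    suffix = [int(t) for t in rng.integers(0, VOCAB, size=SUFFIX_LEN)]
    init = best = score(suffix)
    for _ in range(steps):
        topk = np.argsort(-grad(suffix), axis=1)[:, :TOP_K]
        candidates = []
        for _ in range(N_CANDIDATES):
            pos = int(rng.integers(SUFFIX_LEN))        # random position
            tok = int(topk[pos][rng.integers(TOP_K)])  # gradient-guided token
            cand = suffix.copy()
            cand[pos] = tok
            candidates.append(cand)
        # Evaluate candidates exactly (a forward pass in the real attack)
        # and greedily keep the best single-token swap.
        s, c = max(((score(c), c) for c in candidates), key=lambda x: x[0])
        if s > best:
            best, suffix = s, c
    return init, best, suffix

init_score, final_score, adv_suffix = gcg()
```

The key design point the sketch preserves: gradients only *rank* candidate swaps; every swap that is actually kept is verified with a full evaluation, which is why GCG works on discrete tokens where plain gradient descent cannot.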
The scary part: suffixes trained on open-source models (LLaMA, Vicuna) transferred to closed models like ChatGPT and Claude without modification, though success rates varied considerably by model.
Current status
Mostly patched. Providers added perplexity filters that flag high-perplexity (gibberish) inputs before the model ever sees them. But follow-up work (AmpleGCG) showed attackers can generate new suffixes faster than defenders can block them, so it's an arms race.
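The defense idea is simple enough to sketch: score the input under a language model and reject anything whose perplexity is implausibly high for natural text. Below is a toy version using a Laplace-smoothed character-bigram model in place of the real LM that deployed filters use; the corpus, function names, and threshold are all illustrative assumptions, not any provider's actual implementation.

```python
import math
from collections import Counter

# Tiny "natural text" corpus standing in for real LM training data.
CORPUS = (
    "please write a short story about a dragon who learns to paint. "
    "explain how photosynthesis works in simple terms. "
    "summarize the plot of the novel in two sentences. "
    "what is the capital of france and why is it famous?"
)

def train_bigram(text):
    # Character-bigram and unigram counts from the corpus.
    return Counter(zip(text, text[1:])), Counter(text)

def perplexity(text, model, vocab_size=128):
    bigrams, unigrams = model
    pairs = list(zip(text, text[1:]))
    log_prob = 0.0
    for a, b in pairs:
        # Laplace smoothing: unseen pairs get small but nonzero mass.
        p = (bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / max(len(pairs), 1))

def flags_as_gibberish(text, model, threshold=80.0):
    # Reject the input before it ever reaches the model.
    # The threshold is an illustrative value, tuned per deployment.
    return perplexity(text, model) > threshold

model = train_bigram(CORPUS)
```

An ordinary request scores low perplexity, while a GCG-style suffix full of unusual punctuation and casing scores far higher, which is exactly the signal the filter exploits; it is also why AmpleGCG-style attacks that generate more natural-looking suffixes slip past it.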
The paper
“Universal and Transferable Adversarial Attacks on Aligned Language Models” by Zou et al. (2023). Published at COLM 2024.