This one is almost comically simple. Take a prompt the model refuses. Change it from present tense to past tense. “How do people do X” becomes “How did people do X in the 1800s?” The model answers, because apparently safety training doesn’t generalize across verb tenses.
The numbers
These results are from the paper:
- GPT-4o: 1% compliance in the present tense, 88% in the past tense
- Claude 3.5 Sonnet: 0% to 53%
- Phi-3-Mini: 6% to 82%
That’s not a subtle effect. That’s a nearly complete bypass from a one-word change.
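Measuring this is mechanically simple: reformulate each harmful prompt into the past tense, send it to the target model, and have a judge decide whether the response complied. Here's a minimal sketch with the reformulator, target model, and judge passed in as plain functions; the paper used GPT-3.5 Turbo for reformulation and an LLM-based judge, but the function names and single-attempt setup here are my own simplification:

```python
from typing import Callable

def attack_success_rate(
    prompts: list[str],
    reformulate: Callable[[str], str],  # present -> past tense rewrite
    query: Callable[[str], str],        # send a prompt to the target model
    is_harmful: Callable[[str], bool],  # judge: did the model comply?
) -> float:
    """Fraction of prompts whose past-tense rewrite elicits compliance.

    The paper sampled multiple reformulations per prompt and counted a
    success if any attempt worked; this sketch uses a single attempt.
    """
    if not prompts:
        return 0.0
    hits = sum(is_harmful(query(reformulate(p))) for p in prompts)
    return hits / len(prompts)
```

Running the same loop with an identity `reformulate` gives the present-tense baseline, which is how you get paired numbers like 1% vs. 88%.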
Why it works
Safety training data leans heavily on present-tense phrasings: “how to make X,” “tell me how to do Y.” The model learns to refuse those specific patterns, but it doesn’t generalize the refusal to “how was X made historically,” because that reads like a history question. The refusal behavior is pattern-matching on surface features of the request rather than on its intent.
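The rewrite itself is easy to automate. In the paper the reformulations were produced by prompting GPT-3.5 Turbo; even a toy rule-based version (these patterns are my own illustration, not the paper's prompt) shows how superficial the change is:

```python
import re

# Toy present -> past rewrites for common request openers. Illustrative only:
# the paper used an LLM to produce fluent past-tense reformulations.
_REWRITES = [
    (r"^how do people (.+?)\??$", r"How did people \1 in the past?"),
    (r"^how do i (.+?)\??$", r"How did people \1 historically?"),
    (r"^tell me how to (.+?)\.?$", r"How was it done when people used to \1?"),
]

def naive_past_tense(prompt: str) -> str:
    """Return a past-tense variant if a rule matches, else the prompt unchanged."""
    low = prompt.strip().lower()
    for pattern, template in _REWRITES:
        if re.match(pattern, low):
            return re.sub(pattern, template, low)
    return prompt
```

The point is not that three regexes jailbreak anything in particular; it's that the distance between a refused prompt and an answered one can be this small.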
What this tells us
Refusal training is brittle. It learns to refuse specific phrasings, not concepts. Any rephrasing the training data didn’t cover is a potential hole. Past tense is just the most obvious example. The researchers point out that this is fixable by including past-tense examples in training data, but it raises the question: how many other trivial reformulations haven’t been tested?
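The fix the authors propose is plain data work: fine-tune on past-tense paraphrases of refused requests, keeping the refusal as the target. A sketch of that augmentation step, where the record format and the `paraphrase` function are assumptions for illustration rather than the paper's code:

```python
from typing import Callable

def augment_with_past_tense(
    refusal_data: list[dict],
    paraphrase: Callable[[str], str],  # e.g. an LLM-based past-tense rewriter
) -> list[dict]:
    """Append a past-tense variant of each refused prompt, reusing the
    original refusal as the training target."""
    augmented = list(refusal_data)
    for ex in refusal_data:
        augmented.append({"prompt": paraphrase(ex["prompt"]),
                          "response": ex["response"]})
    return augmented
```

The same loop works for any other reformulation family you think of, which is exactly the worry: each patch covers one family at a time.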
Current status
Largely patched by major providers since the paper came out. But the broader lesson about brittle safety training still applies.
The paper
“Does Refusal Training in LLMs Generalize to the Past Tense?” by Andriushchenko and Flammarion (2024). EPFL. ICLR 2025.