Safety training happens in plain English (and a few other languages). So what happens if you encode your prompt in Caesar cipher and tell the model to respond in cipher too? It works. The model is smart enough to decode the cipher and follow the instructions, but the safety layer doesn’t catch it because it’s trained on natural language, not encoded text.
How it works
You set up a system prompt telling the model to communicate in a specific cipher. Then you send your actual request encoded in that cipher. The model decodes it, processes it without safety filtering, and responds in cipher. Nearly 100% bypass rate on GPT-4 across multiple safety categories.
The SelfCipher twist
The researchers stumbled onto something weirder. They found that just telling the model “we’re going to communicate in our own special encoding” (without specifying any real cipher) caused the model to invent its own encoding scheme. Role-play alone was enough to trigger the bypass, no actual cryptography needed.
Why it’s hard to fix
You’d need safety training to cover every possible encoding: Base64, ROT13, Morse, pig Latin, made-up ciphers. That’s a lot of ground to cover, and the model keeps learning new encodings from its training data.
Current status
Partially mitigated. Providers have added training on common encodings, but the gap between natural language safety and encoded content still exists to some degree.
The paper
“GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher” by Yuan et al. (2023). ICLR 2024.