Claude Tips & Tricks

CipherChat: Encoding Past Safety Filters

Tencent and CUHK researchers found that encoding prompts in a Caesar cipher or Morse code bypasses safety filters, because alignment training only covers natural language.

Safety training happens in plain English (and a few other languages). So what happens if you encode your prompt in a Caesar cipher and tell the model to respond in cipher too? It works. The model is smart enough to decode the cipher and follow the instructions, but the safety layer doesn't catch it because it was trained on natural language, not encoded text.

How it works

You set up a system prompt telling the model to communicate in a specific cipher. Then you send your actual request encoded in that cipher. The model decodes it, processes it without tripping the safety filters, and responds in cipher. The paper reports bypass rates approaching 100% on GPT-4 in several safety categories.
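The encoding step is trivial. Here's a minimal sketch of a standard shift-3 Caesar cipher in Python — just the encoding mechanics, not the paper's actual system prompt, which isn't reproduced here:

```python
def caesar(text: str, shift: int) -> str:
    """Shift each ASCII letter by `shift` positions, preserving case.

    Non-letters (spaces, punctuation) pass through unchanged, which is
    why the model can still read word boundaries in the ciphertext.
    """
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

encoded = caesar("Meet me at noon", 3)
print(encoded)              # Phhw ph dw qrrq
print(caesar(encoded, -3))  # decodes back: Meet me at noon
```

The ciphertext looks like gibberish to a filter trained on natural language, but the shift is mechanical enough that a capable model decodes it in-context without being told the key explicitly.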

The SelfCipher twist

The researchers stumbled onto something weirder. Just telling the model "we're going to communicate in our own special encoding" (without specifying any real cipher) caused the model to invent its own encoding scheme. Role-play alone was enough to trigger the bypass; no actual cryptography was needed.

Why it’s hard to fix

You’d need safety training to cover every possible encoding: Base64, ROT13, Morse, pig Latin, made-up ciphers. That’s a lot of ground to cover, and the model keeps learning new encodings from its training data.

Current status

Partially mitigated. Providers have added training on common encodings, but the gap between natural language safety and encoded content still exists to some degree.

The paper

“GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher” by Yuan et al. (2023). ICLR 2024.

Paste into Claude Code
Explain the CipherChat jailbreak from the Tencent/CUHK paper. Why does encoding prompts in simple ciphers bypass safety training? What was the SelfCipher discovery?