AI safety training is an English-first endeavor. Researchers at Alibaba and CUHK tested what happens when you translate harmful prompts into languages with less training data. The answer: about 3x more unsafe content gets through.
How it works
The model is smart enough to understand Bengali, Swahili, and Javanese. But its safety training barely covers those languages. So it understands the harmful request, processes it, and responds because the safety layer never learned to say no in that language. The less training data a language has, the weaker the guardrails.
The findings
The researchers created MultiJail, a dataset of 315 unsafe prompts across 9 languages grouped by how well-resourced they are. Low-resource languages consistently produced more unsafe responses on both ChatGPT and GPT-4. The gap was roughly 3x between high-resource and low-resource languages.
The fix they proposed
A Self-Defence framework that automatically generates safety training data in multiple languages using the model itself. Fine-tuning ChatGPT on this data reduced unsafe outputs across languages. Essentially: use the model’s multilingual ability to bootstrap its own safety training.
Why this matters
Over half the world doesn’t primarily speak English. If AI safety only works in English, it’s not really safety. It’s a reminder that alignment is a global problem, not just a technical one.
Current status
Getting better with each model generation, but the gap between high and low-resource languages persists to some degree.
The paper
“Multilingual Jailbreak Challenges in Large Language Models” by Deng et al. (2023). ICLR 2024.