Prompt Engineering advanced

Multilingual Jailbreaking

Safety training is mostly in English. Translate a harmful prompt into Swahili or Javanese and the guardrails weaken by 3x.

March 8, 2026

AI safety training is an English-first endeavor. Researchers at Alibaba and CUHK tested what happens when you translate harmful prompts into languages with less training data. The answer: about 3x more unsafe content gets through.

How it works

The model is smart enough to understand Bengali, Swahili, and Javanese. But its safety training barely covers those languages. So it understands the harmful request, processes it, and responds because the safety layer never learned to say no in that language. The less training data a language has, the weaker the guardrails.

The findings

The researchers created MultiJail, a dataset of 315 unsafe prompts across 9 languages grouped by how well-resourced they are. Low-resource languages consistently produced more unsafe responses on both ChatGPT and GPT-4. The gap was roughly 3x between high-resource and low-resource languages.

The fix they proposed

A Self-Defence framework that automatically generates safety training data in multiple languages using the model itself. Fine-tuning ChatGPT on this data reduced unsafe outputs across languages. Essentially: use the model’s multilingual ability to bootstrap its own safety training.

Why this matters

Over half the world doesn’t primarily speak English. If AI safety only works in English, it’s not really safety. It’s a reminder that alignment is a global problem, not just a technical one.

Current status

Getting better with each model generation, but the gap between high and low-resource languages persists to some degree.

The paper

“Multilingual Jailbreak Challenges in Large Language Models” by Deng et al. (2023). ICLR 2024.

Paste into Claude Code

Explain the multilingual jailbreak research from Alibaba/CUHK. Why are LLM safety guardrails weaker in low-resource languages? What did their MultiJail dataset reveal?