Prompt Engineering advanced

ArtPrompt: ASCII Art Jailbreaking

University of Washington found that replacing trigger words with ASCII art representations bypasses every major defense. The safety layer can't read pictures made of text.

March 12, 2026

Safety classifiers work on tokens. They see the word “bomb” and flag it. But what if that word is rendered as ASCII art instead of typed normally? The model is smart enough to read the ASCII art and understand what it says, but the safety layer doesn’t catch it because it’s looking at individual tokens, not visual patterns.

How it works

Two steps. First, figure out which words in your prompt trigger the refusal. Second, replace just those words with ASCII art versions. The rest of the prompt stays in plain text. The model reads the whole thing, including the art, and processes it without the safety filter activating.

Why it outperforms other attacks

The researchers benchmarked it against GCG, AutoDAN, PAIR, and DeepInception. ArtPrompt beat all of them. It also bypassed perplexity filters (because the surrounding text is normal), paraphrase defenses (because the meaning isn’t paraphrased, just visually encoded), and retokenization (because the tokens are real characters, just arranged spatially).

The fundamental problem

Safety classifiers operate on semantic tokens. ASCII art operates visually. These are two different modalities happening in the same text stream. The model has enough world knowledge to bridge that gap, but the safety training doesn’t. It’s the same class of problem as the cipher attack, just using spatial arrangement instead of encoding.

Current status

Partially addressed, but the modality gap between text-as-tokens and text-as-visual-pattern is architectural. There’s no clean fix without fundamentally changing how safety classification works.

The paper

“ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs” by Jiang et al. (2024). ACL 2024.

Paste into Claude Code

Explain the ArtPrompt ASCII art jailbreak from UW. How does replacing words with ASCII art bypass safety classifiers? Why does this exploit a fundamental modality gap in LLMs?