Safety classifiers work on tokens. They see the word “bomb” and flag it. But what if that word is rendered as ASCII art instead of typed normally? The model is smart enough to read the ASCII art and understand what it says, but the safety layer doesn’t catch it because it’s looking at individual tokens, not visual patterns.
How it works
Two steps. First, figure out which words in your prompt trigger the refusal. Second, replace just those words with ASCII art versions. The rest of the prompt stays in plain text. The model reads the whole thing, including the art, and processes it without the safety filter activating.
Why it outperforms other attacks
The researchers benchmarked it against GCG, AutoDAN, PAIR, and DeepInception. ArtPrompt beat all of them. It also bypassed perplexity filters (because the surrounding text is normal), paraphrase defenses (because the meaning isn’t paraphrased, just visually encoded), and retokenization (because the tokens are real characters, just arranged spatially).
The fundamental problem
Safety classifiers operate on semantic tokens. ASCII art operates visually. These are two different modalities happening in the same text stream. The model has enough world knowledge to bridge that gap, but the safety training doesn’t. It’s the same class of problem as the cipher attack, just using spatial arrangement instead of encoding.
Current status
Partially addressed, but the modality gap between text-as-tokens and text-as-visual-pattern is architectural. There’s no clean fix without fundamentally changing how safety classification works.
The paper
“ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs” by Jiang et al. (2024). ACL 2024.