Skip to content
AI Security Wire

Published

- 5 min read

By

Jailbreaking Multimodal Models via Image-Encoded Instructions

img of Jailbreaking Multimodal Models via Image-Encoded Instructions

Take the exact prompt that gets refused when you type it. Put it in a screenshot. Submit the screenshot. Many multimodal models will answer it.

That’s the short version. The longer version is that this isn’t a fringe edge case in one obscure model; systematic testing across twelve production-grade multimodal systems found the same gap in all of them, with varying degrees of severity. The modality boundary between text and image is, in most current deployments, also a safety filter boundary.

Why the architecture creates this gap

Multimodal models process text and images through separate pipelines that converge at the language model layer. Safety filtering is applied at the text input stage. Sometimes at the output stage. Almost never to what comes out of the visual encoder.

When you submit a harmful text instruction directly, it hits the filter. When that same instruction arrives embedded in an image, the visual encoder converts it to embeddings and injects them into the model’s reasoning context. The filter never sees it.

   Text input:  [Safety filter applied] → LLM
Image input: [No text filter] → Visual encoder → Embeddings → LLM

This isn’t a bug in the traditional sense. It’s a consequence of bolting multimodal capabilities onto systems whose safety infrastructure was built for text. The gap was always there; it just needed measuring.

Four ways to exploit it

The naive version still works

Screenshot your prompt. Submit it. That’s the whole attack for the simplest variant.

  • GPT-4V class models: 71% jailbreak success rate against instructions that fail as direct text
  • Gemini Pro Vision: 64% success
  • Open-source VLMs (LLaVA, InternVL): 83–91% success

The gap between open-source and frontier model numbers reflects additional image content moderation that commercial providers layer on top. Worth noting: that moderation is not the same as the text safety pipeline and it’s easier to evade.

Low-opacity text: invisible to humans, visible to models

Researchers encoded instructions as text at 5–15% opacity, essentially watermarked into a benign-looking image. Humans reviewing the image see nothing suspicious. The model reads it fine.

  • Attack success at 5% opacity: 58% on GPT-4V class models
  • Attack success at 10% opacity: 74%
  • Human detection rate in user studies: under 3%

For platforms that accept user-submitted images at scale (social platforms, document workflows, customer support tools with image upload) manual review is not a viable control. This variant makes that worse.

Adversarial noise encoding

The technically demanding variant: encode instructions as adversarial noise patterns that the visual encoder interprets as text while appearing to human eyes as random noise or abstract texture. Requires either white-box access to the visual encoder or a surrogate model with decent transfer rates.

   # Conceptual attack loop
perturbation = torch.zeros_like(image)
target_embedding = clip_encode("IGNORE PREVIOUS INSTRUCTIONS: ...")

for step in range(attack_steps):
    perturbed = image + perturbation
    embedding = visual_encoder(perturbed)
    loss = cosine_distance(embedding, target_embedding)
    perturbation -= lr * gradient(loss, perturbation)
    perturbation = project_onto_lp_ball(perturbation, epsilon=16/255)

Given the documented cross-model transfer rates for VLM adversarial examples, surrogate-based attacks are feasible without direct access to the target model’s encoder.

Image primes, text completes

The combined attack: use an image to establish a malicious context, then follow with a text prompt that references it. A fictional “system update” notification in the image followed by a text message referencing the same scenario outperforms either vector alone.

  • Text jailbreak alone: 12% success
  • Image jailbreak alone: 71%
  • Combined (image primes, text completes): 89%

The combination works because the image shapes the model’s framing of the conversation before the text prompt arrives. Security teams testing text-only jailbreak resistance are missing this entirely.

The alignment assumption that’s wrong

Current multimodal safety alignment largely operates on the assumption: train the model to refuse harmful text requests and it’ll refuse harmful requests generally. This research is a direct counter-example.

Twelve models tested. Every single one showed materially higher jailbreak rates via image encoding than via equivalent text. Cross-modal alignment (where the model refuses based on semantic content regardless of which modality it arrived through) isn’t deployed at scale anywhere.

That’s a genuine gap in how safety alignment is being evaluated. If your red team exercise only tests text inputs against a multimodal model, the evaluation is incomplete.

What to do about it

Model providers need to apply OCR-based text extraction to image inputs before safety screening, train explicitly on image-encoded adversarial examples, and build cross-modal consistency checks. Checking whether the model’s response to an image matches what it would produce for the extracted text is a basic sanity check that most aren’t running.

Application developers accepting user-submitted images should apply text extraction and content moderation before those images reach the model. Treat multimodal inputs as less trustworthy than text-only inputs for anything safety-sensitive. Output-layer monitoring catches what slips through.

Red teams need image-encoded attack variants in every multimodal evaluation: both plaintext screenshots and low-opacity versions. The gap between text-only and image-encoded jailbreak rates is a meaningful benchmark for measuring multimodal safety posture.

References

Frequently Asked Questions

Why does encoding instructions as text in an image bypass LLM safety filters?
Current multimodal models apply safety filtering at the text input stage and output stage, but not to the intermediate representation of image content processed by the visual encoder. Text instructions embedded in an image are converted to embeddings by the visual encoder and injected directly into the model's reasoning context, bypassing the text-input content moderation pipeline entirely.
What is steganographic image injection and why is it especially dangerous?
Steganographic image injection encodes adversarial instructions as low-opacity text (5–15% opacity) that is imperceptible to human reviewers but readable by the model's visual processing. Research shows attack success rates of 58–74% on frontier-class models with human detection rates under 3%, making it viable for large platforms where manual image review is impractical.
What should application developers do to mitigate multimodal jailbreaking risks?
Application developers should apply OCR-based text extraction to all user-submitted images before passing them to the model, then run the extracted text through the same content moderation pipeline as direct text inputs. For safety-critical applications, multimodal models should be treated as less trustworthy than text-only equivalents and supplemented with output-layer monitoring.