AI Security Wire

Adversarial Attacks on Vision-Language Models: New Research

An adversarial perturbation that flips a stop-sign classifier’s output is mildly concerning. An adversarial perturbation that causes a vision-language model to generate whatever description the attacker specifies (about any image, in fluent prose, with apparent confidence) is a different category of problem.

Recent research has systematically catalogued what VLMs can be made to do with carefully crafted image inputs. The findings are more alarming than the classical adversarial examples literature would suggest, largely because VLMs amplify the impact: instead of a misclassification, you get attacker-controlled free text. And the attacks transfer between models at rates well above those typically seen for classical image classifiers.

The attack surface that makes VLMs different

Classical adversarial attacks against image classifiers aim to flip a discrete label. The damage is bounded. A VLM attack is unbounded: the attacker can steer the model toward generating arbitrary text, which can mean fabricated descriptions of a scene, false safety assessments, harmful instructions presented as analysis, or denial-of-service through refusal.

The architecture is what creates this. A visual encoder (usually CLIP or a CLIP derivative) converts images to embeddings that get projected into the LLM’s token space. Perturbations applied to the image corrupt those embeddings in ways that steer what the LLM generates, completely invisibly to a human looking at the image. No visible change. Different output.

Three attacks worth understanding

Embedding manipulation: make the model say whatever you want

Gradient-based attacks on the visual encoder can be optimised to maximise the distance between a perturbed image’s embedding and its clean version:

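   δ* = argmax_δ ||E_v(x + δ) − E_v(x)||₂   subject to   ||δ||∞ ≤ ε

Here E_v is the visual encoder, x is the clean image, δ is the perturbation, and ε is the maximum change allowed to any single pixel.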
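The objective above can be sketched with projected gradient ascent. This is a minimal illustration, not the researchers' implementation: it assumes a toy linear encoder (a matrix W standing in for E_v) so the gradient can be written by hand, and the names encode, embedding_shift, and pgd_embedding_attack, along with the step size and iteration count, are hypothetical. A targeted variant would instead minimise the distance to a chosen target embedding.

```python
import math
import random


def encode(W, x):
    """Toy linear stand-in for the visual encoder E_v (an assumption)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def embedding_shift(W, x, delta):
    """L2 distance between the perturbed and the clean embedding."""
    clean = encode(W, x)
    adv = encode(W, [xi + di for xi, di in zip(x, delta)])
    return math.sqrt(sum((a - c) ** 2 for a, c in zip(adv, clean)))


def pgd_embedding_attack(W, x, eps=8 / 255, step=2 / 255, iters=20, seed=0):
    """Maximise ||E_v(x + delta) - E_v(x)||_2 subject to ||delta||_inf <= eps."""
    rng = random.Random(seed)
    # Random start: the gradient of the L2 distance is undefined at delta = 0.
    delta = [rng.uniform(-eps, eps) for _ in x]
    for _ in range(iters):
        # For a linear encoder, E_v(x + delta) - E_v(x) equals W @ delta.
        diff = encode(W, delta)
        norm = math.sqrt(sum(d * d for d in diff)) or 1.0
        # Gradient of ||W delta||_2 with respect to delta: W^T diff / ||diff||.
        grad = [
            sum(W[k][i] * diff[k] for k in range(len(W))) / norm
            for i in range(len(x))
        ]
        for i, g in enumerate(grad):
            stepped = delta[i] + step * ((g > 0) - (g < 0))  # signed ascent step
            stepped = max(-eps, min(eps, stepped))  # project onto the L-inf ball
            # Keep the perturbed pixel inside the valid [0, 1] range.
            delta[i] = max(0.0, min(1.0, x[i] + stepped)) - x[i]
    return delta
```

Against a real VLM the hand-written gradient would be replaced by automatic differentiation through the actual encoder; the projection steps stay the same.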

With ε = 8/255 (visually imperceptible), researchers achieved:

  • Object misidentification: 94% success
  • Attacker-specified fabricated descriptions: 71%
  • Denial of service (model refuses to process): 88%

The fabricated descriptions result is the one to focus on. A VLM that can be made to report “no weapons visible” on an image containing weapons, or “document appears legitimate” about a forgery, is a genuine problem for any system relying on VLM outputs for consequential decisions.

Typographic attacks: small text overrides everything else

Text embedded in an image can override a VLM’s interpretation of the entire scene. Researchers tested a stop sign with a small “Speed Limit 65” label superimposed in the corner, occupying less than 2% of the image. In 67% of trials, VLMs described the image as showing a speed limit sign.

The practical implication for document processing is obvious. Identity documents, certificates, invoices: any document processed through a VLM-based extraction pipeline is potentially vulnerable to small embedded text that redirects what the model reads from the document.

Universal adversarial patches: one patch works on any image

This is the most practically dangerous variant. A single adversarial patch (a small image region) can be overlaid on any input and reliably steer VLM output regardless of what the underlying scene contains. Unlike image-specific attacks, the patch requires no knowledge of what image it will be placed on. It can be:

  • Printed and physically placed in an environment
  • Applied as a sticker or sign
  • Embedded in any submitted image

The research team built a patch that caused VLMs to consistently describe any scene as “no people present.” If you’re running AI-based surveillance, attendance verification, or crowd monitoring on VLM outputs, that’s worth sitting with for a moment.
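To make the idea of a single patch optimised across many images concrete, here is a toy sketch. It is not the research team's method: images are flat lists of pixel values, the encoder is again an assumed linear map W, the target is an arbitrary embedding the attacker wants every patched image to land near, and all function names are hypothetical. The key point it illustrates is that the gradient is averaged over the whole image set, so the patch is not tuned to any one scene.

```python
import random


def encode(W, x):
    """Toy linear stand-in for a visual encoder (an assumption)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def apply_patch(x, patch, start):
    """Overwrite a contiguous pixel region of the image with the patch."""
    out = list(x)
    out[start:start + len(patch)] = patch
    return out


def mean_target_loss(W, images, patch, start, target):
    """Mean squared distance between patched-image embeddings and the target."""
    total = 0.0
    for x in images:
        emb = encode(W, apply_patch(x, patch, start))
        total += sum((e - t) ** 2 for e, t in zip(emb, target))
    return total / len(images)


def train_universal_patch(W, images, target, patch_len, start=0, epochs=200):
    """Projected gradient descent on one patch shared by every image."""
    patch = [0.5] * patch_len
    # Conservative step size: 1 / (upper bound on the gradient's Lipschitz constant).
    bound = 2.0 * sum(W[k][start + j] ** 2
                      for k in range(len(W)) for j in range(patch_len))
    if bound == 0.0:
        return patch  # the patch region has no influence on the embedding
    lr = 1.0 / bound
    for _ in range(epochs):
        grad = [0.0] * patch_len
        for x in images:
            emb = encode(W, apply_patch(x, patch, start))
            resid = [e - t for e, t in zip(emb, target)]
            for j in range(patch_len):
                grad[j] += 2.0 * sum(W[k][start + j] * resid[k]
                                     for k in range(len(W)))
        # Average over images, step, then clip to the valid pixel range.
        patch = [min(1.0, max(0.0, p - lr * g / len(images)))
                 for p, g in zip(patch, grad)]
    return patch
```

How well such a patch generalises to unseen scenes depends on the encoder and the training images; this sketch only shows the optimisation loop, not the reported results.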

Why attacks transfer between models better than expected

Source model         Target: GPT-4V   Target: Gemini Pro   Target: LLaVA-1.5
GPT-4V               n/a              61%                  74%
Gemini Pro           58%              n/a                  69%
LLaVA-1.5            52%              47%                  n/a
CLIP (vision only)   43%              38%                  71%

Classical vision classifiers typically transfer at 15–35%. VLMs are transferring at 43–74%. The reason is architectural convergence: most VLMs use CLIP-family encoders, so attacks crafted against one model’s visual encoder have a good chance of working against another’s. You don’t need the target model; a surrogate will do.
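The article does not say exactly how these percentages were computed. One common convention, shown here as an assumption rather than as the researchers' definition, is to take the adversarial examples that succeed on the source model and report the share that also succeed on the target. The function name and input format are hypothetical.

```python
def transfer_rate(fools_source, fools_target):
    """Share of examples that fool the source model and also fool the target.

    Both arguments are equal-length lists of booleans, one entry per
    adversarial example (assumed bookkeeping, not from the article).
    """
    if len(fools_source) != len(fools_target):
        raise ValueError("outcome lists must have the same length")
    # Only examples that worked on the source count toward the denominator.
    on_target = [t for s, t in zip(fools_source, fools_target) if s]
    if not on_target:
        return 0.0
    return sum(on_target) / len(on_target)
```

For example, if three of four crafted examples fool the source model and two of those three also fool the target, the transfer rate under this convention is two thirds.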

The deployed systems that should be worried

Content moderation: any platform using VLMs to detect policy-violating imagery is potentially exposed to adversarial bypass. An attacker can craft images that pass the VLM check even though a human reviewer would flag them immediately.

Document verification: identity documents, passports, driving licences processed through VLM extraction are vulnerable to typographic injection. A small embedded text element redirects what the model extracts.

Physical infrastructure: surveillance cameras, access control systems, robotics using VLMs for scene understanding face patch attacks that don’t require any digital access to the system. Print a sticker, place it in the field of view.

Where defences stand (not far enough)

Adversarial training, input randomisation, and certified defences from classical adversarial ML have not been adapted for VLMs at scale. What’s in active research:

Input purification: running images through a diffusion model before VLM processing can strip adversarial perturbations. Computational cost is non-trivial and it’s not robust against all attack variants.

Ensemble disagreement detection: running multiple VLMs with different visual encoders and flagging disagreements catches single-encoder attacks but adds latency and cost.
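A rough sketch of the disagreement check, under loud assumptions: each model's output is already available as a string, agreement is measured with a crude word-overlap (Jaccard) score rather than any semantic comparison, and the 0.3 threshold and function names are illustrative only. A production system would likely need a stronger similarity measure.

```python
import re


def tokens(text):
    """Lowercased word set; a deliberately crude proxy for meaning."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two descriptions, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def flag_disagreement(descriptions, threshold=0.3):
    """Flag an image when any pair of model outputs overlaps too little.

    descriptions maps a (hypothetical) model name to its output text.
    Returns (flagged, lowest_pairwise_similarity).
    """
    names = sorted(descriptions)
    sims = [
        jaccard(descriptions[a], descriptions[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    ]
    lowest = min(sims) if sims else 1.0
    return lowest < threshold, lowest
```

Flagged images would then be routed to human review rather than rejected outright, since disagreement can also come from ordinary model differences.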

Visual encoder diversity: deliberately using non-CLIP-based encoders in at least some deployments reduces cross-model transfer. In practice, most teams don’t have this choice today.

None of these is production-ready without significant trade-offs. The honest state-of-play: VLMs should not be trusted for high-stakes decisions (security screening, medical imaging, legal document processing, anything where a wrong output has real consequences) without human review. That’s not a counsel of despair; it’s an accurate description of where the robustness research is. Deploying as if the robustness exists is what creates risk.

Frequently Asked Questions

Why are adversarial attacks on vision-language models more dangerous than on classical image classifiers?
VLM adversarial attacks can induce arbitrary text generation rather than just flipping a discrete class label, enabling an attacker to make the model produce fabricated descriptions, false safety assessments, or attacker-specified content. VLMs also show significantly higher cross-model transfer rates (43–74%) than classical vision classifiers (15–35%), allowing attacks crafted on one model to work against others.

What are universal adversarial patches and what real-world threats do they create?
Universal adversarial patches are small image regions that can be overlaid on any input image and reliably manipulate VLM outputs regardless of the underlying scene. They can be printed as physical stickers or signs and require no knowledge of the specific image being attacked. Demonstrated attacks include patches causing a VLM to consistently report “no people present” in any scene, a potential exploit against AI-based surveillance or attendance systems.

Which deployed VLM applications face the most significant adversarial risk?
The highest-risk applications are content moderation systems using VLMs to detect policy-violating imagery (where adversarial images can defeat AI-based detection), identity document verification systems vulnerable to typographic injection, and physical-world systems such as robotics or surveillance using VLMs for scene understanding, which face real-world patch attacks beyond digital manipulation.