By Allan D - Editor, AI Security Wire
Adversarial Attacks on Vision-Language Models: New Research
An adversarial perturbation that flips a stop-sign classifier’s output is mildly concerning. An adversarial perturbation that causes a vision-language model to generate whatever description the attacker specifies (about any image, in fluent prose, with apparent confidence) is a different category of problem.
Recent research has systematically catalogued what VLMs can be made to do with carefully crafted image inputs. The findings are more alarming than the classical adversarial-examples literature would suggest, largely because VLMs amplify the impact: instead of a misclassification, you get attacker-controlled free text. The attacks also transfer between models at much higher rates than attacks on classical image classifiers do.
The attack surface that makes VLMs different
Classical adversarial attacks against image classifiers aim to flip a discrete label. The damage is bounded. A VLM attack is unbounded: the attacker can steer the model toward generating arbitrary text, which can mean fabricated descriptions of a scene, false safety assessments, harmful instructions presented as analysis, or denial-of-service through refusal.
The architecture is what creates this. A visual encoder (usually CLIP or a CLIP derivative) converts images to embeddings that get projected into the LLM’s token-embedding space. Perturbations applied to the image corrupt those embeddings in ways that steer what the LLM generates, while staying invisible to a human looking at the image. No visible change, different output.
Three attacks worth understanding
Embedding manipulation: make the model say whatever you want
Gradient-based attacks on the visual encoder E_v can be optimised to maximise the distance between the embedding of a perturbed image x + δ and that of the clean image x, with the perturbation δ held within a budget ε:
δ* = argmax ||E_v(x + δ) - E_v(x)||₂
subject to ||δ||∞ ≤ ε
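The objective above can be sketched as projected gradient ascent. This is a minimal illustration, not the researchers' implementation: `toy_encoder` is a hypothetical single tanh layer standing in for E_v, and the step size, iteration count, and random start are assumptions chosen for the example. Maximising the squared distance is used here because it has the same maximiser as the L2 distance and a simpler gradient.

```python
import numpy as np

def toy_encoder(x, W):
    """Hypothetical stand-in for E_v: one tanh layer, not a real CLIP encoder."""
    return np.tanh(W @ x)

def embedding_distance_attack(x, W, eps=8 / 255, step=2 / 255, iters=20, seed=0):
    """Projected gradient ascent on ||E_v(x + delta) - E_v(x)||_2^2, ||delta||_inf <= eps."""
    rng = np.random.default_rng(seed)
    clean = toy_encoder(x, W)
    # Random start: the gradient is exactly zero at delta = 0.
    delta = rng.uniform(-eps, eps, size=x.shape)
    delta = np.clip(x + delta, 0.0, 1.0) - x  # keep pixel values in [0, 1]
    for _ in range(iters):
        emb = toy_encoder(x + delta, W)
        # Analytic gradient of the squared distance for the tanh layer.
        grad = 2.0 * W.T @ ((1.0 - emb ** 2) * (emb - clean))
        # Signed ascent step, then projection back onto the eps-ball.
        delta = np.clip(delta + step * np.sign(grad), -eps, eps)
        delta = np.clip(x + delta, 0.0, 1.0) - x
    return delta

# Usage on synthetic data (a flattened 64-pixel "image", 16-dim embedding).
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=64)
W = rng.normal(size=(16, 64)) / 8.0
delta = embedding_distance_attack(x, W)
shift = np.linalg.norm(toy_encoder(x + delta, W) - toy_encoder(x, W))
print(f"max |delta| = {np.abs(delta).max():.4f}, embedding shift = {shift:.4f}")
```

Against a real VLM the gradient would come from automatic differentiation through the actual encoder rather than a hand-derived formula; the projection step is what keeps the perturbation inside the ε budget.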
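The objective above is untargeted: it only pushes the embedding away from the clean one. Steering the model toward a specific output would instead mean minimising the distance to an attacker-chosen target embedding. With ε = 8/255 (visually imperceptible), researchers achieved: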
- Object misidentification: 94% success
- Attacker-specified fabricated descriptions: 71%
- Denial of service (model refuses to process): 88%
The fabricated descriptions result is the one to focus on. A VLM that can be made to report “no weapons visible” on an image containing weapons, or “document appears legitimate” about a forgery, is a genuine problem for any system relying on VLM outputs for consequential decisions.
Typographic attacks: small text overrides everything else
Text embedded in an image can override a VLM’s interpretation of the entire scene. Researchers tested a stop sign with a small “Speed Limit 65” label superimposed in the corner, occupying less than 2% of the image. In 67% of trials, VLMs described the image as showing a speed limit sign.
The practical implication for document processing is obvious. Identity documents, certificates, invoices: any document processed through a VLM-based extraction pipeline is potentially vulnerable to small embedded text that redirects what the model reads from the document.
Universal adversarial patches: work on any image
This is the most practically dangerous variant. A single adversarial patch (a small image region) can be overlaid on any input and reliably steer VLM output regardless of what the underlying scene contains. Unlike image-specific attacks, the patch requires no knowledge of what image it will be placed on. It can be:
- Printed and physically placed in an environment
- Applied as a sticker or sign
- Embedded in any submitted image
The research team built a patch that caused VLMs to consistently describe any scene as “no people present.” If you’re running AI-based surveillance, attendance verification, or crowd monitoring on VLM outputs, that’s worth sitting with for a moment.
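To make "overlaid on any input" concrete, here is a minimal sketch of the application step only. The patch contents below are random noise, not an optimised adversarial patch, and `apply_patch` and `patch_area_fraction` are hypothetical helper names introduced for illustration. The point is that the same array is pasted onto unrelated images without reading any of their pixels.

```python
import numpy as np

def apply_patch(image, patch, top, left):
    """Return a copy of `image` with `patch` pasted at (top, left)."""
    h, w = patch.shape[:2]
    if top < 0 or left < 0 or top + h > image.shape[0] or left + w > image.shape[1]:
        raise ValueError("patch does not fit inside the image at that position")
    out = image.copy()  # leave the caller's image untouched
    out[top:top + h, left:left + w] = patch
    return out

def patch_area_fraction(image, patch):
    """Fraction of the image area covered by the patch."""
    return (patch.shape[0] * patch.shape[1]) / (image.shape[0] * image.shape[1])

# Usage: one patch, two unrelated synthetic 224x224 RGB "scenes".
rng = np.random.default_rng(0)
patch = rng.uniform(0.0, 1.0, size=(32, 32, 3))  # placeholder, not optimised
scene_a = np.zeros((224, 224, 3))
scene_b = rng.uniform(0.0, 1.0, size=(224, 224, 3))
patched_a = apply_patch(scene_a, patch, top=0, left=192)
patched_b = apply_patch(scene_b, patch, top=0, left=192)
print(f"patch covers {patch_area_fraction(scene_a, patch):.1%} of the image")
```

A robustness test harness can use the same overlay step to check how a deployed model's output changes when a candidate patch is present in otherwise benign images.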
Why attacks transfer between models better than expected
| Source Model | Target: GPT-4V | Target: Gemini Pro | Target: LLaVA-1.5 |
|---|---|---|---|
| GPT-4V | — | 61% | 74% |
| Gemini Pro | 58% | — | 69% |
| LLaVA-1.5 | 52% | 47% | — |
| CLIP (vision only) | 43% | 38% | 71% |
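Each cell gives the success rate when an attack crafted against the source model (row) is run against the target model (column). Attacks on classical vision classifiers typically transfer at 15–35%. The rates in the table above run from 38% to 74%, and VLM-to-VLM transfer alone sits at 47–74%. The reason is architectural convergence: most VLMs use CLIP-family encoders, so attacks crafted against one model’s visual encoder have a good chance of working against another’s. You don’t need the target model; a surrogate will do.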
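The table's figures can be summarised with a few lines of Python. The numbers are copied from the table as published; the dictionary layout and the `summarise` and `mean_by_target` helpers are illustrative choices, not part of the research.

```python
# Transfer success rates (%) from the table: source model -> target model -> rate.
TRANSFER = {
    "GPT-4V": {"Gemini Pro": 61, "LLaVA-1.5": 74},
    "Gemini Pro": {"GPT-4V": 58, "LLaVA-1.5": 69},
    "LLaVA-1.5": {"GPT-4V": 52, "Gemini Pro": 47},
    "CLIP (vision only)": {"GPT-4V": 43, "Gemini Pro": 38, "LLaVA-1.5": 71},
}

def summarise(table):
    """Return (min, max, mean) over every source/target pair."""
    rates = [r for targets in table.values() for r in targets.values()]
    return min(rates), max(rates), sum(rates) / len(rates)

def mean_by_target(table):
    """Average transfer rate into each target model, across all sources."""
    by_target = {}
    for targets in table.values():
        for target, rate in targets.items():
            by_target.setdefault(target, []).append(rate)
    return {t: sum(v) / len(v) for t, v in by_target.items()}

low, high, mean = summarise(TRANSFER)
print(f"overall: {low}-{high}%, mean {mean:.1f}%")
for target, avg in sorted(mean_by_target(TRANSFER).items(), key=lambda kv: -kv[1]):
    print(f"into {target}: {avg:.1f}%")
```

Grouping by target rather than by source shows which model is easiest to hit with a surrogate-crafted attack, which is the number a defender cares about.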
The deployed systems that should be worried
Content moderation: any platform using VLMs to detect policy-violating imagery is potentially exposed to adversarial bypass. An attacker can craft images that pass the VLM check even though a human would immediately flag them.
Document verification: identity documents, passports, and driving licences processed through VLM extraction are vulnerable to typographic injection. A small embedded text element can redirect what the model extracts.
Physical infrastructure: surveillance cameras, access control systems, robotics using VLMs for scene understanding face patch attacks that don’t require any digital access to the system. Print a sticker, place it in the field of view.
Where defences stand (not far enough)
Adversarial training, input randomisation, and certified defences from classical adversarial ML have not been adapted for VLMs at scale. What’s in active research:
Input purification: running images through a diffusion model before VLM processing can strip adversarial perturbations. Computational cost is non-trivial and it’s not robust against all attack variants.
Ensemble disagreement detection: running multiple VLMs with different visual encoders and flagging disagreements catches single-encoder attacks but adds latency and cost.
Visual encoder diversity: deliberately using non-CLIP-based encoders in at least some deployments reduces cross-model transfer. In practice, most teams don’t have this choice today.
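The ensemble disagreement idea described above can be sketched as follows. This is a rough illustration under stated assumptions: the model names are placeholders, the captions are supplied as plain strings rather than fetched from any real API, and token-overlap (Jaccard) similarity with a 0.5 threshold is a crude stand-in for whatever semantic comparison a production system would use.

```python
import re
from itertools import combinations

def _tokens(text):
    """Lowercased alphanumeric word set of a caption."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Token-set overlap between two captions, in [0, 1]."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def flag_disagreement(outputs, threshold=0.5):
    """outputs: dict of model name -> caption. Returns (flagged, lowest similarity)."""
    if len(outputs) < 2:
        raise ValueError("need at least two model outputs to compare")
    lowest = min(jaccard(a, b) for a, b in combinations(outputs.values(), 2))
    return lowest < threshold, lowest

# Usage with hypothetical captions from three models with different encoders.
captions = {
    "model_a": "a red stop sign at an intersection",
    "model_b": "a stop sign at a street intersection",
    "model_c": "a speed limit sign showing 65",
}
flagged, lowest = flag_disagreement(captions)
if flagged:
    print(f"disagreement (min similarity {lowest:.2f}): route to human review")
```

The check only helps when the ensemble members fail differently, which is why it pairs naturally with encoder diversity; three models sharing one encoder may agree on the same wrong answer.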
None of these is production-ready without significant trade-offs. The honest state-of-play: VLMs should not be trusted for high-stakes decisions (security screening, medical imaging, legal document processing, anything where a wrong output has real consequences) without human review. That’s not a counsel of despair; it’s an accurate description of where the robustness research is. Deploying as if the robustness exists is what creates risk.
Frequently Asked Questions
- Why are adversarial attacks on vision-language models more dangerous than on classical image classifiers?
- VLM adversarial attacks can induce arbitrary text generation rather than just flipping a discrete class label, enabling an attacker to make the model produce fabricated descriptions, false safety assessments, or attacker-specified content. VLMs also show significantly higher cross-model transfer rates (38–74% in the results reported here) than classical vision classifiers (15–35%), allowing attacks crafted on one model to work against others.
- What are universal adversarial patches and what real-world threats do they create?
- Universal adversarial patches are small image regions that can be overlaid on any input image and reliably manipulate VLM outputs regardless of the underlying scene. They can be printed as physical stickers or signs and require no knowledge of the specific image being attacked. Demonstrated attacks include patches causing a VLM to consistently report 'no people present' in any scene, a potential exploit against AI-based surveillance or attendance systems.
- Which deployed VLM applications face the most significant adversarial risk?
- The highest-risk applications are content moderation systems using VLMs to detect policy-violating imagery (where adversarial images can defeat AI-based detection), identity document verification systems vulnerable to typographic injection, and physical-world systems such as robotics or surveillance using VLMs for scene understanding, which face real-world patch attacks beyond digital manipulation.