Adversarial Attacks on Vision-Language Models: New Research
A cluster of research publications from the past quarter has systematically characterised the adversarial vulnerability of modern vision-language models (VLMs). The work finds that these systems are substantially more susceptible to carefully crafted image perturbations than their unimodal vision predecessors, and that adversarial images transfer between models at unexpectedly high rates: 38–74% in the reported experiments, against a typical 15–35% for classical vision classifiers.
Background: VLMs and Their Attack Surface
Vision-language models process both image and text inputs jointly. Architecturally, a visual encoder (typically a CLIP-style model) converts images into embeddings that are projected into the LLM’s token space. The LLM then processes these visual tokens alongside text.
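The data flow described above can be illustrated with a minimal sketch. It assumes a single linear projection and visual tokens placed before the text tokens; real VLMs may use an MLP or cross-attention connector and a different token ordering. The function names, sizes, and weights are hypothetical.

```python
def project(embedding, weights):
    """Linear projection: map one visual embedding into the LLM token space."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in weights]


def build_llm_input(patch_embeddings, projection, text_token_embeddings):
    """Project each visual embedding, then place the results before the text tokens."""
    visual_tokens = [project(e, projection) for e in patch_embeddings]
    return visual_tokens + text_token_embeddings


# Toy sizes (assumed): 2 image patches with 3-dim visual embeddings,
# projected into a 2-dim LLM token space.
patch_embeddings = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5]]
projection = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # hypothetical 2x3 weights
text_tokens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

sequence = build_llm_input(patch_embeddings, projection, text_tokens)
print(len(sequence), "tokens reach the LLM")
```

Because the LLM only ever sees the projected visual tokens, any change to the image that shifts those tokens changes what the LLM conditions on.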
This architecture creates a novel attack surface: perturbations applied to the image can corrupt the visual embeddings in ways that steer the LLM’s interpretation while remaining imperceptible to a human observer. Classical adversarial attacks on classifiers seek to flip a discrete label; VLM attacks, by contrast, can induce arbitrary text generation.
Key Research Findings
Attack 1: Cross-Modal Embedding Manipulation
Researchers found that gradient-based attacks on the visual encoder produce perturbations that transfer to the LLM layer with high reliability. The attack maximises the distance between the perturbed image’s embedding and the clean embedding in the visual projection space:
δ* = argmax ||E_v(x + δ) - E_v(x)||₂
subject to ||δ||∞ ≤ ε
Applied to GPT-4V-equivalent models, perturbations with ε = 8/255 (visually imperceptible) were sufficient to cause the model to:
- Misidentify objects in the scene (94% attack success rate)
- Generate attacker-chosen descriptions (71% attack success rate)
- Refuse to process the image entirely, a denial of service (88% attack success rate)
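The embedding-distance objective above can be sketched with projected gradient ascent. This is an illustration, not the researchers' code: it assumes a toy linear encoder (so the gradient can be written by hand), and the step size, iteration count, and random start are illustrative choices. A real attack would backpropagate through the actual visual encoder with an autodiff framework. The sketch covers only the untargeted objective as written, not targeted generation.

```python
import random


def encode(W, x):
    """Toy linear stand-in for the visual encoder E_v: returns W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def embedding_shift(W, x, delta):
    """L2 distance between the perturbed and clean embeddings."""
    clean = encode(W, x)
    perturbed = encode(W, [xi + di for xi, di in zip(x, delta)])
    return sum((p - c) ** 2 for p, c in zip(perturbed, clean)) ** 0.5


def sign(v):
    return (v > 0) - (v < 0)


def pgd_embedding_attack(W, x, eps=8 / 255, step=2 / 255, iters=20, seed=0):
    """Projected gradient ascent on ||E_v(x + delta) - E_v(x)||_2^2
    subject to ||delta||_inf <= eps."""
    rng = random.Random(seed)
    # Random start: the gradient is exactly zero at delta = 0.
    delta = [rng.uniform(-eps, eps) for _ in x]
    for _ in range(iters):
        # For a linear encoder the embedding difference is W @ delta,
        # so the gradient of the squared distance is 2 * W^T (W @ delta).
        diff = encode(W, delta)
        grad = [2 * sum(W[i][j] * diff[i] for i in range(len(W)))
                for j in range(len(x))]
        # Signed ascent step.
        delta = [d + step * sign(g) for d, g in zip(delta, grad)]
        # Keep x + delta inside the valid pixel range [0, 1].
        delta = [min(1.0, max(0.0, xi + d)) - xi for xi, d in zip(x, delta)]
        # Project back onto the L-infinity ball of radius eps.
        delta = [max(-eps, min(eps, d)) for d in delta]
    return delta


W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # hypothetical 2x3 encoder weights
x = [0.5, 0.5, 0.5]                      # three "pixels" in [0, 1]
delta = pgd_embedding_attack(W, x)
print("max |delta|:", max(abs(d) for d in delta))
print("embedding shift:", embedding_shift(W, x, delta))
```

The projection step is what enforces the ε constraint: however far the ascent step moves, each pixel's perturbation is clamped back to [-ε, ε].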
Attack 2: Typographic Attacks on OCR Reasoning
VLMs with OCR capabilities are vulnerable to typographic attacks: text embedded in an image can override the model’s visual interpretation of other scene elements. Researchers demonstrated this against models used in document processing workflows.
Example: An image of a stop sign with small text reading “Speed Limit 65” superimposed in the corner caused VLMs to describe the image as showing a speed limit sign in 67% of trials. The effect persisted even when the text occupied less than 2% of the image area.
This attack is particularly relevant for AI systems deployed in document verification, identity checking, or visual content moderation.
Attack 3: Universal Adversarial Patches
A single adversarial patch — a small image region that can be overlaid on any input image — was found to reliably manipulate VLM outputs regardless of the underlying scene. Universal patches are more practically dangerous than image-specific attacks because:
- They can be printed and physically placed in the real world (stickers, signs)
- They require no knowledge of the specific input image
- They can be optimised for specific target outputs
The research team demonstrated patches that caused a VLM to consistently describe any scene as “no people present” — a potential exploit against AI-based surveillance or attendance verification systems.
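The "overlaid on any input image" property can be made concrete with a short sketch of the application step only. How the patch contents are optimised is not shown, and the function name, image sizes, and pixel values are hypothetical.

```python
def apply_patch(image, patch, top, left):
    """Return a copy of `image` with `patch` pasted at (top, left).

    Both arguments are 2-D lists of pixel values (rows of columns).
    """
    height, width = len(image), len(image[0])
    p_height, p_width = len(patch), len(patch[0])
    if top < 0 or left < 0 or top + p_height > height or left + p_width > width:
        raise ValueError("patch does not fit inside the image at this position")
    patched = [row[:] for row in image]  # copy so the input is left untouched
    for r in range(p_height):
        patched[top + r][left:left + p_width] = patch[r]
    return patched


# The same patch is reused unchanged on two unrelated "scenes".
patch = [[9, 9], [9, 9]]
scene_a = [[0] * 4 for _ in range(4)]
scene_b = [[1, 2, 3, 4] for _ in range(4)]
patched_a = apply_patch(scene_a, patch, top=2, left=2)
patched_b = apply_patch(scene_b, patch, top=0, left=0)
```

Nothing in `apply_patch` depends on the scene content, which is why a universal patch needs no knowledge of the specific input image.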
Transfer Rates Between Models
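A significant finding is the high cross-model transfer rate of VLM adversarial examples. In the table below, each cell gives the share of adversarial examples crafted against the source model (row) that also succeed against the target model (column):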
| Source Model | Target: GPT-4V | Target: Gemini Pro | Target: LLaVA-1.5 |
|---|---|---|---|
| GPT-4V | — | 61% | 74% |
| Gemini Pro | 58% | — | 69% |
| LLaVA-1.5 | 52% | 47% | — |
| CLIP (vision only) | 43% | 38% | 71% |
Transfer rates for classical vision classifiers are typically 15–35%. The higher transfer rates for VLMs are attributed to shared visual encoder architectures (most VLMs use CLIP-family encoders) and the convergent visual representations that emerge from large-scale training.
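As a quick check on the comparison with classical classifiers, the short script below averages the table's transfer rates per source model and tests whether every cell exceeds the upper end of the 15–35% range. The figures are copied from the table; the variable names are illustrative.

```python
# Transfer rates (%) copied from the table: source model -> {target model: rate}.
transfer = {
    "GPT-4V": {"Gemini Pro": 61, "LLaVA-1.5": 74},
    "Gemini Pro": {"GPT-4V": 58, "LLaVA-1.5": 69},
    "LLaVA-1.5": {"GPT-4V": 52, "Gemini Pro": 47},
    "CLIP (vision only)": {"GPT-4V": 43, "Gemini Pro": 38, "LLaVA-1.5": 71},
}
CLASSICAL_RANGE = (15, 35)  # typical range quoted for classical classifiers

# Mean transfer rate for adversarial examples crafted on each source model.
mean_by_source = {src: sum(t.values()) / len(t) for src, t in transfer.items()}

# Lowest single cell in the table, compared with the classical upper bound.
lowest = min(rate for t in transfer.values() for rate in t.values())
all_above_classical = lowest > CLASSICAL_RANGE[1]

for src, mean in mean_by_source.items():
    print(f"{src}: mean transfer {mean:.1f}%")
print(f"lowest cell: {lowest}%, above classical range: {all_above_classical}")
```

Even the weakest pairing in the table (CLIP to Gemini Pro, 38%) sits above the upper end of the classical range.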
Implications for Deployed VLM Systems
Content Moderation
AI-based image moderation systems that use VLMs to detect policy-violating content (CSAM, hate speech imagery, misinformation) are susceptible to adversarial bypass. An attacker can craft an image that a human moderator would correctly flag but that the VLM classifies as benign.
Document Verification
Identity document verification systems that use VLMs to extract and validate information from passports, driving licences, and certificates are vulnerable to typographic injection, in which adversarial text embedded in a genuine document causes incorrect extraction.
Physical World Attacks
The viability of physical adversarial patches means that robotic systems, autonomous vehicles, and surveillance infrastructure using VLMs for scene understanding face real-world attack scenarios beyond digital manipulation.
Current Defensive Posture
Standard defences from classical adversarial ML (adversarial training, input randomisation, certified defences) have not yet been adapted for VLMs at scale. Several mitigations are under active research:
Input purification: Applying a diffusion model or similar generative process to “purify” input images before VLM processing can remove adversarial perturbations, at a computational cost.
Ensemble disagreement detection: Running multiple VLMs with different visual encoders and flagging inputs where they significantly disagree reduces the effectiveness of single-encoder attacks.
Vision encoder diversity: Deploying VLMs that use different backbone visual encoders (not all CLIP-based) reduces cross-model transfer.
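The ensemble disagreement check described above can be sketched as follows. The sketch assumes the outputs are short captions and uses word-set (Jaccard) overlap as a crude stand-in for semantic agreement. The metric, the 0.3 threshold, and the function names are all hypothetical; a production system would more likely compare outputs with a semantic similarity model.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between the word sets of two captions."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def flag_disagreement(captions, threshold=0.3):
    """Flag an input when any pair of model outputs overlaps less than `threshold`.

    Returns (flagged, lowest pairwise similarity).
    """
    if len(captions) < 2:
        raise ValueError("need outputs from at least two models")
    lowest = min(jaccard(a, b) for a, b in combinations(captions, 2))
    return lowest < threshold, lowest


# Hypothetical outputs from three VLMs with different visual encoders.
agreeing = [
    "a red stop sign on a street",
    "a red stop sign on a road",
    "red stop sign on a street corner",
]
split = [
    "a red stop sign on a street",
    "a red stop sign on a road",
    "speed limit 65 sign",
]

print(flag_disagreement(agreeing))
print(flag_disagreement(split))
```

Taking the minimum over all pairs means a single dissenting model is enough to flag the input, which suits the case where an attack succeeds against one encoder but not the others.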
None of these is currently deployable without significant performance or cost trade-offs. The research community’s assessment is that VLMs should not be deployed in high-stakes decisions (security screening, medical imaging, legal document processing) without human review, given the current state of adversarial robustness.