Researchers demonstrate that safety-aligned multimodal LLMs can be reliably jailbroken by rendering adversarial instructions as text inside images, which bypasses text-layer safety filters because image content is not routed through the same moderation pipeline.