Cisco’s AI Threat Intelligence and Security Research team has released a new study detailing vulnerabilities in vision-language models (VLMs), AI systems that interpret images alongside text. The research demonstrates that attackers can manipulate these models by embedding malicious instructions within images in a form that is indecipherable to humans but readable by the model. This builds on earlier findings that linked visual distortion to the success of such attacks, and it reveals that adjusting poorly readable images can make the attacks more effective. Specifically, the latest phase of Cisco’s research showed that imperceptible perturbations could make blurred images readable to the model again, enabling harmful commands to circumvent safety filters. This highlights the urgent need for more robust defenses against adversarial inputs in AI systems.
Cisco: Cisco is a global technology company focused on networking, security, and enterprise infrastructure, with a large research arm dedicated to cybersecurity and AI threat analysis. In this story, Cisco’s AI Threat Intelligence and Security Research team led a multi-phase study demonstrating how vision-language models can be systematically attacked using imperceptible image perturbations, highlighting new risks for AI-powered systems and their defenses.
Claude: Claude is an AI assistant developed by Anthropic that emphasizes safer, more steerable behavior using constitutional AI techniques and multimodal capabilities, including vision-language understanding in newer versions. In the study, Claude was evaluated as a target system to see how adversarially optimized images could recover hidden instructions or erode safety refusals, revealing both its susceptibility to certain vision attacks and the robustness of its safety filters in catching many newly readable prompts.
GPT-4o: GPT-4o is OpenAI’s multimodal flagship model capable of processing text, images, and other inputs with improved latency and alignment compared to prior generations. In the reported experiments, GPT-4o was tested as a target for transferred adversarial images. It recognized more of the newly readable malicious content yet still blocked many of the harmful instructions, indicating relatively strong safety alignment.
JinaCLIP v2: JinaCLIP v2 is a vision-language embedding model released by Jina AI that builds on CLIP-style architectures to provide stronger, more robust multimodal representations for retrieval and understanding tasks. In this work, it was one of the open models used to optimize adversarial perturbations, demonstrating how attacks tuned on accessible embeddings can generalize to proprietary AI systems.
SigLIP SO400M: SigLIP SO400M is a large-scale vision-language model based on the SigLIP architecture, which replaces traditional softmax losses with a sigmoid-based objective to improve contrastive image-text training. The Cisco study treated SigLIP SO400M as another public embedding target for crafting perturbations, helping show that adversarial optimization in its feature space can induce hidden readability and safety-refusal failures in downstream VLMs.
Qwen3-VL-Embedding: Qwen3-VL-Embedding is a vision-language embedding model from the Qwen family that encodes images and text into a joint semantic space to support multimodal applications. Cisco’s researchers used Qwen3-VL-Embedding as one of the optimization targets for generating bounded pixel-level perturbations, showing that adversarial examples crafted in its representation space can transfer to closed models like GPT-4o and Claude.
OpenAI CLIP ViT-L/14-336: OpenAI CLIP ViT-L/14-336 is a widely used vision-language embedding model that maps images and text into a shared representation space, enabling tasks like zero-shot image classification and retrieval. In this research, it served as one of the open-source embedding models that the researchers optimized against to craft perturbed images, with those adversarial examples then transferred to more restricted systems like GPT-4o and Claude.
AIProductHardening: Major AI labs and cloud providers have begun updating their safety and content-filtering pipelines for multimodal models to account for non-obvious visual perturbations, emphasizing internal representation checks in addition to surface-level image scanning.
EmbeddingSpaceAttacks: Multiple academic and industry groups have highlighted that optimizing attacks in embedding space, rather than directly probing a target model, can reliably transfer adversarial prompts across different vision-language architectures.
VisionLanguageSecurity: Recent security research has increasingly focused on typographic and image-based jailbreaks for multimodal models, showing that instructions hidden in seemingly benign visuals can bypass traditional text filters.
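
To make the idea of a bounded, embedding-space perturbation concrete, here is a minimal sketch. It is not Cisco's method or code. It assumes a toy random linear map `W` standing in for a real embedding model, a random vector standing in for the target text embedding, and a hypothetical pixel budget `eps` of 8/255. All names here are illustrative. The sketch takes signed gradient steps that raise the cosine similarity between the image embedding and the target, while clipping every pixel change to the budget.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_grad_wrt_input(W, x, target):
    """Gradient of cosine(W @ x, target) with respect to x (linear toy encoder)."""
    z = W @ x
    nz, nt = np.linalg.norm(z), np.linalg.norm(target)
    dcos_dz = target / (nz * nt) - (z @ target) * z / (nz**3 * nt)
    return W.T @ dcos_dz

def bounded_perturbation(W, image, target, eps=8 / 255, steps=40):
    """Signed-gradient ascent on similarity, projected onto an L-infinity ball."""
    alpha = eps / 10  # step size (an arbitrary choice for this sketch)
    delta = np.zeros_like(image)
    for _ in range(steps):
        grad = cosine_grad_wrt_input(W, image + delta, target)
        # Step uphill, then keep every pixel change within [-eps, eps].
        delta = np.clip(delta + alpha * np.sign(grad), -eps, eps)
        # Keep the perturbed image inside the valid pixel range [0, 1].
        delta = np.clip(image + delta, 0.0, 1.0) - image
    return delta

# Toy data: all of this is hypothetical stand-in material, not a real model.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))            # stand-in "encoder": 64 pixels -> 16 dims
image = rng.uniform(0.2, 0.8, size=64)   # stand-in flattened image
target = rng.normal(size=16)             # stand-in target text embedding

delta = bounded_perturbation(W, image, target)
before = cosine(W @ image, target)
after = cosine(W @ (image + delta), target)
print(f"max |delta| = {np.abs(delta).max():.4f}, cosine {before:.3f} -> {after:.3f}")
```

The budget is what keeps such changes hard for a person to notice, and it is also why a defense that only scans the visible surface of an image can miss them.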
