Cisco’s AI Threat Intelligence and Security Research team has released a study showing that vision-language models (VLMs) can be exploited through imperceptible image modifications. According to the research, attackers could embed malicious instructions directly into images, such as webpage banners or document previews, where they appear as visual noise to humans but can still be interpreted by AI systems. The study builds on earlier findings that identified a measurable link between image distortion and the effectiveness of such attacks. As the AI industry responds to these threats, major providers are testing their models against adversarial inputs and collaborating to enhance safety mechanisms. The findings highlight the importance of developing robust defenses in the representation space of AI systems rather than relying solely on standard content filtering.
Cisco: Cisco is a global technology company best known for its networking, security, and enterprise infrastructure products, and it also operates a large threat intelligence organization focused on emerging cyber risks. In this story, Cisco’s researchers conducted a multi-phase study showing how adversaries can manipulate vision-language models with imperceptible image perturbations, highlighting new classes of AI security vulnerabilities.
Claude: Claude is Anthropic’s multimodal AI assistant designed with an emphasis on safety and reliability, capable of interpreting both text and images. According to the research highlighted here, Claude’s vision capabilities became more susceptible to typographic attacks on heavily blurred images after optimization, though its safety mechanisms still filtered a notable portion of the newly recovered harmful content.
GPT-4o: GPT-4o is OpenAI’s multimodal flagship model capable of processing and generating text, images, and other modalities, and it incorporates safety systems to filter harmful or policy-violating outputs. In Cisco’s tests, GPT-4o’s safety alignment generally held up better under optimized attacks, with its filters continuing to block most newly readable malicious instructions even after perturbations made embedded text more legible to the model.
JinaCLIP v2: JinaCLIP v2 is an updated contrastive vision-language model developed by Jina AI, intended to provide robust image-text embeddings for applications like search, retrieval, and content understanding. In the reported study, it served as one of the openly available embedding models that attackers could optimize against to generate image perturbations that later affected downstream proprietary vision-language models.
SigLIP SO400M: SigLIP SO400M is a vision-language model variant that replaces the traditional softmax loss with a sigmoid-based objective, improving image-text matching performance in embedding tasks. In the study, it was one of the open embedding models attackers leveraged to compute perturbations that enhanced the success of hidden-instruction attacks when those perturbed images were later fed to proprietary vision-language agents.
Qwen3-VL-Embedding: Qwen3-VL-Embedding is an open-source vision-language embedding model from Alibaba’s Qwen project, designed to encode images and text into a shared representation space for tasks like retrieval and understanding. In Cisco’s experiments, attackers optimized perturbations against Qwen3-VL-Embedding as one of several surrogate models, then transferred those perturbations to influence the behavior of proprietary systems such as GPT-4o and Claude.
OpenAI CLIP ViT-L/14-336: OpenAI CLIP ViT-L/14-336 is a large vision transformer-based version of CLIP that maps images and text into a shared embedding space and is widely used as a backbone for multimodal applications. Cisco’s researchers used this model as a surrogate target for optimization, showing that perturbations crafted in its representation space could transfer to other systems and enable typographic attacks that humans cannot visually detect.
AI Threat Intelligence and Security Research team: Cisco’s AI Threat Intelligence and Security Research team is a specialized group that studies how artificial intelligence systems can be attacked and how to defend them, with a focus on real-world threat models for enterprises. In this story, the team published research demonstrating that carefully optimized pixel-level changes can cause vision-language models to recover hidden text and sometimes bypass safety refusals without humans noticing any change.
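The surrogate-model entries above describe one shared pattern: a small perturbation is optimized against an open embedding model and then reused against a different system. A minimal sketch of that optimization step follows, under heavy assumptions. It is not the method used in the study: this summary does not give the optimizer or the perturbation budget, so the sketch swaps in a single FGSM-style sign-gradient step under an L-infinity budget, and a random linear map stands in for a real encoder such as CLIP. All names (`encode`, `perturb`, `EPS`) and values are hypothetical.

```python
import random


def encode(weights, pixels):
    """Toy linear 'image encoder' (assumption): one output per weight row."""
    return [sum(w * p for w, p in zip(row, pixels)) for row in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sign(v):
    return (v > 0) - (v < 0)


def perturb(weights, pixels, target, eps):
    """One sign-gradient step that raises dot(encode(pixels), target).

    For a linear encoder, the gradient of that score with respect to the
    pixels is W^T target, so it can be written out by hand.
    """
    grad = [
        sum(weights[k][i] * target[k] for k in range(len(weights)))
        for i in range(len(pixels))
    ]
    # Move each pixel by at most eps, then clamp to the valid range [0, 1].
    return [min(1.0, max(0.0, p + eps * sign(g))) for p, g in zip(pixels, grad)]


rng = random.Random(0)  # fixed seed so the toy run is repeatable
DIM_IN, DIM_OUT, EPS = 16, 4, 0.03  # hypothetical sizes and budget

W = [[rng.uniform(-1, 1) for _ in range(DIM_IN)] for _ in range(DIM_OUT)]
image = [rng.uniform(0.1, 0.9) for _ in range(DIM_IN)]   # stand-in pixels
target = [rng.uniform(-1, 1) for _ in range(DIM_OUT)]    # stand-in text embedding

adv = perturb(W, image, target, EPS)
score_before = dot(encode(W, image), target)
score_after = dot(encode(W, adv), target)
linf = max(abs(a - b) for a, b in zip(adv, image))

print(f"score before: {score_before:.4f}")
print(f"score after:  {score_after:.4f}")
print(f"largest per-pixel change: {linf:.4f}")
```

In this toy setting the similarity score rises while no pixel moves by more than `EPS`, which is the sense in which a change can be tiny per pixel yet meaningful in embedding space. Whether such a perturbation transfers to another model is an empirical question that the sketch does not address.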
```json
{
  "Industry_response": "Major AI providers have publicly stated in recent weeks that they are actively testing their multimodal models against adversarial attacks and collaborating with external researchers to enhance safety mechanisms for handling images and prompts.",
  "AI_security_awareness": "Security researchers and industry vendors have increasingly warned that multimodal and agentic AI systems can be manipulated through adversarial prompts and inputs almost invisible to human reviewers.",
  "Defenses_in_representation_space": "Recent discussions emphasize that defending AI systems will require not just input-level content filtering but also robustness techniques operating directly in embedding and representation spaces where hidden perturbations occur."
}
```
