webAI has introduced ColVec1, a vision-language retrieval model that ranks first on the ViDoRe V3 benchmark for multimodal document retrieval. Unlike traditional systems that rely on optical character recognition (OCR) to convert documents into text, ColVec1 retrieves information directly from page images, preserving the original document's layout and meaning, which is crucial for visually complex material such as tables and scientific papers. Trained on approximately 2 million question-image pairs drawn from realistic retrieval environments, ColVec1 also challenges the assumption that larger models are inherently better, demonstrating that a specialized design optimized for a specific task can outperform general-purpose scale.

webAI: webAI is an enterprise AI platform that builds and deploys custom AI solutions on local infrastructure for mission-critical environments. The company developed ColVec1, an open-source vision-language retrieval model that ranks #1 on ViDoRe V3 by retrieving directly from document pages without relying on OCR preprocessing.
DocVQA: DocVQA is a document visual question-answering dataset used in the training mixture for ColVec1, contributing document-image pairs to support the model’s retrieval capabilities.
TAT-QA: TAT-QA is a table-and-text question-answering dataset included in ColVec1’s training mixture to enhance performance on documents with mixed textual and tabular content.
ColVec1: ColVec1 is a multimodal embedding model available in 4B and 9B variants that performs document retrieval by processing rendered page images directly rather than converting them to text first. The model uses a single-tower vision-language encoder with ColBERT-style late interaction, enabling fine-grained token-level matching across document pages.
Qwen 3.5: Qwen 3.5 is a vision-language model that serves as the backbone for ColVec1, providing the foundational architecture for processing both text queries and visual document content.
PubTables: PubTables is a dataset focused on table understanding that was incorporated into ColVec1’s training data to improve the model’s ability to handle visually complex tabular layouts.
ViDoRe V3: ViDoRe V3 is a benchmark suite for evaluating multimodal document retrieval in enterprise settings, spanning 10 datasets with approximately 26,000 document pages and over 3,000 human-verified queries across multiple languages and professional domains.
Hugging Face: Hugging Face is a model repository platform where both ColVec1 variants (the 4B and 9B models) are hosted and made publicly available to researchers and developers.
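The ColBERT-style late interaction mentioned above can be sketched concretely. Rather than collapsing a query and a page into single vectors, each side keeps per-token embeddings, and relevance is the sum over query tokens of each token's best match among the page's tokens (the "MaxSim" operator). The sketch below is a minimal, hypothetical illustration of that scoring rule, not ColVec1's actual implementation; the function name and toy dimensions are assumptions.

```python
import numpy as np

def late_interaction_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """ColBERT-style MaxSim scoring (illustrative sketch).

    query_emb: (num_query_tokens, dim) L2-normalized token embeddings
    doc_emb:   (num_page_tokens, dim)  L2-normalized token embeddings
    For each query token, take its maximum cosine similarity over all
    page tokens, then sum those maxima into one relevance score.
    """
    sims = query_emb @ doc_emb.T        # (q_tokens, d_tokens) cosine sims
    return float(sims.max(axis=1).sum())  # best page match per query token

# Toy example: 3 query tokens and 5 page tokens in an 8-dim space.
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8))
q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(5, 8))
d /= np.linalg.norm(d, axis=1, keepdims=True)
score = late_interaction_score(q, d)
```

Because each of the 3 query tokens contributes a cosine similarity in [-1, 1], the score here is bounded by the query length; fine-grained matching like this lets a single query token lock onto, say, one cell of a table on the page.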

```json
{
  "Training_Scale": "The model was trained on approximately 2 million question-image pairs drawn from scientific, business, financial, and multilingual document collections to reflect real-world retrieval environments.",
  "Model_Efficiency": "ColVec1 demonstrates that specialized model design optimized for a specific task can outperform larger general-purpose models, challenging the industry assumption that scale alone determines performance.",
  "Retrieval_Architecture": "ColVec1 eliminates OCR preprocessing by retrieving directly from rendered page images, preserving document structure including table alignment, chart context, and visual hierarchy that text extraction typically loses."
}
```
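The OCR-free architecture described above amounts to a simple retrieval loop: every page is stored as a matrix of token embeddings computed straight from its rendered image, and at query time pages are ranked by their late-interaction score. The following is a hedged sketch of that loop under assumed shapes; the function and page identifiers are hypothetical, and a real system would obtain the embeddings from the model rather than random data.

```python
import numpy as np

def rank_pages(query_tokens: np.ndarray, pages: dict) -> list:
    """Rank rendered-page embeddings by a MaxSim-style score.

    query_tokens: (q_tokens, dim) L2-normalized query embeddings
    pages: maps page id -> (p_tokens, dim) L2-normalized embeddings
           produced directly from page images, with no OCR text step.
    Returns page ids sorted from most to least relevant.
    """
    scores = {}
    for page_id, page_tokens in pages.items():
        sims = query_tokens @ page_tokens.T          # token-level sims
        scores[page_id] = float(sims.max(axis=1).sum())
    return sorted(scores, key=scores.get, reverse=True)

# Toy corpus: one page whose tokens exactly match the query, one random page.
rng = np.random.default_rng(1)
q = rng.normal(size=(4, 8))
q /= np.linalg.norm(q, axis=1, keepdims=True)
other = rng.normal(size=(6, 8))
other /= np.linalg.norm(other, axis=1, keepdims=True)
pages = {"page_match": q.copy(), "page_other": other}
ranking = rank_pages(q, pages)
```

Because the page representation is built from the image itself, features that OCR discards, such as table alignment and chart context, remain available to the matching step.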