---
pipeline_tag: visual-document-retrieval
library_name: transformers
language:
- multilingual
license: other
license_name: webai-non-commercial-license-v1.0
license_link: https://huggingface.co/webAI-Official/webAI-ColVec1-4b/blob/main/LICENSE.md
base_model: Qwen/Qwen3.5-4B
tags:
- text
- image
- video
- multimodal-embedding
- vidore
- colpali
- colqwen3_5
- multilingual-embedding
---

# webAI-Official/webAI-ColVec1-4b

## ⚡ Summary

**webAI-Official/webAI-ColVec1-4b** is a state-of-the-art [ColBERT](https://arxiv.org/abs/2407.01449)-style multimodal embedding model based on *[Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B)*. It maps text queries and visual documents (images, PDFs) into aligned multi-vector embeddings.

The model has been fine-tuned on a **merged multimodal dataset** of ~2M question-image pairs, including:

- [DocVQA](https://huggingface.co/datasets/lmms-lab/DocVQA)
- [PubTables-1M](https://huggingface.co/datasets/bsmock/pubtables-1m)
- [TAT-QA](https://huggingface.co/datasets/next-tat/TAT-QA)
- [ViDoRe-ColPali-Training](https://huggingface.co/datasets/vidore/colpali_train_set)
- [VDR Multilingual](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train)
- [VisRAG-Ret-Train-In-domain-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-In-domain-data)
- [VisRAG-Ret-Train-Synthetic-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-Synthetic-data)
- Proprietary domain-specific synthetic data

The datasets were filtered, balanced, and merged to produce a comprehensive training set optimized for multilingual, multimodal retrieval and document-image understanding. The model achieves **competitive performance across ViDoRe V1 & V3** (English and multilingual).

## 🛠️ Model Specifications

| Feature | Detail |
| --------------------- | ------------------------------------------------------------------------- |
| **Architecture** | Qwen3.5-4B Vision-Language Model (VLM) + `640 dim` Linear Projection Head |
| **Methodology** | ColBERT-style Late Interaction (MaxSim scoring) |
| **Output** | Multi-vector (Seq_Len × *640*), L2-normalized |
| **Modalities** | Text Queries, Images (Documents) |
| **Training Strategy** | LoRA adapters + Fully-trained projection layer |
| **Precision** | `bfloat16` weights, FlashAttention 2 enabled |

---

### Key Properties

- **Unified Encoder (Single-Tower):** A single shared language model processes both images and text. Images are converted into visual tokens via a vision encoder and injected into the token stream; there are no separate dual encoders.

- **Projection Head:** A single linear layer projects final hidden states into a compact embedding space (*hidden_size → 640 dim*); see the sketch after this list.
  - No activation
  - Fully trained
  - Replaces LM head for retrieval

- **Multi-Vector Representation:** Each token becomes an embedding, enabling fine-grained token-level matching instead of single-vector pooling.
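
A minimal sketch of this head design, assuming a backbone whose forward pass yields `last_hidden_state`; the class and attribute names here are illustrative, not the model's actual internals:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 640  # projection output size, per the spec table above

class MultiVectorHead(nn.Module):
    """Illustrative ColBERT-style head: linear projection, no activation, L2-normalized per token."""

    def __init__(self, hidden_size: int, embed_dim: int = EMBED_DIM):
        super().__init__()
        self.proj = nn.Linear(hidden_size, embed_dim)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # (B, seq_len, hidden_size) -> (B, seq_len, embed_dim)
        embeddings = self.proj(last_hidden_state)
        # L2-normalize each token embedding so MaxSim dot products act as cosine similarities
        return F.normalize(embeddings, p=2, dim=-1)
```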

## 📊 Evaluation Results

We report results on the **ViDoRe** benchmark suite. The tables below summarize the image-modality accuracy of `webAI-ColVec1-4b` on the ViDoRe V1 and V3 benchmarks, alongside other webAI `ColVec1` models. Note that the (M)MTEB leaderboards use Borda ranking: each task acts like a voter that ranks models by how well they perform, and models earn more points the higher they rank on a task. The model with the most total points across all tasks takes the top overall rank.
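
A small worked illustration of Borda counting, using hypothetical models and task rankings (not benchmark data):

```python
# Hypothetical Borda count: 3 models ranked by 2 tasks.
task_rankings = [
    ["model_a", "model_b", "model_c"],  # task 1: model_a best
    ["model_b", "model_a", "model_c"],  # task 2: model_b best
]

points: dict[str, int] = {}
for ranking in task_rankings:
    n = len(ranking)
    for position, model in enumerate(ranking):
        # The top rank earns n-1 points, the bottom rank 0 (standard Borda count).
        points[model] = points.get(model, 0) + (n - 1 - position)

# model_a and model_b tie with 3 points each; model_c has 0.
print(sorted(points.items(), key=lambda kv: -kv[1]))
```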

### ViDoRe V3 (NDCG@10)

### ViDoRe V1 (NDCG@5)

---

## 💻 Usage

The processor exposes three primary methods for encoding inputs and computing retrieval scores.

#### `process_images(images, max_length=None)`

Encodes a batch of document images into model-ready tensors. Pass the result directly to the model with `**batch`.

| Parameter | Type | Description |
| ------------ | ----------------------- | ------------------------------------------------------------------- |
| `images` | `List[PIL.Image.Image]` | Document page images. Each image is automatically converted to RGB. |
| `max_length` | `int`, optional | Maximum sequence length; defaults to `None`. |

```python
batch = processor.process_images(images=pil_images)
batch = {k: v.to(device) for k, v in batch.items()}
embeddings = model(**batch)  # shape: (B, seq_len, embed_dim)
```

---

#### `process_queries(texts, max_length=None)`

Encodes a batch of text queries into model-ready tensors.

| Parameter | Type | Description |
| ------------ | ----------- | ------------------------------- |
| `texts` | `List[str]` | Natural-language query strings. |
| `max_length` | `int`, optional | Maximum sequence length; defaults to `None`. |

```python
batch = processor.process_queries(texts=["What is the revenue for Q3?"])
batch = {k: v.to(device) for k, v in batch.items()}
embeddings = model(**batch)  # shape: (B, seq_len, embed_dim)
```

---

#### `score_multi_vector(qs, ps, batch_size=128, device=None)`

Computes ColBERT-style **MaxSim** late-interaction scores between a list of query embeddings and a list of passage (document) embeddings. For each query token, the maximum dot product across all passage tokens is found; these maxima are summed to produce a single scalar score per (query, passage) pair.

| Parameter | Type | Description |
| ------------ | -------------------------- | ---------------------------------------------------------------------- |
| `qs` | `List[Tensor]` or `Tensor` | Query embeddings. Each tensor has shape `(seq_len_q, embed_dim)`. |
| `ps` | `List[Tensor]` or `Tensor` | Passage embeddings. Each tensor has shape `(seq_len_p, embed_dim)`. |
| `batch_size` | `int` | Number of queries processed per inner loop iteration (default: `128`). |
| `device` | `str` or `torch.device` | Device used to compute the scores; defaults to `None`. |

Returns a `torch.Tensor` of shape `(n_queries, n_passages)` on CPU in `float32`. Higher scores indicate greater relevance.

```python
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
# scores[i, j] is the relevance of document j to query i
best_doc_per_query = scores.argmax(dim=1)
```
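
For intuition, here is a minimal sketch of the MaxSim computation for a single (query, passage) pair, assuming L2-normalized embeddings as produced by the model; this is an illustration, not the processor's batched implementation:

```python
import torch

def maxsim_score(q: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    """q: (seq_len_q, embed_dim), p: (seq_len_p, embed_dim); returns a scalar score."""
    # Dot products between every query token and every passage token: (seq_len_q, seq_len_p)
    sim = q @ p.T
    # For each query token, keep its best-matching passage token, then sum over query tokens.
    return sim.max(dim=1).values.sum()
```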

### Prerequisites

We strongly recommend installing `flash-attn`. If it is not available, set `attn_implementation="sdpa"` instead.

We currently support only `torch==2.8.0`. For newer PyTorch versions, please build FlashAttention manually; otherwise throughput may be low. Also note that `torch==2.8.0` supports Python versions `>= 3.9` and `<= 3.13`.

```bash
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
pip install transformers pillow requests
pip install flash-attn --no-build-isolation
```
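
If you are unsure whether `flash-attn` is installed, one way to pick the attention backend at runtime is shown below; this is a convenience sketch, not part of the model's API:

```python
import importlib.util

# Fall back to PyTorch's built-in SDPA kernel when flash-attn is not installed.
ATTN_IMPL = (
    "flash_attention_2"
    if importlib.util.find_spec("flash_attn") is not None
    else "sdpa"
)
```

You can then pass `ATTN_IMPL` as `attn_implementation` when loading the model in the script below.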

### Inference Code

```python
import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image, UnidentifiedImageError
import requests
from io import BytesIO

# Configuration
MODEL_ID = "webAI-Official/webAI-ColVec1-4b"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load Model & Processor
processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)

model = AutoModel.from_pretrained(
    MODEL_ID,
    dtype=DTYPE,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
    device_map=DEVICE,
).eval()

# Sample Data
queries = [
    "Retrieve the city of Singapore",
    "Retrieve the city of Beijing",
    "Retrieve the city of London",
]
docs = [
    "https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG",
    "https://upload.wikimedia.org/wikipedia/commons/4/49/London_skyline.jpg",
]

def load_image(url: str) -> Image.Image:
    # Some CDNs (e.g., Wikimedia) expect a browser-like UA to avoid 403s.
    for headers in ({}, {"User-Agent": "Mozilla/5.0 (compatible; ColQwen3-demo/1.0)"}):
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 403:
            continue
        resp.raise_for_status()
        try:
            return Image.open(BytesIO(resp.content)).convert("RGB")
        except UnidentifiedImageError as e:
            raise RuntimeError(f"Failed to decode image from {url}") from e
    raise RuntimeError(f"Could not fetch image (HTTP 403) from {url}; try downloading locally and loading from file path.")

# Helper Functions
def encode_queries(texts, batch_size=8):
    outputs = []
    for start in range(0, len(texts), batch_size):
        batch = processor.process_queries(texts=texts[start : start + batch_size])
        batch = {k: v.to(DEVICE) for k, v in batch.items()}
        with torch.inference_mode():
            embeddings = model(**batch)
        vecs = embeddings.to(torch.bfloat16).cpu()
        outputs.extend(vecs)
    return outputs

def encode_docs(urls, batch_size=4):
    pil_images = [load_image(url) for url in urls]
    outputs = []
    for start in range(0, len(pil_images), batch_size):
        batch_imgs = pil_images[start : start + batch_size]
        features = processor.process_images(images=batch_imgs)
        features = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in features.items()}
        with torch.inference_mode():
            embeddings = model(**features)
        vecs = embeddings.to(torch.bfloat16).cpu()
        outputs.extend(vecs)
    return outputs

# Execution
query_embeddings = encode_queries(queries)
doc_embeddings = encode_docs(docs)

# MaxSim Scoring
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
print(scores)
```
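
For the three sample queries and documents above, `scores` is a `(3, 3)` tensor. As a usage sketch following the script, one way to turn it into a per-query ranking:

```python
# Rank documents for each query (highest MaxSim score first).
ranked = scores.argsort(dim=1, descending=True)
for qi, query in enumerate(queries):
    print(query, "->", [docs[di] for di in ranked[qi].tolist()])
```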

---

## ⚖️ Strengths & Limitations

### Strengths

- **Performance:** State-of-the-art retrieval on the ViDoRe V1 & V3 benchmarks, with excellent results on multimodal document retrieval.
- **Complex Layouts:** Excellent handling of chart-rich PDFs and domain-specific documents.
- **End-to-end Retrieval:** Capable of OCR-free retrieval on unseen multimodal documents, without using an intermediate vision LLM to generate summaries for retrieval.
- **Multilingualism:** Strong performance on non-English document inputs.

### Limitations

- **Storage Cost:** Multi-vector indexes are still larger than single-vector baselines despite the smaller per-token dimension; see the rough estimate below.
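
As a back-of-the-envelope illustration (the ~750-token page count is an assumption for illustration, not a measured figure):

```python
# Hypothetical storage comparison for one document page in bfloat16 (2 bytes/value).
tokens_per_page = 750      # assumed visual token count; varies with page resolution
embed_dim = 640

multi_vector_bytes = tokens_per_page * embed_dim * 2   # ~0.96 MB per page
single_vector_bytes = embed_dim * 2                    # ~1.3 KB per page

print(multi_vector_bytes, single_vector_bytes)
```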

### License & Data

[LICENSE](https://huggingface.co/webAI-Official/webAI-ColVec1-4b/blob/main/LICENSE.md)

## 📚 Citation

If you use this model, please cite:

```bibtex
@misc{webAI-ColVec1,
  title={webAI-ColVec1: Late-Interaction Multi-Vector Embedding Model for Visual Document Retrieval},
  author={webAI},
  year={2026},
  url={https://huggingface.co/webAI-Official/webAI-ColVec1-4b}
}
```