---
pipeline_tag: visual-document-retrieval
library_name: transformers
language:
- multilingual
license: other
license_name: webai-non-commercial-license-v1.0
license_link: https://huggingface.co/webAI-Official/webAI-ColVec1-4b/blob/main/LICENSE.md
base_model: Qwen/Qwen3.5-4B
tags:
- text
- image
- video
- multimodal-embedding
- vidore
- colpali
- colqwen3_5
- multilingual-embedding
---

# webAI-Official/webAI-ColVec1-4b

## ⚡ Summary

**webAI-Official/webAI-ColVec1-4b** is a state-of-the-art [ColBERT](https://arxiv.org/abs/2407.01449)-style multimodal embedding model based on *[Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B)*. It maps text queries and visual documents (images, PDFs) into an aligned multi-vector embedding space.

The model has been fine-tuned on a **merged multimodal dataset** of ~2M question-image pairs, including:

- [DocVQA](https://huggingface.co/datasets/lmms-lab/DocVQA)
- [PubTables-1M](https://huggingface.co/datasets/bsmock/pubtables-1m)
- [TAT-QA](https://huggingface.co/datasets/next-tat/TAT-QA)
- [ViDoRe-ColPali-Training](https://huggingface.co/datasets/vidore/colpali_train_set)
- [VDR Multilingual](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train)
- [VisRAG-Ret-Train-In-domain-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-In-domain-data)
- [VisRAG-Ret-Train-Synthetic-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-Synthetic-data)
- Proprietary domain-specific synthetic data

The datasets were filtered, balanced, and merged to produce a comprehensive training set optimized for multilingual, multimodal retrieval and document-image understanding. The model achieves **competitive performance across ViDoRe V1 & V3** (English and multilingual).

## 🛠️ Model Specifications

| Feature               | Detail                                                                    |
| --------------------- | ------------------------------------------------------------------------- |
| **Architecture**      | Qwen3.5-4B Vision-Language Model (VLM) + `640 dim` Linear Projection Head |
| **Methodology**       | ColBERT-style Late Interaction (MaxSim scoring)                           |
| **Output**            | Multi-vector (Seq_Len × *640*), L2-normalized                             |
| **Modalities**        | Text Queries, Images (Documents)                                          |
| **Training Strategy** | LoRA adapters + Fully-trained projection layer                            |
| **Precision**         | `bfloat16` weights, FlashAttention 2 enabled                              |

---

### Key Properties

- **Unified Encoder (Single-Tower):** A single shared language model processes both images and text. Images are converted into visual tokens by a vision encoder and injected into the token stream; there are no separate dual encoders.

- **Projection Head:** A single linear layer projects final hidden states into a compact embedding space (*hidden_size → 640 dim*).
  - No activation
  - Fully trained
  - Replaces the LM head for retrieval

- **Multi-Vector Representation:** Each token becomes an embedding, enabling fine-grained token-level matching instead of single-vector pooling.

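As a minimal sketch of the projection-head step described above — the `hidden_size` value here is a placeholder, not the actual Qwen3.5-4B hidden size:

```python
import torch
import torch.nn as nn

hidden_size, embed_dim = 2560, 640  # hidden_size is illustrative only
seq_len = 12

# Final hidden states for one sequence: (batch, seq_len, hidden_size)
hidden_states = torch.randn(1, seq_len, hidden_size)

# Linear projection head: no activation, replaces the LM head for retrieval
projection = nn.Linear(hidden_size, embed_dim, bias=False)

embeddings = projection(hidden_states)
# L2-normalize each token vector so MaxSim dot products are cosine similarities
embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
```

Every token keeps its own 640-dim vector, so a page is represented by a `(seq_len, 640)` matrix rather than a single pooled vector.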
66
+ ## 📊 Evaluation Results
67
+
68
+ We report results on the **ViDoRe** benchmark suite. The tables below summarize the image-modality accuracy of `webAI-ColVec1-4b` on the ViDoRe V1 and V3 benchmarks, alongside other webAI `ColVec1` models. Note that (M)MTEB leaderboards use Borda ranking. Each task acts like a voter that ranks models based on how well they perform. Models earn more points when they rank higher on a task. The model with the most total points across all tasks gets the top overall rank.
69
+
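For illustration only, the Borda counting idea described above can be sketched as follows (the leaderboard's exact weighting and tie-breaking may differ):

```python
def borda(task_rankings):
    # Each task ranks models best-to-worst; rank r among m models earns
    # (m - 1 - r) points, and points are summed across tasks.
    points = {}
    for ranking in task_rankings:
        m = len(ranking)
        for r, model in enumerate(ranking):
            points[model] = points.get(model, 0) + (m - 1 - r)
    return points

print(borda([["A", "B", "C"], ["B", "A", "C"]]))  # → {'A': 3, 'B': 3, 'C': 0}
```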
### ViDoRe V3 (NDCG@10)

### ViDoRe V1 (NDCG@5)

---

## 💻 Usage

The processor exposes three primary methods for encoding inputs and computing retrieval scores.

#### `process_images(images, max_length=None)`

Encodes a batch of document images into model-ready tensors. Pass the result directly to the model with `**batch`.

| Parameter    | Type                    | Description                                                         |
| ------------ | ----------------------- | ------------------------------------------------------------------- |
| `images`     | `List[PIL.Image.Image]` | Document page images. Each image is automatically converted to RGB. |
| `max_length` | `int`, optional         | Maximum sequence length. Defaults to `None`.                        |

```python
batch = processor.process_images(images=pil_images)
batch = {k: v.to(device) for k, v in batch.items()}
embeddings = model(**batch)  # shape: (B, seq_len, embed_dim)
```

---

#### `process_queries(texts, max_length=None)`

Encodes a batch of text queries into model-ready tensors.

| Parameter    | Type            | Description                                  |
| ------------ | --------------- | -------------------------------------------- |
| `texts`      | `List[str]`     | Natural-language query strings.              |
| `max_length` | `int`, optional | Maximum sequence length. Defaults to `None`. |

```python
batch = processor.process_queries(texts=["What is the revenue for Q3?"])
batch = {k: v.to(device) for k, v in batch.items()}
embeddings = model(**batch)  # shape: (B, seq_len, embed_dim)
```

---

#### `score_multi_vector(qs, ps, batch_size=128, device=None)`

Computes ColBERT-style **MaxSim** late-interaction scores between a list of query embeddings and a list of passage (document) embeddings. For each query token, the maximum dot product across all passage tokens is found; these maxima are summed to produce a single scalar score per (query, passage) pair.

| Parameter    | Type                       | Description                                                            |
| ------------ | -------------------------- | ---------------------------------------------------------------------- |
| `qs`         | `List[Tensor]` or `Tensor` | Query embeddings. Each tensor has shape `(seq_len_q, embed_dim)`.      |
| `ps`         | `List[Tensor]` or `Tensor` | Passage embeddings. Each tensor has shape `(seq_len_p, embed_dim)`.    |
| `batch_size` | `int`                      | Number of queries processed per inner loop iteration (default: `128`). |
| `device`     | `str` or `torch.device`    | Device on which scores are computed. Defaults to `None`.               |

Returns a `torch.Tensor` of shape `(n_queries, n_passages)` on CPU in `float32`. Higher scores indicate greater relevance.

```python
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
# scores[i, j] is the relevance of document j to query i
best_doc_per_query = scores.argmax(dim=1)
```

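For intuition, here is a naive reference implementation of the MaxSim scoring rule described above; the shipped `score_multi_vector` is batched and far more efficient, so treat this only as a readable sketch:

```python
import torch

def maxsim(qs, ps):
    # For each (query, passage) pair: take the max dot product over passage
    # tokens for every query token, then sum the maxima into one score.
    scores = torch.zeros(len(qs), len(ps))
    for i, q in enumerate(qs):          # q: (seq_len_q, embed_dim)
        for j, p in enumerate(ps):      # p: (seq_len_p, embed_dim)
            sim = q @ p.T               # (seq_len_q, seq_len_p)
            scores[i, j] = sim.max(dim=1).values.sum()
    return scores
```

Because the token vectors are L2-normalized, each dot product is a cosine similarity, and a higher summed score means more query tokens found a closely matching passage token.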
### Prerequisites

We strongly recommend installing `flash-attn`. If it is not available, load the model with `attn_implementation="sdpa"` instead.

Currently only `torch==2.8.0` is supported; for a newer PyTorch version, please build FlashAttention manually, otherwise throughput could be low. Also note that `torch==2.8.0` supports Python versions `>= 3.9` and `<= 3.13`.

```bash
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
pip install transformers pillow requests
pip install flash-attn --no-build-isolation
```

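One way to apply the fallback advice above is to detect `flash-attn` at load time; `attn_impl` here is a local variable you would pass as `attn_implementation` to `AutoModel.from_pretrained`:

```python
import importlib.util

# Use FlashAttention 2 when the flash-attn package is installed; otherwise
# fall back to PyTorch's built-in SDPA attention, as suggested above.
attn_impl = "flash_attention_2" if importlib.util.find_spec("flash_attn") else "sdpa"
print(attn_impl)
```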
### Inference Code

```python
import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image, UnidentifiedImageError
import requests
from io import BytesIO

# Configuration
MODEL_ID = "webAI-Official/webAI-ColVec1-4b"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load Model & Processor
processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)

model = AutoModel.from_pretrained(
    MODEL_ID,
    dtype=DTYPE,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
    device_map=DEVICE,
).eval()

# Sample Data
queries = [
    "Retrieve the city of Singapore",
    "Retrieve the city of Beijing",
    "Retrieve the city of London",
]
docs = [
    "https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG",
    "https://upload.wikimedia.org/wikipedia/commons/4/49/London_skyline.jpg",
]

def load_image(url: str) -> Image.Image:
    # Some CDNs (e.g., Wikimedia) expect a browser-like UA to avoid 403s.
    for headers in ({}, {"User-Agent": "Mozilla/5.0 (compatible; ColQwen3-demo/1.0)"}):
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 403:
            continue
        resp.raise_for_status()
        try:
            return Image.open(BytesIO(resp.content)).convert("RGB")
        except UnidentifiedImageError as e:
            raise RuntimeError(f"Failed to decode image from {url}") from e
    raise RuntimeError(f"Could not fetch image (HTTP 403) from {url}; try downloading locally and loading from file path.")

# Helper Functions
def encode_queries(texts, batch_size=8):
    outputs = []
    for start in range(0, len(texts), batch_size):
        batch = processor.process_queries(texts=texts[start : start + batch_size])
        batch = {k: v.to(DEVICE) for k, v in batch.items()}
        with torch.inference_mode():
            embeddings = model(**batch)
        vecs = embeddings.to(torch.bfloat16).cpu()
        outputs.extend(vecs)
    return outputs

def encode_docs(urls, batch_size=4):
    pil_images = [load_image(url) for url in urls]
    outputs = []
    for start in range(0, len(pil_images), batch_size):
        batch_imgs = pil_images[start : start + batch_size]
        features = processor.process_images(images=batch_imgs)
        features = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in features.items()}
        with torch.inference_mode():
            embeddings = model(**features)
        vecs = embeddings.to(torch.bfloat16).cpu()
        outputs.extend(vecs)
    return outputs

# Execution
query_embeddings = encode_queries(queries)
doc_embeddings = encode_docs(docs)

# MaxSim Scoring
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
print(scores)
```

---

## ⚖️ Strengths & Limitations

### Strengths

- **Performance:** State-of-the-art retrieval on the ViDoRe V1 & V3 benchmarks, with excellent results on multimodal document retrieval.
- **Complex Layouts:** Excellent handling of chart-rich PDFs and domain-specific documents.
- **End-to-end Retrieval:** Capable of OCR-free retrieval on unseen multimodal documents, without using an intermediate vision LLM to generate summaries for retrieval.
- **Multilingualism:** Strong performance on non-English document inputs.

### Limitations

- **Storage Cost:** Still larger than single-vector baselines despite the smaller token dimension.

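To make that cost concrete, a back-of-envelope illustration — the per-page token count here is an assumption (it varies with image resolution and the vision encoder's patching), not a measured value for this model:

```python
# Hypothetical index size per document page.
tokens_per_page = 768   # assumption for illustration
embed_dim = 640
bytes_per_value = 2     # bfloat16

multi_vector_bytes = tokens_per_page * embed_dim * bytes_per_value  # one matrix per page
single_vector_bytes = embed_dim * bytes_per_value                   # one pooled vector per page

print(multi_vector_bytes // 1024, "KiB vs", single_vector_bytes, "bytes")  # → 960 KiB vs 1280 bytes
```

Under these assumptions the multi-vector index is several hundred times larger than a single-vector one, which is the trade-off for token-level matching.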
### License & Data

[LICENSE](https://huggingface.co/webAI-Official/webAI-ColVec1-4b/blob/main/LICENSE.md)

## 📚 Citation

If you use this model, please cite:

```bibtex
@misc{webAI-ColVec1,
  title={webAI-ColVec1: Late-Interaction Multi-Vector Embedding Model for Visual Document Retrieval},
  author={webAI},
  year={2026},
  url={https://huggingface.co/webAI-Official/webAI-ColVec1-4b}
}
```