psamal committed · Commit 3767539 · verified · Parent: 9d5fbfe

README update

Files changed (1): README.md (+31 -22)
README.md CHANGED
@@ -17,36 +17,28 @@ tags:
  - colqwen3_5
  - multilingual-embedding
  ---

  # webAI-Official/webAI-ColVec1-9b

  ## ⚡ Summary

  **webAI-Official/webAI-ColVec1-9b** is a state-of-the-art [ColBERT](https://arxiv.org/abs/2407.01449)-style multimodal embedding model based on *[Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B)*. It maps text queries and visual documents (images, PDFs) into aligned multi-vector embeddings.

- The model has been fine-tuned on a **merged multimodal dataset** of ~2M question-image pairs, including:
-
- - [DocVQA](https://huggingface.co/datasets/lmms-lab/DocVQA)
- - [PubTables-1M](https://huggingface.co/datasets/bsmock/pubtables-1m)
- - [TAT-QA](https://huggingface.co/datasets/next-tat/TAT-QA)
- - [ViDoRe-ColPali-Training](https://huggingface.co/datasets/vidore/colpali_train_set)
- - [VDR Multilingual](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train)
- - [VisRAG-Ret-Train-In-domain-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-In-domain-data)
- - [VisRAG-Ret-Train-Synthetic-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-Synthetic-data)
- - Proprietary domain-specific synthetic data

  The datasets were filtered, balanced, and merged to produce a comprehensive training set optimized for multilingual, multimodal retrieval and document-image understanding. The model achieves **competitive performance across ViDoRe V1 & V3** (English and multilingual).

  ## 🛠️ Model Specifications

- | Feature | Detail |
- | --------------------- | -------------------------------------------------------------------------- |
  | **Architecture** | Qwen3.5-4B Vision-Language Model (VLM) + `2560 dim` Linear Projection Head |
- | **Methodology** | ColBERT-style Late Interaction (MaxSim scoring) |
  | **Output** | Multi-vector (Seq_Len × *2560*), L2-normalized |
- | **Modalities** | Text Queries, Images (Documents) |
- | **Training Strategy** | LoRA adapters + Fully-trained projection layer |
- | **Precision** | `bfloat16` weights, FlashAttention 2 enabled |

  ---
@@ -68,9 +60,28 @@ We report results on the **ViDoRe** benchmark suite. The tables below summarize

  ### ViDoRe V3 (NDCG@10)

  ### ViDoRe V1 (NDCG@5)

  ---
 
@@ -176,13 +187,11 @@ model = AutoModel.from_pretrained(
  # Sample Data
  queries = [
      "Retrieve the city of Singapore",
-     "Retrieve the city of Beijing",
-     "Retrieve the city of London",
  ]
  docs = [
      "https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
-     "https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG",
-     "https://upload.wikimedia.org/wikipedia/commons/4/49/London_skyline.jpg",
  ]

  def load_image(url: str) -> Image.Image:
@@ -249,7 +258,7 @@ print(scores)

  ### License & Data

- [LICENSE](https://huggingface.co/webAI-Official/webAI-ColVec1-4b/blob/main/LICENSE.md)

  ## 📚 Citation

@@ -262,4 +271,4 @@ If you use this model, please cite:
      year={2026},
      url={https://huggingface.co/webAI-Official/webAI-ColVec1-9b}
  }
- ```
 
@@ -17,36 +17,28 @@ tags:
  - colqwen3_5
  - multilingual-embedding
  ---
+
  # webAI-Official/webAI-ColVec1-9b

  ## ⚡ Summary

  **webAI-Official/webAI-ColVec1-9b** is a state-of-the-art [ColBERT](https://arxiv.org/abs/2407.01449)-style multimodal embedding model based on *[Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B)*. It maps text queries and visual documents (images, PDFs) into aligned multi-vector embeddings.

+ The model has been fine-tuned on a **merged multimodal dataset** of ~2M question-image pairs, including [DocVQA](https://huggingface.co/datasets/lmms-lab/DocVQA), [PubTables-1M](https://huggingface.co/datasets/bsmock/pubtables-1m), [TAT-QA](https://huggingface.co/datasets/next-tat/TAT-QA), [ViDoRe-ColPali-Training](https://huggingface.co/datasets/vidore/colpali_train_set), [VDR Multilingual](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train), [VisRAG-Ret-Train-In-domain-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-In-domain-data), [VisRAG-Ret-Train-Synthetic-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-Synthetic-data), and proprietary domain-specific synthetic data.

  The datasets were filtered, balanced, and merged to produce a comprehensive training set optimized for multilingual, multimodal retrieval and document-image understanding. The model achieves **competitive performance across ViDoRe V1 & V3** (English and multilingual).

  ## 🛠️ Model Specifications

+ | Feature | Detail |
+ | --------------------- | -------------------------------------------------------------------------- |
  | **Architecture** | Qwen3.5-4B Vision-Language Model (VLM) + `2560 dim` Linear Projection Head |
+ | **Methodology** | ColBERT-style Late Interaction (MaxSim scoring) |
  | **Output** | Multi-vector (Seq_Len × *2560*), L2-normalized |
+ | **Modalities** | Text Queries, Images (Documents) |
+ | **Training Strategy** | LoRA adapters + Fully-trained projection layer |
+ | **Precision** | `bfloat16` weights, FlashAttention 2 enabled |

  ---
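The **MaxSim** late interaction named in the specifications table can be sketched in plain Python. This is a simplified illustration, not the model's actual implementation: the real scoring runs batched over `Seq_Len × 2560` L2-normalized vectors, so the max-dot-product reduces to max cosine similarity.

```python
def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: each query token vector takes the
    dot product with its best-matching document token vector, and the
    per-token maxima are summed into one relevance score."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy 2-dim unit vectors: a perfectly matching document scores 2.0.
print(maxsim_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]))  # -> 2.0
```

Because each query token is matched independently, late interaction preserves token-level detail that single-vector pooling discards, at the cost of storing one vector per document token.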
 
@@ -68,9 +60,28 @@ We report results on the **ViDoRe** benchmark suite. The tables below summarize

  ### ViDoRe V3 (NDCG@10)

+ | Model | CompSci | Energy | FinanceEn | FinanceFr | HR | Industrial | Pharma | Physics | **Avg (Public)** |
+ | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
+ | **[webAI-Vault1-9b](https://huggingface.co/webAI-Official/webAI-ColVec1-9b)** | **0.8092** | 0.6976 | 0.6827 | **0.5372** | **0.7004** | **0.5718** | **0.6732** | 0.4838 | **0.6445** |
+ | [nemotron-colembed-vl-8b-v2](https://huggingface.co/nvidia/nemotron-colembed-vl-8b-v2) | 0.7929 | **0.6982** | 0.6729 | 0.5154 | 0.6632 | 0.5603 | 0.6719 | **0.5084** | 0.6354 |
+ | **[webAI-Vault1-4b](https://huggingface.co/webAI-Official/webAI-ColVec1-4b)** | 0.7983 | 0.6869 | **0.6848** | 0.5111 | 0.6739 | 0.5573 | 0.6567 | 0.5014 | 0.6338 |
+ | [tomoro-colqwen3-embed-8b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-8b) | 0.7535 | 0.6841 | 0.6508 | 0.4910 | 0.6398 | 0.5441 | 0.6636 | 0.5013 | 0.6160 |
+ | [colqwen3.5-4.5B-v3](https://huggingface.co/athrael-soju/colqwen3.5-4.5B-v3) | 0.7866 | 0.6804 | 0.6406 | 0.4856 | 0.6206 | 0.5520 | 0.6559 | 0.5034 | 0.6156 |

  ### ViDoRe V1 (NDCG@5)

+ | Model | ArxivQA | DocVQA | InfoVQA | Shift | Syn-AI | Syn-Eng | Syn-Gov | Syn-Health | TabFQuAD | Tatdqa | **Avg** |
+ | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
+ | [nemotron-colembed-vl-8b-v2](https://huggingface.co/nvidia/nemotron-colembed-vl-8b-v2) | 0.9310 | 0.6810 | 0.9460 | **0.9330** | **1.0000** | 0.9790 | **0.9890** | 0.9960 | 0.9770 | 0.8340 | **0.9270** |
+ | [llama-nemotron-colembed-vl-3b-v2](https://huggingface.co/nvidia/llama-nemotron-colembed-vl-3b-v2) | 0.9040 | 0.6720 | 0.9470 | 0.9200 | **1.0000** | **0.9800** | 0.9800 | 0.9890 | 0.9730 | 0.8100 | 0.9170 |
+ | [nemotron-colembed-vl-4b-v2](https://huggingface.co/nvidia/nemotron-colembed-vl-4b-v2) | 0.9200 | 0.6740 | 0.9330 | 0.9230 | 0.9930 | 0.9620 | 0.9800 | 0.9850 | **0.9810** | 0.8120 | 0.9160 |
+ | [colqwen3.5-4.5B-v3](https://huggingface.co/athrael-soju/colqwen3.5-4.5B-v3) | 0.9190 | 0.6660 | 0.9360 | 0.9020 | **1.0000** | 0.9710 | 0.9730 | 0.9890 | 0.9590 | **0.8400** | 0.9150 |
+ | **[webAI-Vault1-9b](TODO)** | **0.9413** | **0.6882** | **0.9505** | 0.8758 | 0.9963 | 0.9739 | 0.9839 | 0.9926 | 0.9460 | 0.7956 | 0.9144 |
+ | [Ops-Colqwen3-4B](https://huggingface.co/OpenSearch-AI/Ops-Colqwen3-4B) | 0.9180 | 0.6650 | 0.9400 | 0.9080 | 0.9960 | 0.9730 | 0.9800 | 0.9960 | 0.9360 | 0.8240 | 0.9140 |
+ | **[SauerkrautLM-ColQwen3-8b-v0.1](https://huggingface.co/VAGOsolutions/SauerkrautLM-ColQwen3-8b-v0.1)** | 0.9380 | 0.6470 | 0.9450 | 0.9040 | 0.9860 | 0.9650 | 0.9680 | 0.9930 | 0.9220 | 0.8400 | 0.9110 |
+ | **[webAI-Vault1-4b](TODO)** | 0.9258 | 0.6773 | 0.9412 | 0.8764 | **1.0000** | 0.9703 | 0.9721 | **1.0000** | 0.9414 | 0.7950 | 0.9100 |

  ---
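For reference, the NDCG@k metric reported in both tables can be sketched as follows. This is a minimal illustration over graded relevance labels for the retrieved list; benchmark harnesses compute the ideal DCG over the full judged pool rather than only the retrieved documents.

```python
import math

def ndcg_at_k(relevances, k):
    """relevances: relevance grades of retrieved docs, in ranked order.
    DCG discounts each gain by log2(rank + 1); NDCG divides by the DCG
    of the ideal (descending) ordering, giving a score in [0, 1]."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([1, 0, 0], 3))  # relevant doc ranked first -> 1.0
```

The log discount is why NDCG rewards placing relevant documents near the top of the ranking, not merely retrieving them within the cutoff.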
 
 
  # Sample Data
  queries = [
      "Retrieve the city of Singapore",
+     "Retrieve the city of Beijing"
  ]
  docs = [
      "https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
+     "https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG"
  ]

  def load_image(url: str) -> Image.Image:
 
  ### License & Data

+ [LICENSE](https://huggingface.co/webAI-Official/webAI-ColVec1-9b/blob/main/LICENSE.md)

  ## 📚 Citation

      year={2026},
      url={https://huggingface.co/webAI-Official/webAI-ColVec1-9b}
  }
+ ```