README Update
README.md
**webAI-Official/webAI-ColVec1-4b** is a state-of-the-art [ColBERT](https://arxiv.org/abs/2407.01449)-style multimodal embedding model based on *[Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B)*. It maps text queries and visual documents (images, PDFs) into aligned multi-vector embeddings.

The model has been fine-tuned on a **merged multimodal dataset** of ~2M question-image pairs, including:

- [DocVQA](https://huggingface.co/datasets/lmms-lab/DocVQA)
- [PubTables-1M](https://huggingface.co/datasets/bsmock/pubtables-1m)
- [TAT-QA](https://huggingface.co/datasets/next-tat/TAT-QA)
- [ViDoRe-ColPali-Training](https://huggingface.co/datasets/vidore/colpali_train_set)
- [VDR Multilingual](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train)
- [VisRAG-Ret-Train-In-domain-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-In-domain-data)
- [VisRAG-Ret-Train-Synthetic-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-Synthetic-data)
- Proprietary domain-specific synthetic data

The datasets were filtered, balanced, and merged to produce a comprehensive training set optimized for multilingual, multimodal retrieval and document-image understanding. The model achieves **competitive performance across ViDoRe V1 & V3** (English and multilingual).
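The exact filtering and balancing recipe is not published; as a hedged illustration of the kind of balancing described, per-source capping followed by shuffling is one common way to merge pools of unequal size (`balance_merge` and the sizes below are hypothetical):

```python
import random

def balance_merge(datasets: dict[str, list], cap: int, seed: int = 0) -> list:
    """Cap each source at `cap` examples, then shuffle the merged pool.
    Illustrative only; the model's actual recipe is not specified."""
    rng = random.Random(seed)
    merged = []
    for _name, rows in datasets.items():
        rows = list(rows)
        rng.shuffle(rows)          # sample uniformly within each source
        merged.extend(rows[:cap])  # cap over-represented sources
    rng.shuffle(merged)            # interleave sources for training
    return merged

pools = {"docvqa": list(range(5)), "tatqa": list(range(3))}
print(len(balance_merge(pools, cap=4)))  # 7
```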

We report results on the **ViDoRe** benchmark suite. The tables below summarize the results.

### ViDoRe V3 (NDCG@10)

| Model | CompSci | Energy | FinanceEn | FinanceFr | HR | Industrial | Pharma | Physics | **Avg (Public)** |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **[webAI-Vault1-9b](https://huggingface.co/webAI-Official/webAI-ColVec1-9b)** | **0.8092** | 0.6976 | 0.6827 | **0.5372** | **0.7004** | **0.5718** | **0.6732** | 0.4838 | **0.6445** |
| [nemotron-colembed-vl-8b-v2](https://huggingface.co/nvidia/nemotron-colembed-vl-8b-v2) | 0.7929 | **0.6982** | 0.6729 | 0.5154 | 0.6632 | 0.5603 | 0.6719 | **0.5084** | 0.6354 |
| **[webAI-Vault1-4b](https://huggingface.co/webAI-Official/webAI-ColVec1-4b)** | 0.7983 | 0.6869 | **0.6848** | 0.5111 | 0.6739 | 0.5573 | 0.6567 | 0.5014 | 0.6338 |
| [tomoro-colqwen3-embed-8b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-8b) | 0.7535 | 0.6841 | 0.6508 | 0.4910 | 0.6398 | 0.5441 | 0.6636 | 0.5013 | 0.6160 |
| [colqwen3.5-4.5B-v3](https://huggingface.co/athrael-soju/colqwen3.5-4.5B-v3) | 0.7866 | 0.6804 | 0.6406 | 0.4856 | 0.6206 | 0.5520 | 0.6559 | 0.5034 | 0.6156 |

### ViDoRe V1 (NDCG@5)

| Model | ArxivQA | DocVQA | InfoVQA | Shift | Syn-AI | Syn-Eng | Syn-Gov | Syn-Health | TabFQuAD | Tatdqa | **Avg** |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| [nemotron-colembed-vl-8b-v2](https://huggingface.co/nvidia/nemotron-colembed-vl-8b-v2) | 0.9310 | 0.6810 | 0.9460 | **0.9330** | **1.0000** | 0.9790 | **0.9890** | 0.9960 | 0.9770 | 0.8340 | **0.9270** |
| [llama-nemotron-colembed-vl-3b-v2](https://huggingface.co/nvidia/llama-nemotron-colembed-vl-3b-v2) | 0.9040 | 0.6720 | 0.9470 | 0.9200 | **1.0000** | **0.9800** | 0.9800 | 0.9890 | 0.9730 | 0.8100 | 0.9170 |
| [nemotron-colembed-vl-4b-v2](https://huggingface.co/nvidia/nemotron-colembed-vl-4b-v2) | 0.9200 | 0.6740 | 0.9330 | 0.9230 | 0.9930 | 0.9620 | 0.9800 | 0.9850 | **0.9810** | 0.8120 | 0.9160 |
| [colqwen3.5-4.5B-v3](https://huggingface.co/athrael-soju/colqwen3.5-4.5B-v3) | 0.9190 | 0.6660 | 0.9360 | 0.9020 | **1.0000** | 0.9710 | 0.9730 | 0.9890 | 0.9590 | **0.8400** | 0.9150 |
| **[webAI-Vault1-9b](TODO)** | **0.9413** | **0.6882** | **0.9505** | 0.8758 | 0.9963 | 0.9739 | 0.9839 | 0.9926 | 0.9460 | 0.7956 | 0.9144 |
| [Ops-Colqwen3-4B](https://huggingface.co/OpenSearch-AI/Ops-Colqwen3-4B) | 0.9180 | 0.6650 | 0.9400 | 0.9080 | 0.9960 | 0.9730 | 0.9800 | 0.9960 | 0.9360 | 0.8240 | 0.9140 |
| **[SauerkrautLM-ColQwen3-8b-v0.1](https://huggingface.co/VAGOsolutions/SauerkrautLM-ColQwen3-8b-v0.1)** | 0.9380 | 0.6470 | 0.9450 | 0.9040 | 0.9860 | 0.9650 | 0.9680 | 0.9930 | 0.9220 | 0.8400 | 0.9110 |
| **[webAI-Vault1-4b](TODO)** | 0.9258 | 0.6773 | 0.9412 | 0.8764 | **1.0000** | 0.9703 | 0.9721 | **1.0000** | 0.9414 | 0.7950 | 0.9100 |

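Both tables report NDCG, which rewards relevant documents ranked near the top and discounts hits logarithmically by rank. A small self-contained sketch of the standard NDCG@k formula (generic metric code, not the ViDoRe evaluation harness):

```python
import math

def ndcg_at_k(relevances: list[int], k: int) -> float:
    """NDCG@k for a ranked list of graded relevances (standard log2 discount)."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# One query with a single relevant document: rank 1 gives a perfect score,
print(ndcg_at_k([1, 0, 0], 10))  # 1.0
# rank 2 is discounted by 1/log2(3).
print(round(ndcg_at_k([0, 1, 0], 10), 4))  # 0.6309
```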
---

```python
import requests
from io import BytesIO
from PIL import Image

# Sample Data
queries = [
    "Retrieve the city of Singapore",
    "Retrieve the city of Beijing",
]
docs = [
    "https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG",
]

def load_image(url: str) -> Image.Image:
    # Fetch the image bytes over HTTP and decode them with PIL
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return Image.open(BytesIO(resp.content)).convert("RGB")
```

If you use this model, please cite:

```
  year={2026},
  url={https://huggingface.co/webAI-Official/webAI-ColVec1-4b}
}
```