psamal committed · Commit 3767539 · verified · Parent: 9d5fbfe

README update

Files changed (1): README.md (+31 -22)
README.md CHANGED
@@ -17,36 +17,28 @@ tags:
  - colqwen3_5
  - multilingual-embedding
  ---

  # webAI-Official/webAI-ColVec1-9b

  ## ⚡ Summary

  **webAI-Official/webAI-ColVec1-9b** is a state-of-the-art [ColBERT](https://arxiv.org/abs/2407.01449)-style multimodal embedding model based on *[Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B)*. It maps text queries and visual documents (images, PDFs) into aligned multi-vector embeddings.

- The model has been fine-tuned on a **merged multimodal dataset** of ~2M question-image pairs, including:
-
- - [DocVQA](https://huggingface.co/datasets/lmms-lab/DocVQA)
- - [PubTables-1M](https://huggingface.co/datasets/bsmock/pubtables-1m)
- - [TAT-QA](https://huggingface.co/datasets/next-tat/TAT-QA)
- - [ViDoRe-ColPali-Training](https://huggingface.co/datasets/vidore/colpali_train_set)
- - [VDR Multilingual](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train)
- - [VisRAG-Ret-Train-In-domain-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-In-domain-data)
- - [VisRAG-Ret-Train-Synthetic-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-Synthetic-data)
- - Proprietary domain-specific synthetic data

  The datasets were filtered, balanced, and merged to produce a comprehensive training set optimized for multilingual, multimodal retrieval and document-image understanding. The model achieves **competitive performance across ViDoRe V1 & V3** (English and multilingual).

  ## 🛠️ Model Specifications

- | Feature | Detail |
- | --------------------- | -------------------------------------------------------------------------- |
  | **Architecture** | Qwen3.5-4B Vision-Language Model (VLM) + `2560 dim` Linear Projection Head |
- | **Methodology** | ColBERT-style Late Interaction (MaxSim scoring) |
  | **Output** | Multi-vector (Seq_Len × *2560*), L2-normalized |
- | **Modalities** | Text Queries, Images (Documents) |
- | **Training Strategy** | LoRA adapters + Fully-trained projection layer |
- | **Precision** | `bfloat16` weights, FlashAttention 2 enabled |

  ---
@@ -68,9 +60,28 @@ We report results on the **ViDoRe** benchmark suite. The tables below summarize

  ### ViDoRe V3 (NDCG@10)

  ### ViDoRe V1 (NDCG@5)

  ---
 
@@ -176,13 +187,11 @@ model = AutoModel.from_pretrained(
  # Sample Data
  queries = [
      "Retrieve the city of Singapore",
-     "Retrieve the city of Beijing",
-     "Retrieve the city of London",
  ]
  docs = [
      "https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
-     "https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG",
-     "https://upload.wikimedia.org/wikipedia/commons/4/49/London_skyline.jpg",
  ]

  def load_image(url: str) -> Image.Image:
@@ -249,7 +258,7 @@ print(scores)

  ### License & Data

- [LICENSE](https://huggingface.co/webAI-Official/webAI-ColVec1-4b/blob/main/LICENSE.md)

  ## 📚 Citation

@@ -262,4 +271,4 @@ If you use this model, please cite:
      year={2026},
      url={https://huggingface.co/webAI-Official/webAI-ColVec1-9b}
  }
- ```
 
@@ -17,36 +17,28 @@ tags:
  - colqwen3_5
  - multilingual-embedding
  ---
+
  # webAI-Official/webAI-ColVec1-9b

  ## ⚡ Summary

  **webAI-Official/webAI-ColVec1-9b** is a state-of-the-art [ColBERT](https://arxiv.org/abs/2407.01449)-style multimodal embedding model based on *[Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B)*. It maps text queries and visual documents (images, PDFs) into aligned multi-vector embeddings.

+ The model has been fine-tuned on a **merged multimodal dataset** of ~2M question-image pairs, including [DocVQA](https://huggingface.co/datasets/lmms-lab/DocVQA), [PubTables-1M](https://huggingface.co/datasets/bsmock/pubtables-1m), [TAT-QA](https://huggingface.co/datasets/next-tat/TAT-QA), [ViDoRe-ColPali-Training](https://huggingface.co/datasets/vidore/colpali_train_set), [VDR Multilingual](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train), [VisRAG-Ret-Train-In-domain-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-In-domain-data), [VisRAG-Ret-Train-Synthetic-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-Synthetic-data), and proprietary domain-specific synthetic data.

  The datasets were filtered, balanced, and merged to produce a comprehensive training set optimized for multilingual, multimodal retrieval and document-image understanding. The model achieves **competitive performance across ViDoRe V1 & V3** (English and multilingual).

  ## 🛠️ Model Specifications

+ | Feature | Detail |
+ | --------------------- | -------------------------------------------------------------------------- |
  | **Architecture** | Qwen3.5-4B Vision-Language Model (VLM) + `2560 dim` Linear Projection Head |
+ | **Methodology** | ColBERT-style Late Interaction (MaxSim scoring) |
  | **Output** | Multi-vector (Seq_Len × *2560*), L2-normalized |
+ | **Modalities** | Text Queries, Images (Documents) |
+ | **Training Strategy** | LoRA adapters + Fully-trained projection layer |
+ | **Precision** | `bfloat16` weights, FlashAttention 2 enabled |

  ---
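The **MaxSim** late interaction named in the specifications table can be sketched in plain Python. This is a simplified illustration, not the model's actual implementation: the real scoring runs batched over `Seq_Len × 2560` L2-normalized vectors, so the max-dot-product reduces to max cosine similarity.

```python
def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: each query token vector takes the
    dot product with its best-matching document token vector, and the
    per-token maxima are summed into one relevance score."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy 2-dim unit vectors: a perfectly matching document scores 2.0.
print(maxsim_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]))  # -> 2.0
```

Because each query token is matched independently, late interaction preserves token-level detail that single-vector pooling discards, at the cost of storing one vector per document token.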
 
@@ -68,9 +60,28 @@ We report results on the **ViDoRe** benchmark suite. The tables below summarize

  ### ViDoRe V3 (NDCG@10)

+ | Model | CompSci | Energy | FinanceEn | FinanceFr | HR | Industrial | Pharma | Physics | **Avg (Public)** |
+ | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
+ | **[webAI-Vault1-9b](https://huggingface.co/webAI-Official/webAI-ColVec1-9b)** | **0.8092** | 0.6976 | 0.6827 | **0.5372** | **0.7004** | **0.5718** | **0.6732** | 0.4838 | **0.6445** |
+ | [nemotron-colembed-vl-8b-v2](https://huggingface.co/nvidia/nemotron-colembed-vl-8b-v2) | 0.7929 | **0.6982** | 0.6729 | 0.5154 | 0.6632 | 0.5603 | 0.6719 | **0.5084** | 0.6354 |
+ | **[webAI-Vault1-4b](https://huggingface.co/webAI-Official/webAI-ColVec1-4b)** | 0.7983 | 0.6869 | **0.6848** | 0.5111 | 0.6739 | 0.5573 | 0.6567 | 0.5014 | 0.6338 |
+ | [tomoro-colqwen3-embed-8b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-8b) | 0.7535 | 0.6841 | 0.6508 | 0.4910 | 0.6398 | 0.5441 | 0.6636 | 0.5013 | 0.6160 |
+ | [colqwen3.5-4.5B-v3](https://huggingface.co/athrael-soju/colqwen3.5-4.5B-v3) | 0.7866 | 0.6804 | 0.6406 | 0.4856 | 0.6206 | 0.5520 | 0.6559 | 0.5034 | 0.6156 |

  ### ViDoRe V1 (NDCG@5)

+ | Model | ArxivQA | DocVQA | InfoVQA | Shift | Syn-AI | Syn-Eng | Syn-Gov | Syn-Health | TabFQuAD | Tatdqa | **Avg** |
+ | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
+ | [nemotron-colembed-vl-8b-v2](https://huggingface.co/nvidia/nemotron-colembed-vl-8b-v2) | 0.9310 | 0.6810 | 0.9460 | **0.9330** | **1.0000** | 0.9790 | **0.9890** | 0.9960 | 0.9770 | 0.8340 | **0.9270** |
+ | [llama-nemotron-colembed-vl-3b-v2](https://huggingface.co/nvidia/llama-nemotron-colembed-vl-3b-v2) | 0.9040 | 0.6720 | 0.9470 | 0.9200 | **1.0000** | **0.9800** | 0.9800 | 0.9890 | 0.9730 | 0.8100 | 0.9170 |
+ | [nemotron-colembed-vl-4b-v2](https://huggingface.co/nvidia/nemotron-colembed-vl-4b-v2) | 0.9200 | 0.6740 | 0.9330 | 0.9230 | 0.9930 | 0.9620 | 0.9800 | 0.9850 | **0.9810** | 0.8120 | 0.9160 |
+ | [colqwen3.5-4.5B-v3](https://huggingface.co/athrael-soju/colqwen3.5-4.5B-v3) | 0.9190 | 0.6660 | 0.9360 | 0.9020 | **1.0000** | 0.9710 | 0.9730 | 0.9890 | 0.9590 | **0.8400** | 0.9150 |
+ | **[webAI-Vault1-9b](TODO)** | **0.9413** | **0.6882** | **0.9505** | 0.8758 | 0.9963 | 0.9739 | 0.9839 | 0.9926 | 0.9460 | 0.7956 | 0.9144 |
+ | [Ops-Colqwen3-4B](https://huggingface.co/OpenSearch-AI/Ops-Colqwen3-4B) | 0.9180 | 0.6650 | 0.9400 | 0.9080 | 0.9960 | 0.9730 | 0.9800 | 0.9960 | 0.9360 | 0.8240 | 0.9140 |
+ | **[SauerkrautLM-ColQwen3-8b-v0.1](https://huggingface.co/VAGOsolutions/SauerkrautLM-ColQwen3-8b-v0.1)** | 0.9380 | 0.6470 | 0.9450 | 0.9040 | 0.9860 | 0.9650 | 0.9680 | 0.9930 | 0.9220 | 0.8400 | 0.9110 |
+ | **[webAI-Vault1-4b](TODO)** | 0.9258 | 0.6773 | 0.9412 | 0.8764 | **1.0000** | 0.9703 | 0.9721 | **1.0000** | 0.9414 | 0.7950 | 0.9100 |

  ---
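For reference, the NDCG@k metric reported in both tables can be sketched as follows. This is a minimal illustration over graded relevance labels for the retrieved list; benchmark harnesses compute the ideal DCG over the full judged pool rather than only the retrieved documents.

```python
import math

def ndcg_at_k(relevances, k):
    """relevances: relevance grades of retrieved docs, in ranked order.
    DCG discounts each gain by log2(rank + 1); NDCG divides by the DCG
    of the ideal (descending) ordering, giving a score in [0, 1]."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([1, 0, 0], 3))  # relevant doc ranked first -> 1.0
```

The log discount is why NDCG rewards placing relevant documents near the top of the ranking, not merely retrieving them within the cutoff.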
 
 
  # Sample Data
  queries = [
      "Retrieve the city of Singapore",
+     "Retrieve the city of Beijing"
  ]
  docs = [
      "https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
+     "https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG"
  ]

  def load_image(url: str) -> Image.Image:
 
  ### License & Data

+ [LICENSE](https://huggingface.co/webAI-Official/webAI-ColVec1-9b/blob/main/LICENSE.md)

  ## 📚 Citation

      year={2026},
      url={https://huggingface.co/webAI-Official/webAI-ColVec1-9b}
  }
+ ```