ColTurk-VDR-Qwen3VL-4B v1.0
ColBERT-style late-interaction visual document retriever built on Qwen/Qwen3-VL-4B-Instruct with the colpali-engine ColQwen3 architecture (transformers v5 native). Pages are embedded as multi-vector 128-dim patch/token embeddings; queries and documents are scored with MaxSim.
This repository contains the merged full model (LoRA weights baked into the base). It loads directly with ColQwen3.from_pretrained, with no PEFT step and no adapter key-prefix fragility across transformers versions. The original LoRA adapter is preserved under adapter/ for reproducibility.
- Developed by: Mert Karatay (merttkaratayy@gmail.com)
- Model type: multi-vector late-interaction visual retriever (ColBERT/MaxSim)
- Languages: English + French (training data); query side inherits Qwen3-VL multilinguality
- License: Apache-2.0 (inherited from the base model; training code MIT)
- Repository / eval code: https://github.com/Verm1lion/ColTurk-VDR
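The MaxSim scoring mentioned above can be illustrated with a dependency-free sketch. It assumes plain Python lists of vectors and toy 2-dim embeddings instead of the model's 128-dim ones; the function name `maxsim` is illustrative and is not part of colpali-engine.

```python
def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: for every query vector, take its best dot
    product against any document vector, then sum those maxima."""
    total = 0.0
    for q in query_vecs:
        total += max(sum(qi * di for qi, di in zip(q, d)) for d in doc_vecs)
    return total


# Toy embeddings (2-dim for readability; the model emits 128-dim vectors).
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[0.5, 0.5], [-1.0, 0.0]]

score_a = maxsim(query, page_a)  # 1.0 + 1.0 = 2.0
score_b = maxsim(query, page_b)  # 0.5 + 0.5 = 1.0
```

Because each query token picks its own best-matching patch, a page scores highly only if it covers every part of the query somewhere, which is why `page_a` outranks `page_b` here.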
Results: ViDoRe V3 (8 public subtasks)
Evaluated on the full corpus with all queries per subtask (no sampling), MaxSim scoring, processor-default visual tokens, seeded bootstrap 95% CI. Raw JSONs: eval/results/.
Mean NDCG@10 = 0.5584 · NDCG@5 = 0.5287 · recall@10 = 0.6110
| Subtask | NDCG@10 | 95% CI | n_queries | n_corpus |
|---|---|---|---|---|
| Vidore3ComputerScienceRetrieval | 0.7306 | [0.718, 0.743] | 1290 | 1360 |
| Vidore3EnergyRetrieval | 0.6238 | [0.608, 0.638] | 1848 | 2225 |
| Vidore3PharmaceuticalsRetrieval | 0.6156 | [0.602, 0.629] | 2184 | 2313 |
| Vidore3FinanceEnRetrieval | 0.5851 | [0.571, 0.601] | 1854 | 2942 |
| Vidore3HrRetrieval | 0.5463 | [0.532, 0.560] | 1908 | 1110 |
| Vidore3IndustrialRetrieval | 0.4624 | [0.445, 0.482] | 1698 | 5244 |
| Vidore3PhysicsRetrieval | 0.4564 | [0.443, 0.471] | 1812 | 1674 |
| Vidore3FinanceFrRetrieval | 0.4467 | [0.430, 0.463] | 1920 | 2384 |
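As a rough illustration of how the summary numbers relate to the table, the sketch below recomputes the mean NDCG@10 as an unweighted average of the eight subtask scores and shows one common way to obtain a seeded bootstrap 95% CI. The percentile method and resampling over per-query scores are assumptions here; the repository's evaluation script may differ in detail, and `bootstrap_ci` is a hypothetical helper name.

```python
import random
import statistics

# NDCG@10 per subtask, copied from the table above.
subtask_ndcg10 = {
    "Vidore3ComputerScienceRetrieval": 0.7306,
    "Vidore3EnergyRetrieval": 0.6238,
    "Vidore3PharmaceuticalsRetrieval": 0.6156,
    "Vidore3FinanceEnRetrieval": 0.5851,
    "Vidore3HrRetrieval": 0.5463,
    "Vidore3IndustrialRetrieval": 0.4624,
    "Vidore3PhysicsRetrieval": 0.4564,
    "Vidore3FinanceFrRetrieval": 0.4467,
}
macro_mean = statistics.fmean(subtask_ndcg10.values())  # rounds to 0.5584


def bootstrap_ci(per_query_scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI of the mean (assumed method, not confirmed).

    Resamples queries with replacement using a fixed seed so the interval
    is reproducible across runs.
    """
    rng = random.Random(seed)
    n = len(per_query_scores)
    means = sorted(
        statistics.fmean(rng.choices(per_query_scores, k=n))
        for _ in range(n_resamples)
    )
    k = round(alpha / 2 * n_resamples)  # number of resamples cut per tail
    return means[k], means[n_resamples - 1 - k]


# Toy per-query scores; real inputs would be per-query NDCG@10 values.
toy_scores = [0.0, 1.0] * 50
low, high = bootstrap_ci(toy_scores)
```

Fixing the seed makes the interval deterministic, so two runs over the same per-query scores report identical bounds.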
Usage
import torch
from colpali_engine.models import ColQwen3, ColQwen3Processor

model_id = "Verm1ion/ColTurk-VDR-Qwen3VL-4B-v1.0"
model = ColQwen3.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda:0",
    attn_implementation="sdpa",
).eval()
processor = ColQwen3Processor.from_pretrained(model_id)

# documents: list[PIL.Image] of page images; queries: list[str]
doc_batch = processor.process_images(documents).to(model.device)
qry_batch = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    doc_emb = model(**doc_batch)
    qry_emb = model(**qry_batch)

scores = processor.score_multi_vector(qry_emb, doc_emb)  # (n_queries, n_docs)
Requirements: colpali-engine>=0.3.16, transformers>=5.0, torch>=2.5.
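Once the `(n_queries, n_docs)` score matrix is available, ranking pages per query is a simple sort. The sketch below assumes the matrix has been converted to nested Python lists (for example via `scores.tolist()`); `top_k_pages` is a hypothetical helper, not part of colpali-engine.

```python
def top_k_pages(scores, k=5):
    """Return, for each query row, the k best (doc_index, score) pairs
    sorted by descending score."""
    ranked = []
    for row in scores:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranked.append([(j, row[j]) for j in order[:k]])
    return ranked


# Toy matrix: 2 queries x 3 pages.
toy_scores = [
    [12.5, 30.1, 7.2],
    [9.0, 8.5, 21.3],
]
hits = top_k_pages(toy_scores, k=2)
# hits[0] == [(1, 30.1), (0, 12.5)]
# hits[1] == [(2, 21.3), (0, 9.0)]
```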
Training
| Setting | Value |
|---|---|
| Base | Qwen/Qwen3-VL-4B-Instruct (raw, no warm start) |
| Method | LoRA r=32, α=32, dropout 0.1 on language-model proj layers; custom_text_proj head fully trained |
| Data | manu/colpali EN+FR, 108K query-page pairs, 2 mined hard negatives per query (K=2) |
| Loss | ColBERT pairwise negative CE (in-batch + explicit negatives) |
| Schedule | LR 5e-5, linear decay, warmup 10, effective batch 32, bf16, gradient checkpointing, max_num_visual_tokens=768 (training) |
| Hardware | single A100 80GB |
| Selection | eval-gated checkpoint curve on the full benchmark: step 500 → 0.5441, step 1000 → 0.5584 (peak, released), step 1500 → 0.5518 (overfit onset) |
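To make the loss row more concrete, here is a minimal sketch of a pairwise cross-entropy objective over MaxSim scores. It assumes one simple formulation: for each query, the hardest negative among the in-batch pages and the mined hard negatives is compared with the positive through a softplus. The actual training code may weight or combine the in-batch and explicit terms differently, and `pairwise_ce` is a hypothetical name.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_ce(scores, explicit_neg_scores):
    """Mean softplus(hardest_negative - positive) over the batch.

    scores[i][j]: MaxSim of query i against the positive page of query j,
        so the diagonal holds the positive scores.
    explicit_neg_scores[i]: MaxSim of query i against its mined hard
        negatives (K=2 per query in the configuration above).
    """
    losses = []
    for i, row in enumerate(scores):
        positive = row[i]
        in_batch = [s for j, s in enumerate(row) if j != i]
        hardest = max(in_batch + list(explicit_neg_scores[i]))
        losses.append(softplus(hardest - positive))
    return sum(losses) / len(losses)


# Toy batch of 2 queries with 2 mined negatives each.
well_separated = pairwise_ce(
    [[5.0, 1.0], [0.5, 4.0]],
    [[2.0, 3.0], [1.0, 3.5]],
)
confused = pairwise_ce(
    [[1.0, 5.0], [4.0, 0.5]],
    [[2.0, 3.0], [1.0, 3.5]],
)
```

The loss shrinks toward zero as positives pull ahead of their hardest negative and grows roughly linearly when a negative outscores the positive, as in the `confused` batch.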
Measured negative results (transparency)
Each candidate improvement was evaluated on the full benchmark and dropped on evidence: more negatives (K=4: −0.016, worse on 8/8 subtasks), two-run weight averaging (−0.006, zero synergy across LoRA inits), train-matched visual-token cap at eval (−0.017; uncapped inference is better). Full validity report (causal control, leakage tripwires, pHash contamination scan, bootstrap CIs): STAGE1_VALIDITY_REPORT.md.
Evaluation protocol & reproduction
git clone https://github.com/Verm1lion/ColTurk-VDR
cd ColTurk-VDR
python scripts/eval/eval_colturk_checkpoint.py \
    --adapter Verm1ion/ColTurk-VDR-Qwen3VL-4B-v1.0 \
    --bootstrap 1000 --output eval/results/repro.json
Environment pins and seeds: REPRODUCIBILITY.md. Contamination between the training data and the benchmark was checked empirically (perceptual-hash scan over train images × V3 corpora: 0 exact duplicates, 0.025% at the document true-duplicate bar, visually inspected); details are in the validity report.
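The idea behind a hash-based contamination scan can be sketched as follows. This example deliberately swaps in an average hash, a simpler relative of pHash (which uses a DCT), and operates on tiny grayscale grids given as nested lists. The function names and the `max_distance` threshold are illustrative assumptions and do not reproduce the scan in the validity report.

```python
def average_hash(gray):
    """Hash a small grayscale grid: one bit per pixel, set when the pixel
    is brighter than the grid mean. Real scans downscale each page first."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def near_duplicates(train_hashes, corpus_hashes, max_distance=0):
    """Index pairs (train_idx, corpus_idx) whose hashes differ by at most
    max_distance bits; 0 means exact hash matches only."""
    return [
        (i, j)
        for i, h in enumerate(train_hashes)
        for j, g in enumerate(corpus_hashes)
        if hamming(h, g) <= max_distance
    ]


# Toy 2x2 "pages": the first two are near-identical, the third is inverted.
train_page = [[10, 200], [10, 200]]
corpus_same = [[12, 198], [9, 201]]
corpus_other = [[200, 10], [200, 10]]

train_hashes = [average_hash(train_page)]
corpus_hashes = [average_hash(corpus_same), average_hash(corpus_other)]
matches = near_duplicates(train_hashes, corpus_hashes)  # [(0, 0)]
```

Small pixel-level differences leave the hash unchanged, so re-encoded copies of the same page still collide, while a visually different page lands far away in Hamming distance.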
Limitations
- Trained on 108K EN+FR pairs (single-GPU budget), well below the multi-million-pair data scale of the top ViDoRe V3 entries; the reported scores reflect that gap.
- English and French document domains only in v1.0; Turkish document support is the next planned stage.
- Retrieval-only model: no reranking, no generation.
Citation
@misc{karatay2026colturkvdr,
author = {Karatay, Mert},
title = {ColTurk-VDR: A Late-Interaction Visual Document Retriever on Qwen3-VL-4B},
year = {2026},
url = {https://github.com/Verm1lion/ColTurk-VDR}
}