ColTurk-VDR-Qwen3VL-4B v1.0

ColBERT-style late-interaction visual document retriever built on Qwen/Qwen3-VL-4B-Instruct with the colpali-engine ColQwen3 architecture (transformers v5 native). Pages are embedded as multi-vector 128-dim patch/token embeddings; queries and documents are scored with MaxSim.
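
As a rough illustration of MaxSim scoring, the sketch below uses plain Python lists and toy 2-dim vectors instead of the model's 128-dim embeddings. The function name `maxsim` and the toy data are illustrative assumptions, not part of colpali-engine.

```python
def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: for each query vector, take the best-matching
    document vector (max dot product), then sum over all query vectors."""
    total = 0.0
    for q in query_vecs:
        total += max(sum(qi * di for qi, di in zip(q, d)) for d in doc_vecs)
    return total

# Toy 2-dim embeddings (hypothetical data; the real model emits 128 dims).
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
page_b = [[0.1, 0.1], [0.3, 0.2]]

score_a = maxsim(query, page_a)  # 0.9 + 0.8
score_b = maxsim(query, page_b)  # 0.3 + 0.2
```

Because each query token picks its own best patch, a page can score highly even when the relevant content is spread across different regions of the image.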

This repository contains the merged full model (LoRA weights baked into the base): it loads directly with ColQwen3.from_pretrained, with no PEFT step and no adapter key-prefix fragility across transformers versions. The original LoRA adapter is preserved under adapter/ for reproducibility.

Results: ViDoRe V3 (8 public subtasks)

Evaluated on the full corpus with all queries per subtask (no sampling), MaxSim scoring, processor-default visual tokens, seeded bootstrap 95% CI. Raw JSONs: eval/results/.
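
A seeded bootstrap confidence interval over per-query scores could be computed along these lines. This is a minimal sketch assuming a percentile bootstrap; the evaluation script may use a different variant, and the function name, defaults, and toy scores are hypothetical.

```python
import random

def bootstrap_ci(per_query_scores, n_resamples=1000, seed=0, alpha=0.05):
    """Percentile bootstrap CI for the mean of per-query scores."""
    rng = random.Random(seed)  # fixed seed makes the interval reproducible
    n = len(per_query_scores)
    means = []
    for _ in range(n_resamples):
        # Resample queries with replacement and record the resampled mean.
        sample = [per_query_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi

# Hypothetical per-query NDCG@10 values, for illustration only.
toy_scores = [0.0, 1.0, 0.5, 0.75, 0.25, 1.0, 0.0, 0.5] * 5
ci_low, ci_high = bootstrap_ci(toy_scores, seed=1)
```

Resampling is done over queries, so subtasks with more queries tend to get narrower intervals.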

Mean NDCG@10 = 0.5584 · NDCG@5 = 0.5287 · recall@10 = 0.6110

Subtask                           NDCG@10   95% CI           n_queries   n_corpus
Vidore3ComputerScienceRetrieval   0.7306    [0.718, 0.743]   1290        1360
Vidore3EnergyRetrieval            0.6238    [0.608, 0.638]   1848        2225
Vidore3PharmaceuticalsRetrieval   0.6156    [0.602, 0.629]   2184        2313
Vidore3FinanceEnRetrieval         0.5851    [0.571, 0.601]   1854        2942
Vidore3HrRetrieval                0.5463    [0.532, 0.560]   1908        1110
Vidore3IndustrialRetrieval        0.4624    [0.445, 0.482]   1698        5244
Vidore3PhysicsRetrieval           0.4564    [0.443, 0.471]   1812        1674
Vidore3FinanceFrRetrieval         0.4467    [0.430, 0.463]   1920        2384
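
For readers unfamiliar with the metric, here is a small sketch of NDCG@k with linear gain, followed by the macro-average over the eight subtask scores listed above. The linear-gain formulation and the helper name `ndcg_at_k` are assumptions; the evaluation code may differ in details such as tie handling.

```python
import math

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """ranked_relevances: relevance grades of retrieved pages, in rank order.
    all_relevances: grades of every relevant page for the query (ideal list)."""
    def dcg(rels):
        # Rank i (0-based) is discounted by log2(i + 2).
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(all_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# One relevant page, retrieved at rank 2: score is 1 / log2(3).
example = ndcg_at_k([0, 1, 0], [1])

# Macro-average of the per-subtask NDCG@10 values from the table.
subtask_ndcg10 = [0.7306, 0.6238, 0.6156, 0.5851, 0.5463, 0.4624, 0.4564, 0.4467]
mean_ndcg10 = sum(subtask_ndcg10) / len(subtask_ndcg10)
```

The unweighted mean of the eight rows reproduces the reported 0.5584, which suggests the headline number is a macro-average over subtasks rather than over queries.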

Usage

import torch
from colpali_engine.models import ColQwen3, ColQwen3Processor

model_id = "Verm1ion/ColTurk-VDR-Qwen3VL-4B-v1.0"
model = ColQwen3.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda:0",
    attn_implementation="sdpa",
).eval()
processor = ColQwen3Processor.from_pretrained(model_id)

# documents: list[PIL.Image] of page images; queries: list[str]
doc_batch = processor.process_images(documents).to(model.device)
qry_batch = processor.process_queries(queries).to(model.device)
with torch.no_grad():
    doc_emb = model(**doc_batch)
    qry_emb = model(**qry_batch)
scores = processor.score_multi_vector(qry_emb, doc_emb)   # (n_queries, n_docs)

Requirements: colpali-engine>=0.3.16, transformers>=5.0, torch>=2.5.
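
Once the score matrix is available, turning it into a ranked page list per query is a small step. The helper below is a dependency-free sketch that assumes the scores were first converted to nested lists (for a torch tensor, `scores.tolist()`); the name `top_k_pages` is hypothetical.

```python
def top_k_pages(scores, k=5):
    """scores: nested list of shape (n_queries, n_docs).
    Returns, per query, the indices of the k highest-scoring pages."""
    ranked = []
    for row in scores:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranked.append(order[:k])
    return ranked

# Toy score matrix: 2 queries x 3 pages (illustrative values).
toy = [[0.1, 0.9, 0.5], [2.0, 1.0, 3.0]]
top2 = top_k_pages(toy, k=2)
```

When staying in torch, `scores.topk(k, dim=1).indices` gives the same ordering without leaving the device.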

Training

Base        Qwen/Qwen3-VL-4B-Instruct (raw, no warm start)
Method      LoRA r=32, α=32, dropout 0.1 on language-model proj layers; custom_text_proj head fully trained
Data        manu/colpali EN+FR, 108K query–page pairs, 2 mined hard negatives per query (K=2)
Loss        ColBERT pairwise negative CE (in-batch + explicit negatives)
Schedule    LR 5e-5, linear decay, warmup 10, effective batch 32, bf16, gradient checkpointing, max_num_visual_tokens=768 (training)
Hardware    single A100 80GB
Selection   eval-gated checkpoint curve on the full benchmark: step 500 → 0.5441, step 1000 → 0.5584 (peak, released), step 1500 → 0.5518 (overfit onset)
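
The loss entry above combines in-batch and explicit negatives. A minimal sketch of one common pairwise formulation, softplus(negative score - positive score), is given below. Taking the hardest in-batch negative, averaging over the mined negatives, and weighting the two terms equally are assumptions made for illustration, not a statement of the exact training code.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def pairwise_negative_ce(scores, explicit_neg_scores):
    """scores[i][j]: MaxSim of query i against the positive page of query j
    (needs a batch of at least 2). explicit_neg_scores[i]: MaxSim of query i
    against each of its mined hard negatives."""
    n = len(scores)
    in_batch, explicit = 0.0, 0.0
    for i in range(n):
        pos = scores[i][i]
        # Hardest in-batch negative: best-scoring page belonging to another query.
        hardest = max(scores[i][j] for j in range(n) if j != i)
        in_batch += softplus(hardest - pos)
        negs = explicit_neg_scores[i]
        explicit += sum(softplus(neg - pos) for neg in negs) / len(negs)
    return (in_batch + explicit) / (2 * n)

# Toy batch of 2 queries with K=2 mined negatives each (hypothetical scores).
well_separated = pairwise_negative_ce([[10.0, 0.0], [0.0, 10.0]], [[0.0, 0.0], [0.0, 0.0]])
confused = pairwise_negative_ce([[0.0, 10.0], [10.0, 0.0]], [[10.0, 10.0], [10.0, 10.0]])
```

The loss approaches zero when every positive outscores its negatives by a wide margin and grows roughly linearly with the margin violation otherwise.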

Measured negative results (transparency)

Each candidate improvement was evaluated on the full benchmark and dropped on evidence: more negatives (K=4: −0.016, worse on 8/8 subtasks), two-run weight averaging (−0.006, zero synergy across LoRA inits), train-matched visual-token cap at eval (−0.017; uncapped inference is better). Full validity report (causal control, leakage tripwires, pHash contamination scan, bootstrap CIs): STAGE1_VALIDITY_REPORT.md.

Evaluation protocol & reproduction

git clone https://github.com/Verm1lion/ColTurk-VDR
cd ColTurk-VDR
python scripts/eval/eval_colturk_checkpoint.py \
    --adapter Verm1ion/ColTurk-VDR-Qwen3VL-4B-v1.0 \
    --bootstrap 1000 --output eval/results/repro.json

Environment pins and seeds: REPRODUCIBILITY.md. Training data ↔ benchmark contamination was checked empirically (perceptual-hash scan over train images × V3 corpora: 0 exact duplicates, 0.025% at the document true-duplicate bar, visually inspected); details are in the validity report.
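
To make the idea of a hash-based contamination scan concrete, the sketch below uses a difference hash (dHash) and Hamming distance. This is a simpler stand-in chosen for brevity, not the pHash pipeline from the validity report; the function names, the 8x9 grid input, and the distance threshold are all hypothetical.

```python
def dhash(gray):
    """gray: 2D list of grayscale values for an image already downscaled to
    8 rows x 9 columns. Each bit records whether a pixel is brighter than
    its right-hand neighbour, giving a 64-bit integer hash."""
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

def hamming(a, b):
    # Number of differing bits between two hashes.
    return bin(a ^ b).count("1")

def flag_near_duplicates(train_hashes, corpus_hashes, max_distance=0):
    """Return (train_index, corpus_index) pairs within max_distance bits."""
    return [(i, j)
            for i, h in enumerate(train_hashes)
            for j, g in enumerate(corpus_hashes)
            if hamming(h, g) <= max_distance]

# Two synthetic 8x9 grids: a left-to-right brightening ramp and its mirror.
ramp_up = [list(range(9)) for _ in range(8)]
ramp_down = [list(range(8, -1, -1)) for _ in range(8)]
h_up, h_down = dhash(ramp_up), dhash(ramp_down)
pairs = flag_near_duplicates([h_up], [h_up, h_down])
```

A threshold of 0 flags only exact hash matches; raising it trades more candidate pairs for more manual inspection.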

Limitations

  • Trained on 108K EN+FR pairs (single-GPU budget), well below the multi-million-pair data scale of the top ViDoRe V3 entries; scores reflect that gap honestly.
  • English and French document domains only in v1.0; Turkish document support is the next planned stage.
  • Retrieval-only model: no reranking, no generation.

Citation

@misc{karatay2026colturkvdr,
  author = {Karatay, Mert},
  title  = {ColTurk-VDR: A Late-Interaction Visual Document Retriever on Qwen3-VL-4B},
  year   = {2026},
  url    = {https://github.com/Verm1lion/ColTurk-VDR}
}