VultronRetrieverFlash-Qwen3.5-0.8B

Sub-2B #1 on ViDoRe V3, at 0.8B and 320 dimensions.

VultronRetrieverFlash is the small tier of the VultronRetriever family, a late-interaction (ColBERT-style) retriever that scores document pages directly from their rendered image: layout, tables, charts and text, across six languages. At 0.8B it leads the sub-2B field on ViDoRe V3 by ~7 points and outscores retrievers several times its size.

The family has three tiers on one 320-dim recipe: the 8B Prime for maximum accuracy, the 4.5B Core for the accuracy/footprint mid-point, and the 0.8B Flash (this model) for latency- and footprint-sensitive serving. Trained and evaluated on Vultr Cloud.

Highlights

  • Sub-2B #1 on ViDoRe V3: 56.49 nDCG@10 at 0.8B, ~7 points clear of the next sub-2B model.
  • Punches above its size: outscores retrievers 3–5× its parameter count on V2 and V3.
  • Official MTEB: V1 88.15, V2 60.36, V3 56.49.
  • 0.8B parameters, 320-dim, ≈1.6 GB in bf16; runs on a single modest GPU.
  • Six languages (en, fr, de, es, it, pt).

ViDoRe leaderboard (ranked by V3)

Ranked by ViDoRe V3 (mean nDCG@10), the headline benchmark; V1 and V2 are shown alongside. Our three tiers are marked "(ours)" or "(this model)"; models that have not reported V3 are listed last.

Model Params Dim V1 V2 V3
VultronRetrieverPrime-Qwen3.5-8B (ours) 8.4B 320 92.08 68.18 64.72
webAI-Official/webAI-ColVec1-9b 9.4B 2560 91.30 65.82 64.45
VultronRetrieverCore-Qwen3.5-4.5B (ours) 4.5B 320 92.21 66.12 63.72
nvidia/nemotron-colembed-vl-8b-v2 8.7B 4096 92.65 65.16 63.54
webAI-Official/webAI-ColVec1-4b 4.5B 640 90.49 63.60 63.39
TomoroAI/tomoro-colqwen3-embed-8b 8.0B 320 90.76 65.40 61.60
athrael-soju/colqwen3.5-4.5B-v3 4.6B 128 91.54 64.25 61.56
nvidia/nemotron-colembed-vl-4b-v2 4.8B 2560 91.62 64.49 61.42
OpenSearch-AI/Ops-Colqwen3-4B 4.8B 2560 91.36 68.66 61.27
TomoroAI/tomoro-colqwen3-embed-4b 4.0B 320 90.57 64.69 60.16
nvidia/llama-nemotron-colembed-vl-3b-v2 4.4B 3072 91.74 63.38 59.70
VAGOsolutions/SauerkrautLM-ColQwen3-8b-v0.1 8.1B 128 91.08 62.47 58.55
nomic-ai/colnomic-embed-multimodal-7b 7.0B 128 89.72 60.25 57.64
jinaai/jina-embeddings-v4 3.9B 2048 90.35 58.23 57.54
nvidia/llama-nemoretriever-colembed-3b-v1 4.4B 3072 91.00 63.32 57.07
VultronRetrieverFlash-Qwen3.5-0.8B (this model) 0.85B 320 88.15 60.36 56.49
nomic-ai/colnomic-embed-multimodal-3b 3.0B 128 89.86 55.68 56.40
VAGOsolutions/SauerkrautLM-ColQwen3-4b-v0.1 4.4B 128 90.80 59.89 56.03
Verm1ion/ColTurk-VDR-Qwen3VL-4B-v1.0 4.5B 128 - - 55.82
nvidia/llama-nemoretriever-colembed-1b-v1 2.4B 2048 90.50 62.96 55.48
vidore/colqwen2.5-v0.2 3.0B 128 89.54 60.06 52.44
DataScience-UIBK/Argus-Colqwen3.5-9b-v0 8.8B 1024 92.67 69.27 -
DataScience-UIBK/Argus-Colqwen3.5-4b-v0 4.7B 1024 92.30 64.18 -

V1/V2: full mean nDCG@5. V3: mean nDCG@10 over the 8 public ViDoRe V3 tasks (the 2 private tasks, Nuclear and Telecom, are pending for all our tiers). Competitor figures: public MTEB ViDoRe leaderboard, 2026-06-21 snapshot; "-" = not reported on that benchmark.

At 0.8B, Flash leads the sub-2B field by ~7 points and outscores several retrievers three to five times its size on both V2 and V3, placing it among models in the 3.9–9B range.

Official MTEB results (dim 320 / visual tokens 1792)

Benchmark Metric Tasks Score
ViDoRe V1 ndcg@5 10 0.8815
ViDoRe V2 ndcg@5 4 0.6036
ViDoRe V3 ndcg@10 8 0.5649

Per-task JSONs are in eval_results/. Measured with the official MTEB late-interaction evaluator.
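
For readers unfamiliar with the metric, nDCG@k can be sketched as below. This is a minimal illustration that assumes linear gains and a log2 rank discount; the function names dcg_at_k and ndcg_at_k are hypothetical, and the official MTEB evaluator remains the reference for the numbers reported here.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k):
    """nDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking.

    ranked_relevances: relevance grades in the order the retriever returned pages.
    all_relevances: grades of every judged page for the query (any order).
    """
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# One relevant page, retrieved at rank 3 instead of rank 1.
score = ndcg_at_k([0, 0, 1, 0, 0], [1, 0, 0, 0, 0], k=5)
print(round(score, 4))  # 0.5
```

A per-task score would then be the mean of this value over the task's queries, and the table figures are means over tasks.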

Why 320 dimensions

Late-interaction index size, memory footprint, and MaxSim scoring cost all scale with the embedding dimension. Flash is trained with Matryoshka representation learning and operated at 320-dim, where it leads the sub-2B tier on ViDoRe V3. At the same number of token vectors, a 320-dim index is a fraction of the size of the index a 2048- to 4096-dim retriever carries, with proportionally lower storage, RAM, and query-time compute at serving scale.
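
The scaling argument can be made concrete with a back-of-the-envelope calculation. The sketch below assumes uncompressed 16-bit values and a page that uses the full 1792 visual-token budget; the helper name index_bytes_per_page is hypothetical, and real indexes add overhead or compression that is not modeled here.

```python
def index_bytes_per_page(num_tokens: int, dim: int, bytes_per_value: int = 2) -> int:
    """Raw storage for one page's multi-vector embedding (no compression, no index overhead)."""
    return num_tokens * dim * bytes_per_value

TOKENS_PER_PAGE = 1792  # assumption: a page that uses the full visual-token budget

for dim in (320, 2048, 4096):
    mib = index_bytes_per_page(TOKENS_PER_PAGE, dim) / 2**20
    print(f"dim={dim:>4}: {mib:6.2f} MiB per page")

# Ratio depends only on the dimensions when the token count is held fixed.
ratio = index_bytes_per_page(TOKENS_PER_PAGE, 4096) / index_bytes_per_page(TOKENS_PER_PAGE, 320)
print(f"4096-dim vs 320-dim: {ratio:.1f}x")
```

With the token count held fixed, the ratio is simply 4096 / 320 = 12.8. Other retrievers may also emit a different number of tokens per page, which this comparison ignores.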

Intended use

  • Visual document retrieval / multimodal RAG over PDFs, scans, slides and reports, including pages with layout, tables, charts and figures.
  • Multilingual document collections (en, fr, de, es, it, pt).
  • Latency- and footprint-sensitive serving where a 2–4B retriever is too large: the 0.8B tier of the family.

Out of scope: text-only semantic search, where a single-vector dense embedder is cheaper, and generative QA (this is a retriever; pair it with a reader/LLM).

Method

Per-token MaxSim scoring captures fine-grained matches against tables, figures, and layout that a single-vector embedder averages away.

  • Base: Qwen/Qwen3.5-0.8B (hybrid GatedDeltaNet + full-attention backbone).
  • Late-interaction retriever (ColQwen3_5): 320-dim multi-vector embeddings, MaxSim scoring, image + text inputs.
  • Size: 0.8B parameters, the generative head dropped for retrieval.
  • Matryoshka representation learning during training; the shipped checkpoint is the native 320-dim operating point.
  • Model merging: several independently-seeded checkpoints merged per-block into one full-weight checkpoint.
  • Trained at up to 1280 visual tokens, evaluated and deployed at 1792.

Training data

An enhanced, multilingual mixture of public and synthetic visual-document retrieval sources, spanning en, es, de, fr, it and pt, decontaminated against all three ViDoRe suites (V1/V2/V3): 0% measured overlap with the evaluation benchmarks. The training recipe and the assembled training dataset are not distributed in this repository.

Inputs and outputs

  • Input: document-page images (RGB) and/or text queries; pages encode at up to 1792 visual tokens.
  • Output: multi-vector embeddings, one 320-dim vector per token (not a single pooled vector).
  • Scoring: late-interaction MaxSim between query-token and page-token vectors, via score_multi_vector.
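
The scoring rule can be sketched in plain Python. This is an illustration of MaxSim on toy 3-dim vectors, not the implementation of score_multi_vector; the names l2_normalize and maxsim are hypothetical, and per-token L2 normalization is assumed here.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def maxsim(query_vecs, page_vecs):
    """Late-interaction score: for each query token take its best-matching
    page token (max dot product), then sum those maxima over query tokens."""
    q = [l2_normalize(v) for v in query_vecs]
    p = [l2_normalize(v) for v in page_vecs]
    return sum(max(sum(a * b for a, b in zip(qv, pv)) for pv in p) for qv in q)

# Toy 3-dim embeddings standing in for the model's 320-dim token vectors.
query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
page_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # matches both query tokens
page_b = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]                    # matches only the first

print(maxsim(query, page_a), maxsim(query, page_b))  # 2.0 1.0
```

Because each query token picks its own best page token, a page can score well by matching different query terms in different regions, such as a table cell and a chart label.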

Requirements

The Qwen3.5 hybrid (GatedDeltaNet + full-attention) backbone has hard runtime kernel dependencies that a vanilla ColQwen / PaliGemma model does not:

pip install "git+https://github.com/illuin-tech/colpali@2e0b927051af727238783af039dcc2c50a4d8c27"
pip install causal-conv1d flash-linear-attention

  • causal-conv1d + flash-linear-attention are required (the hybrid layers import them at runtime).
  • Attention must be SDPA. Retrieval runs bidirectional attention on the full-attention layers; flash_attention_2 silently ignores the 2-D mask and scores as if causal.
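
A quick pre-flight check can catch a missing kernel package before the model load fails. The sketch below assumes the import names causal_conv1d and fla for the two pip packages; treat those names, and the helper missing_kernel_packages, as assumptions to verify against your installed versions.

```python
import importlib.util

# Assumption: these are the import names of the two pip packages above
# (causal-conv1d -> causal_conv1d, flash-linear-attention -> fla).
REQUIRED_MODULES = {
    "causal_conv1d": "causal-conv1d",
    "fla": "flash-linear-attention",
}

def missing_kernel_packages(required=REQUIRED_MODULES):
    """Return the pip names of required packages that cannot be found."""
    return [pip_name for module, pip_name in required.items()
            if importlib.util.find_spec(module) is None]

missing = missing_kernel_packages()
if missing:
    print("Missing kernel packages:", ", ".join(missing))
else:
    print("All kernel packages found.")
```

The check only confirms that the packages are installed; it does not verify that their compiled kernels match the local CUDA and PyTorch build.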

Usage

import torch
from PIL import Image
from colpali_engine.models import ColQwen3_5, ColQwen3_5Processor

model = ColQwen3_5.from_pretrained(
    "vultr/VultronRetrieverFlash-Qwen3.5-0.8B",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",   # required (see above)
    device_map="cuda:0",
).eval()
processor = ColQwen3_5Processor.from_pretrained(
    "vultr/VultronRetrieverFlash-Qwen3.5-0.8B",
    max_num_visual_tokens=1792,
)

# Document pages (rendered to images) and text queries
images = [Image.open("page_0.png"), Image.open("page_1.png")]
queries = ["What was Q3 revenue?", "Summarize the safety findings."]

with torch.no_grad():
    doc_emb = model(**processor.process_images(images).to(model.device))
    qry_emb = model(**processor.process_queries(queries).to(model.device))

# Late-interaction MaxSim scoring (feed fp32 to match the eval discipline)
scores = processor.score_multi_vector(qry_emb.float(), doc_emb.float())
# scores[i, j] = relevance of query i to page j
print(scores.shape)  # torch.Size([2, 2])

config.json carries dim=320, so custom_text_proj is sized correctly at load, with no manual config edits needed.
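
To turn the score matrix into ranked results, each row can be sorted per query. The helper below is a minimal sketch on plain Python lists; top_k_pages and the example scores are hypothetical, and with the model loaded the input would be scores.tolist() from the usage example above.

```python
def top_k_pages(score_rows, k=3):
    """For each query, return (page_index, score) pairs sorted by descending score."""
    ranked = []
    for row in score_rows:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranked.append([(j, row[j]) for j in order[:k]])
    return ranked

# Hypothetical scores for 2 queries x 3 pages (not real model output).
example_scores = [[11.2, 14.7, 9.8],
                  [13.1, 8.4, 12.9]]

for query_index, hits in enumerate(top_k_pages(example_scores, k=2)):
    print(query_index, hits)
```

MaxSim scores are sums over query tokens, so they are comparable across pages for one query but not across queries of different lengths.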

Serving with vLLM

vLLM serves this model natively through its pooling runner (the ColQwen3_5 architecture), returning the per-token multi-vectors for late-interaction scoring. Three conditions apply:

  • vLLM build: use a build that includes the ColQwen3.5 retrieval-correctness fix (vllm-project/vllm#46108, merged 2026-06-22): build from main, or use a release tagged after that date. The fix runs the backbone bidirectionally and restores the projection bias, so vLLM reproduces the transformers reference within run-to-run noise.
  • Prompt contract: the server uses the stock chat/image processor, so the ColQwen3.5 prompt contract is applied client-side: wrap each page image in the instruction template, append the query-augmentation tokens to each query, and set the visual-token budget through mm-processor-kwargs.
  • Scheduling: prefix caching and chunked prefill must be off, because bidirectional attention and the GatedDeltaNet hybrid both break the causal-prefix invariant.

import torch
from PIL import Image
from vllm import LLM

MODEL = "vultr/VultronRetrieverFlash-Qwen3.5-0.8B"
MAX_PIXELS = 1792 * 32 * 32   # max_num_visual_tokens * (patch_size 16 * merge_size 2)^2

llm = LLM(
    model=MODEL,
    runner="pooling",
    dtype="bfloat16",
    enable_prefix_caching=False,
    enable_chunked_prefill=False,
    mm_processor_kwargs={"min_pixels": 65536, "max_pixels": MAX_PIXELS},
)

# ColQwen3.5 processor contract, applied client-side:
IMAGE_PROMPT = ("<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
                "Describe the image.<|im_end|><|endoftext|>")

def query_prompt(q):
    # Query augmentation: append 10 end-of-text tokens to the query
    return q + "<|endoftext|>" * 10

images = [Image.open("page_0.png"), Image.open("page_1.png")]
queries = ["What was Q3 revenue?", "Summarize the safety findings."]

doc_out = llm.encode(
    [{"prompt": IMAGE_PROMPT, "multi_modal_data": {"image": im}} for im in images],
    pooling_task="token_embed",
)
qry_out = llm.encode([query_prompt(q) for q in queries], pooling_task="token_embed")

def mv(o):
    # One [num_tokens, 320] multi-vector per item, L2-normalized per token
    t = torch.as_tensor(o.outputs.data, dtype=torch.float32)
    return torch.nn.functional.normalize(t, p=2, dim=-1)

docs = [mv(o) for o in doc_out]
qrys = [mv(o) for o in qry_out]

# Late-interaction MaxSim: per query token take the best doc token, then sum
scores = [[(q @ d.T).max(dim=-1).values.sum().item() for d in docs] for q in qrys]
print(scores)  # scores[i][j] = relevance of query i to page j

To serve over HTTP instead:

vllm serve vultr/VultronRetrieverFlash-Qwen3.5-0.8B \
  --runner pooling \
  --no-enable-prefix-caching --no-enable-chunked-prefill \
  --mm-processor-kwargs '{"min_pixels": 65536, "max_pixels": 1835008}'

Apply the same image template and query augmentation in your client requests. See the upstream example examples/pooling/score/colqwen3_5_rerank_online.py for the full online rerank flow.
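
The pixel values passed through mm-processor-kwargs follow from the visual-token budget. The sketch below reproduces that arithmetic using the patch size (16) and merge size (2) stated in the vLLM example above; the helper names pixels_for_tokens and tokens_for_pixels are hypothetical.

```python
import json

PATCH_SIZE = 16  # from the comment in the vLLM example above
MERGE_SIZE = 2   # from the comment in the vLLM example above
PIXELS_PER_TOKEN = (PATCH_SIZE * MERGE_SIZE) ** 2  # each visual token covers a 32x32 pixel square

def pixels_for_tokens(num_visual_tokens: int) -> int:
    """Pixel budget that corresponds to a visual-token budget."""
    return num_visual_tokens * PIXELS_PER_TOKEN

def tokens_for_pixels(num_pixels: int) -> int:
    """Visual tokens implied by a pixel budget (rounded down)."""
    return num_pixels // PIXELS_PER_TOKEN

print(pixels_for_tokens(1792))   # 1835008, the max_pixels value
print(tokens_for_pixels(65536))  # 64, the token count implied by min_pixels

# JSON string in the shape expected by --mm-processor-kwargs
print(json.dumps({"min_pixels": 65536, "max_pixels": pixels_for_tokens(1792)}))
```

Deriving max_pixels from the token budget keeps the Python and CLI configurations in sync if the budget is changed.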

Limitations

  • ViDoRe V3 figures cover 8 of the 10 V3 domains; Telecom and Nuclear have not been evaluated yet.
  • Tuned for six languages (en, fr, de, es, it, pt); other languages are out of distribution.
  • Late-interaction multi-vector indexes are larger than single-vector dense indexes; that is the trade for per-token sensitivity to layout, tables, and figures. At 320-dim, this model's index is small for its class.
  • This is the small tier; for maximum ViDoRe V3 accuracy use the 8B flagship VultronRetrieverPrime-Qwen3.5-8B.

License

Apache 2.0, covering the contents of this repository: model weights, config, and evaluation results. Built on Qwen/Qwen3.5-0.8B (Apache 2.0); the upstream license and attribution are retained. The training recipe and the assembled training dataset are not distributed in this repository.

Citation

@misc{vultronretrieverflash2026,
  title  = {VultronRetrieverFlash-Qwen3.5-0.8B: Small-Tier Late-Interaction Visual Document Retrieval at 320 Dimensions},
  author = {Georgiou, Athos (athrael-soju)},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/vultr/VultronRetrieverFlash-Qwen3.5-0.8B}}
}
