VultronRetrieverFlash-Qwen3.5-0.8B

Sub-2B #1 on ViDoRe V3, at 0.8B and 320 dimensions.

VultronRetrieverFlash is the small tier of the VultronRetriever family, a late-interaction (ColBERT-style) retriever that scores document pages directly from their rendered image: layout, tables, charts and text, across six languages. At 0.8B it leads the sub-2B field on ViDoRe V3 by ~7 points and outscores retrievers several times its size.

The family has three tiers on one 320-dim recipe: the 8B Prime for maximum accuracy, the 4.5B Core for the accuracy/footprint mid-point, and the 0.8B Flash (this model) for latency- and footprint-sensitive serving. Trained and evaluated on Vultr Cloud.

Highlights

Sub-2B #1 on ViDoRe V3: 56.49 nDCG@10 at 0.8B, ~7 points clear of the next sub-2B model.
Punches above its size: outscores retrievers 3–5× its parameter count on V2 and V3.
Official MTEB: V1 88.15, V2 60.36, V3 56.49.
0.8B parameters, 320-dim, ≈1.6 GB bf16, runs on a single modest GPU.
Six languages (en, fr, de, es, it, pt).

ViDoRe leaderboard (ranked by V3)

Ranked by ViDoRe V3 (mean nDCG@10), the headline benchmark; V1 and V2 are shown alongside. Our three tiers are marked "(ours)" or "(this model)"; models that have not reported V3 are listed last.

Model	Params	Dim	V1	V2	V3
VultronRetrieverPrime-Qwen3.5-8B (ours)	8.4B	320	92.08	68.18	64.72
webAI-Official/webAI-ColVec1-9b	9.4B	2560	91.30	65.82	64.45
VultronRetrieverCore-Qwen3.5-4.5B (ours)	4.5B	320	92.21	66.12	63.72
nvidia/nemotron-colembed-vl-8b-v2	8.7B	4096	92.65	65.16	63.54
webAI-Official/webAI-ColVec1-4b	4.5B	640	90.49	63.60	63.39
TomoroAI/tomoro-colqwen3-embed-8b	8.0B	320	90.76	65.40	61.60
athrael-soju/colqwen3.5-4.5B-v3	4.6B	128	91.54	64.25	61.56
nvidia/nemotron-colembed-vl-4b-v2	4.8B	2560	91.62	64.49	61.42
OpenSearch-AI/Ops-Colqwen3-4B	4.8B	2560	91.36	68.66	61.27
TomoroAI/tomoro-colqwen3-embed-4b	4.0B	320	90.57	64.69	60.16
nvidia/llama-nemotron-colembed-vl-3b-v2	4.4B	3072	91.74	63.38	59.70
VAGOsolutions/SauerkrautLM-ColQwen3-8b-v0.1	8.1B	128	91.08	62.47	58.55
nomic-ai/colnomic-embed-multimodal-7b	7.0B	128	89.72	60.25	57.64
jinaai/jina-embeddings-v4	3.9B	2048	90.35	58.23	57.54
nvidia/llama-nemoretriever-colembed-3b-v1	4.4B	3072	91.00	63.32	57.07
VultronRetrieverFlash-Qwen3.5-0.8B (this model)	0.85B	320	88.15	60.36	56.49
nomic-ai/colnomic-embed-multimodal-3b	3.0B	128	89.86	55.68	56.40
VAGOsolutions/SauerkrautLM-ColQwen3-4b-v0.1	4.4B	128	90.80	59.89	56.03
Verm1ion/ColTurk-VDR-Qwen3VL-4B-v1.0	4.5B	128	—	—	55.82
nvidia/llama-nemoretriever-colembed-1b-v1	2.4B	2048	90.50	62.96	55.48
vidore/colqwen2.5-v0.2	3.0B	128	89.54	60.06	52.44
DataScience-UIBK/Argus-Colqwen3.5-9b-v0	8.8B	1024	92.67	69.27	—
DataScience-UIBK/Argus-Colqwen3.5-4b-v0	4.7B	1024	92.30	64.18	—

V1/V2: full mean nDCG@5. V3: mean nDCG@10 over the 8 public ViDoRe V3 tasks (the 2 private tasks, Nuclear and Telecom, are pending for all our tiers). Competitor figures: public MTEB ViDoRe leaderboard, 2026-06-21 snapshot; "—" = not reported on that benchmark.

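At 0.8B, Flash leads the sub-2B field by ~7 points on ViDoRe V3. On both V2 and V3 it outscores nomic-ai/colnomic-embed-multimodal-3b (3.0B), VAGOsolutions/SauerkrautLM-ColQwen3-4b-v0.1 (4.4B) and vidore/colqwen2.5-v0.2 (3.0B), retrievers three to five times its size, and it sits among models in the 3.9–9B range.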

Official MTEB results (dim 320 / visual tokens 1792)

Benchmark	Metric	Tasks	Score
ViDoRe V1	ndcg@5	10	0.8815
ViDoRe V2	ndcg@5	4	0.6036
ViDoRe V3	ndcg@10	8	0.5649

Per-task JSONs are in eval_results/. Measured with the official MTEB late-interaction evaluator.
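For readers unfamiliar with the metric, the following is a minimal sketch of nDCG@k for a single query, assuming linear gain and a log2 rank discount. The function names and the toy relevance labels are illustrative assumptions; the scores reported above come from the official MTEB evaluator, whose implementation details may differ from this sketch.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    Simplification: the ideal ranking is built from the labels in the list
    itself, so every relevant page is assumed to appear somewhere in it.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical labels: relevance of the pages a retriever returned, in rank order.
ranked = [0, 1, 0, 1, 0]
print(round(ndcg_at_k(ranked, k=5), 4))
```

A benchmark score is then the mean of this per-query value over all queries in a task, averaged again over tasks.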

Why 320 dimensions

Late-interaction index size, memory footprint, and MaxSim scoring cost all scale with the embedding dimension. Flash is trained with Matryoshka representation learning and operated at 320-dim, where it leads the sub-2B tier on ViDoRe V3. At the same token count and precision, a 320-dim index is roughly 6× to 13× smaller than the index a 2048- to 4096-dim retriever carries, with proportionally lower storage, RAM, and query-time compute at serving scale.
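The scaling can be made concrete with a back-of-the-envelope calculation. The sketch below assumes uncompressed 2-byte (fp16/bf16) storage and that every page uses the full 1792-token budget; both are assumptions for illustration, and real indexes add overhead or apply compression. The helper name is hypothetical.

```python
def index_bytes_per_page(num_tokens, dim, bytes_per_value=2):
    """Raw storage for one page's multi-vector embedding (no compression, no index overhead)."""
    return num_tokens * dim * bytes_per_value

TOKENS_PER_PAGE = 1792  # assumed upper bound: the deployed visual-token budget

for dim in (320, 2048, 4096):
    mib = index_bytes_per_page(TOKENS_PER_PAGE, dim) / 2**20
    ratio = dim / 320
    print(f"dim={dim:>4}: {mib:6.2f} MiB per page ({ratio:.1f}x the 320-dim footprint)")
```

Because storage is linear in the dimension, the ratio between two retrievers depends only on their dimensions as long as token count and precision are held equal.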

Intended use

Visual document retrieval / multimodal RAG over PDFs, scans, slides and reports, including pages with layout, tables, charts and figures.
Multilingual document collections (en, fr, de, es, it, pt).
Latency- and footprint-sensitive serving where a 2–4B retriever is too large; Flash is the 0.8B tier of the family.

Out of scope: text-only semantic search, where a single-vector dense embedder is cheaper; generative QA (this is a retriever; pair it with a reader/LLM).

Method

Per-token MaxSim scoring captures fine-grained matches against tables, figures, and layout that a single-vector embedder averages away.

Base: Qwen/Qwen3.5-0.8B (hybrid GatedDeltaNet + full-attention backbone).
Late-interaction retriever (ColQwen3_5): 320-dim multi-vector embeddings, MaxSim scoring, image + text inputs.
Size: 0.8B parameters, with the generative head dropped for retrieval.
Matryoshka representation learning during training; the shipped checkpoint is the native 320-dim operating point.
Model merging: several independently seeded checkpoints merged per-block into one full-weight checkpoint.
Trained at up to 1280 visual tokens; evaluated and deployed at 1792.
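The visual-token budgets above map to pixel budgets through the patch geometry stated in the vLLM example on this card (patch_size 16 and merge_size 2, so one visual token covers 32 × 32 pixels). The sketch below shows that arithmetic. The helper names are hypothetical, and the token estimate is an approximation that ignores any per-side rounding or minimum-size constraint the real processor applies.

```python
PIXELS_PER_TOKEN = (16 * 2) ** 2  # (patch_size 16 * merge_size 2)^2 = 1024

def max_pixels_for(budget_tokens):
    """Pixel budget that corresponds to a visual-token budget."""
    return budget_tokens * PIXELS_PER_TOKEN

def approx_visual_tokens(width, height, budget_tokens=1792):
    """Rough token count for a page once its area is capped at the budget."""
    capped_pixels = min(width * height, max_pixels_for(budget_tokens))
    return capped_pixels // PIXELS_PER_TOKEN

print(max_pixels_for(1792))              # 1835008 (deployment budget)
print(max_pixels_for(1280))              # 1310720 (training budget)
print(approx_visual_tokens(640, 480))    # 300: small page, under the budget
print(approx_visual_tokens(1654, 2339))  # 1792: large page, capped at the budget
```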

Training data

An enhanced, multilingual mixture of public and synthetic visual-document retrieval sources, spanning en, es, de, fr, it and pt, decontaminated against all three ViDoRe suites (V1/V2/V3): 0% measured overlap with the evaluation benchmarks. The training recipe and the assembled training dataset are not distributed in this repository.

Inputs and outputs

Input: document-page images (RGB) and/or text queries; pages encode at up to 1792 visual tokens.
Output: multi-vector embeddings, one 320-dim vector per token (not a single pooled vector).
Scoring: late-interaction MaxSim between query-token and page-token vectors, via score_multi_vector.
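To illustrate what MaxSim computes, here is a minimal pure-Python sketch on toy 2-dim vectors. It is not the library implementation: the function names and the toy vectors are assumptions, and in practice score_multi_vector performs the equivalent computation in batched tensor form.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_vecs, page_vecs):
    """Late-interaction score: for each query token take its best-matching
    page token (max dot product), then sum over query tokens."""
    return sum(max(dot(q, d) for d in page_vecs) for q in query_vecs)

# Toy 2-dim unit vectors (hypothetical; real embeddings are 320-dim).
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0]]  # has a match for both query tokens
page_b = [[1.0, 0.0]]              # has a match for the first query token only

print(maxsim(query, page_a))  # 2.0
print(maxsim(query, page_b))  # 1.0
```

Each query token is matched independently, which is why a page that covers every part of the query scores higher than one that covers only some of it.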

Requirements

The Qwen3.5 hybrid (GatedDeltaNet + full-attention) backbone has hard runtime kernel dependencies that a vanilla ColQwen / PaliGemma model does not have:

pip install "git+https://github.com/illuin-tech/colpali@2e0b927051af727238783af039dcc2c50a4d8c27"
pip install causal-conv1d flash-linear-attention

causal-conv1d + flash-linear-attention are required (the hybrid layers import them at runtime).
Attention must be SDPA. Retrieval runs bidirectional attention on the full-attention layers; flash_attention_2 silently ignores the 2-D mask and scores as if causal.
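Since the kernels are only imported at runtime, a missing package may not surface until the first forward pass. The pre-flight check below is a small sketch using only the standard library. The import names are assumptions: causal-conv1d is assumed to install as causal_conv1d and flash-linear-attention as fla, while colpali_engine matches the import used in the usage example.

```python
import importlib.util

# pip package name -> import module name (module names are assumptions, see above)
REQUIRED_MODULES = {
    "causal-conv1d": "causal_conv1d",
    "flash-linear-attention": "fla",
    "colpali-engine": "colpali_engine",
}

def missing_requirements(modules=REQUIRED_MODULES):
    """Return the pip package names whose import module cannot be found."""
    return [pkg for pkg, mod in modules.items()
            if importlib.util.find_spec(mod) is None]

missing = missing_requirements()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

This only confirms that the modules can be located; it does not verify that the compiled kernels match the installed CUDA and PyTorch versions.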

Usage

import torch
from PIL import Image
from colpali_engine.models import ColQwen3_5, ColQwen3_5Processor

model = ColQwen3_5.from_pretrained(
    "vultr/VultronRetrieverFlash-Qwen3.5-0.8B",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",   # required (see above)
    device_map="cuda:0",
).eval()
processor = ColQwen3_5Processor.from_pretrained(
    "vultr/VultronRetrieverFlash-Qwen3.5-0.8B",
    max_num_visual_tokens=1792,
)

# Document pages (rendered to images) and text queries
images = [Image.open("page_0.png"), Image.open("page_1.png")]
queries = ["What was Q3 revenue?", "Summarize the safety findings."]

with torch.no_grad():
    doc_emb = model(**processor.process_images(images).to(model.device))
    qry_emb = model(**processor.process_queries(queries).to(model.device))

# Late-interaction MaxSim scoring (feed fp32 to match the eval discipline)
scores = processor.score_multi_vector(qry_emb.float(), doc_emb.float())
# scores[i, j] = relevance of query i to page j
print(scores.shape)  # torch.Size([2, 2])

config.json carries dim=320, so custom_text_proj is sized correctly at load, with no manual config edits needed.
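Once a score matrix is available, retrieval reduces to sorting each query's row. The helper below is a hypothetical sketch in plain Python; it assumes the scores have been converted to nested lists (for a torch tensor, scores.tolist()), and the example values are made up.

```python
def top_k_pages(scores, k=1):
    """For each query row, return (page_index, score) pairs sorted best-first."""
    ranked = []
    for row in scores:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranked.append([(j, row[j]) for j in order[:k]])
    return ranked

# Hypothetical 2x2 score matrix: rows are queries, columns are pages.
example_scores = [[14.2, 9.7], [8.1, 12.5]]
print(top_k_pages(example_scores, k=1))  # [[(0, 14.2)], [(1, 12.5)]]
```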

Serving with vLLM

vLLM serves this model natively through its pooling runner (the ColQwen3_5 architecture), returning the per-token multi-vectors for late-interaction scoring. It requires a vLLM build that includes the ColQwen3.5 retrieval-correctness fix (vllm-project/vllm#46108, merged 2026-06-22): build from main, or use a release tagged after that date. The fix runs the backbone bidirectionally and restores the projection bias, so vLLM reproduces the transformers reference within run-to-run noise.

The server uses the stock chat/image processor, so the ColQwen3.5 prompt contract is applied client-side: wrap each page image in the instruction template, append the query-augmentation tokens to each query, and set the visual-token budget through mm-processor-kwargs.

Prefix caching and chunked prefill must be off (bidirectional attention and the GatedDeltaNet hybrid both break the causal-prefix invariant).

import torch
from PIL import Image
from vllm import LLM

MODEL = "vultr/VultronRetrieverFlash-Qwen3.5-0.8B"
MAX_PIXELS = 1792 * 32 * 32   # max_num_visual_tokens * (patch_size 16 * merge_size 2)^2

llm = LLM(
    model=MODEL,
    runner="pooling",
    dtype="bfloat16",
    enable_prefix_caching=False,
    enable_chunked_prefill=False,
    mm_processor_kwargs={"min_pixels": 65536, "max_pixels": MAX_PIXELS},
)

# ColQwen3.5 processor contract, applied client-side:
IMAGE_PROMPT = ("<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
                "Describe the image.<|im_end|><|endoftext|>")

def query_prompt(q):
    # query augmentation: append 10 <|endoftext|> tokens
    return q + "<|endoftext|>" * 10

images = [Image.open("page_0.png"), Image.open("page_1.png")]
queries = ["What was Q3 revenue?", "Summarize the safety findings."]

doc_out = llm.encode(
    [{"prompt": IMAGE_PROMPT, "multi_modal_data": {"image": im}} for im in images],
    pooling_task="token_embed",
)
qry_out = llm.encode([query_prompt(q) for q in queries], pooling_task="token_embed")

def mv(o):
    # one [num_tokens, 320] multi-vector per item, L2-normalized per token
    t = torch.as_tensor(o.outputs.data, dtype=torch.float32)
    return torch.nn.functional.normalize(t, p=2, dim=-1)

docs, qrys = [mv(o) for o in doc_out], [mv(o) for o in qry_out]

# late-interaction MaxSim: per query token take the best doc token, then sum
scores = [[(q @ d.T).max(dim=-1).values.sum().item() for d in docs] for q in qrys]
print(scores)  # scores[i][j] = relevance of query i to page j

To serve over HTTP instead:

vllm serve vultr/VultronRetrieverFlash-Qwen3.5-0.8B \
  --runner pooling \
  --no-enable-prefix-caching --no-enable-chunked-prefill \
  --mm-processor-kwargs '{"min_pixels": 65536, "max_pixels": 1835008}'

Apply the same image template and query augmentation in your client requests. See the upstream example examples/pooling/score/colqwen3_5_rerank_online.py for the full online rerank flow.

Limitations

ViDoRe V3 figures cover 8 of the 10 V3 domains; Telecom and Nuclear have not been evaluated yet.
Tuned for six languages (en, fr, de, es, it, pt); other languages are out of distribution.
Late-interaction multi-vector indexes are larger than single-vector dense indexes; that is the cost of per-token sensitivity to layout, tables, and figures, although at 320-dim the index is small for its class.
This is the small tier; for maximum ViDoRe V3 accuracy use the 8B flagship VultronRetrieverPrime-Qwen3.5-8B.

License

Apache 2.0, covering the contents of this repository: model weights, config, and evaluation results. Built on Qwen/Qwen3.5-0.8B (Apache 2.0); the upstream license and attribution are retained. The training recipe and the assembled training dataset are not distributed in this repository.

Citation

@misc{vultronretrieverflash2026,
  title  = {VultronRetrieverFlash-Qwen3.5-0.8B: Small-Tier Late-Interaction Visual Document Retrieval at 320 Dimensions},
  author = {Georgiou, Athos (athrael-soju)},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/vultr/VultronRetrieverFlash-Qwen3.5-0.8B}}
}
