VultronRetrieverFlash-Qwen3.5-0.8B
Sub-2B #1 on ViDoRe V3, at 0.8B and 320 dimensions.
VultronRetrieverFlash is the small tier of the VultronRetriever family, a late-interaction (ColBERT-style) retriever that scores document pages directly from their rendered image: layout, tables, charts and text, across six languages. At 0.8B it leads the sub-2B field on ViDoRe V3 by ~7 points and outscores retrievers several times its size.
The family has three tiers on one 320-dim recipe: the 8B Prime for maximum accuracy, the 4.5B Core for the accuracy/footprint mid-point, and the 0.8B Flash (this model) for latency- and footprint-sensitive serving. Trained and evaluated on Vultr Cloud.
Highlights
- Sub-2B #1 on ViDoRe V3: 56.49 nDCG@10 at 0.8B, ~7 points clear of the next sub-2B model.
- Punches above its size: outscores retrievers 3–5× its parameter count on V2 and V3.
- Official MTEB: V1 88.15, V2 60.36, V3 56.49.
- 0.8B parameters, 320-dim, ≈1.6 GB bf16, runs on a single modest GPU.
- Six languages (en, fr, de, es, it, pt).
ViDoRe leaderboard (ranked by V3)
Ranked by ViDoRe V3 (mean nDCG@10), the headline benchmark; V1 and V2 are shown alongside. Our three tiers are marked "(ours)" or "(this model)"; models that have not reported V3 are listed last.
| Model | Params | Dim | V1 | V2 | V3 |
|---|---|---|---|---|---|
| VultronRetrieverPrime-Qwen3.5-8B (ours) | 8.4B | 320 | 92.08 | 68.18 | 64.72 |
| webAI-Official/webAI-ColVec1-9b | 9.4B | 2560 | 91.30 | 65.82 | 64.45 |
| VultronRetrieverCore-Qwen3.5-4.5B (ours) | 4.5B | 320 | 92.21 | 66.12 | 63.72 |
| nvidia/nemotron-colembed-vl-8b-v2 | 8.7B | 4096 | 92.65 | 65.16 | 63.54 |
| webAI-Official/webAI-ColVec1-4b | 4.5B | 640 | 90.49 | 63.60 | 63.39 |
| TomoroAI/tomoro-colqwen3-embed-8b | 8.0B | 320 | 90.76 | 65.40 | 61.60 |
| athrael-soju/colqwen3.5-4.5B-v3 | 4.6B | 128 | 91.54 | 64.25 | 61.56 |
| nvidia/nemotron-colembed-vl-4b-v2 | 4.8B | 2560 | 91.62 | 64.49 | 61.42 |
| OpenSearch-AI/Ops-Colqwen3-4B | 4.8B | 2560 | 91.36 | 68.66 | 61.27 |
| TomoroAI/tomoro-colqwen3-embed-4b | 4.0B | 320 | 90.57 | 64.69 | 60.16 |
| nvidia/llama-nemotron-colembed-vl-3b-v2 | 4.4B | 3072 | 91.74 | 63.38 | 59.70 |
| VAGOsolutions/SauerkrautLM-ColQwen3-8b-v0.1 | 8.1B | 128 | 91.08 | 62.47 | 58.55 |
| nomic-ai/colnomic-embed-multimodal-7b | 7.0B | 128 | 89.72 | 60.25 | 57.64 |
| jinaai/jina-embeddings-v4 | 3.9B | 2048 | 90.35 | 58.23 | 57.54 |
| nvidia/llama-nemoretriever-colembed-3b-v1 | 4.4B | 3072 | 91.00 | 63.32 | 57.07 |
| VultronRetrieverFlash-Qwen3.5-0.8B (this model) | 0.85B | 320 | 88.15 | 60.36 | 56.49 |
| nomic-ai/colnomic-embed-multimodal-3b | 3.0B | 128 | 89.86 | 55.68 | 56.40 |
| VAGOsolutions/SauerkrautLM-ColQwen3-4b-v0.1 | 4.4B | 128 | 90.80 | 59.89 | 56.03 |
| Verm1ion/ColTurk-VDR-Qwen3VL-4B-v1.0 | 4.5B | 128 | – | – | 55.82 |
| nvidia/llama-nemoretriever-colembed-1b-v1 | 2.4B | 2048 | 90.50 | 62.96 | 55.48 |
| vidore/colqwen2.5-v0.2 | 3.0B | 128 | 89.54 | 60.06 | 52.44 |
| DataScience-UIBK/Argus-Colqwen3.5-9b-v0 | 8.8B | 1024 | 92.67 | 69.27 | – |
| DataScience-UIBK/Argus-Colqwen3.5-4b-v0 | 4.7B | 1024 | 92.30 | 64.18 | – |
V1/V2: full mean nDCG@5. V3: mean nDCG@10 over the 8 public ViDoRe V3 tasks (the 2 private tasks, Nuclear and Telecom, are pending for all our tiers). Competitor figures: public MTEB ViDoRe leaderboard, 2026-06-21 snapshot; "–" = not reported on that benchmark.
At 0.8B, Flash leads the sub-2B field by ~7 points and outscores several retrievers three to five times its size on both V2 and V3, sitting among models that run 3.9–9B.
Official MTEB results (dim 320 / visual tokens 1792)
| Benchmark | Metric | Tasks | Score |
|---|---|---|---|
| ViDoRe V1 | ndcg@5 | 10 | 0.8815 |
| ViDoRe V2 | ndcg@5 | 4 | 0.6036 |
| ViDoRe V3 | ndcg@10 | 8 | 0.5649 |
Per-task JSONs are in eval_results/. Measured with the official MTEB late-interaction evaluator.
Why 320 dimensions
Late-interaction index size, memory footprint, and MaxSim scoring cost all scale with the embedding dimension. Flash is trained with Matryoshka representation learning and operated at 320-dim, where it leads the sub-2B tier on ViDoRe V3: a 320-dim index is a fraction of the size a 2048-4096-dim retriever carries, for proportionally lower storage, RAM, and query-time compute at serving scale.
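To make the scaling concrete, here is a back-of-the-envelope sketch of per-page index size. It assumes every page uses the full 1792-token budget and that vectors are stored uncompressed as 2-byte floats; both are illustrative assumptions, and other retrievers may encode a different number of tokens per page. The helper name `page_index_bytes` is hypothetical.

```python
# Rough per-page footprint of a late-interaction (multi-vector) index.
# Assumptions (not from the model card): each page fills the 1792-token
# budget and vectors are stored uncompressed at 2 bytes per value.

def page_index_bytes(tokens_per_page: int, dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to store one page's token vectors."""
    return tokens_per_page * dim * bytes_per_value

TOKENS_PER_PAGE = 1792  # visual-token budget quoted in this card

for dim in (320, 2048, 4096):
    size = page_index_bytes(TOKENS_PER_PAGE, dim)
    ratio = size / page_index_bytes(TOKENS_PER_PAGE, 320)
    print(f"dim={dim:>4}: {size / 2**20:6.2f} MiB per page ({ratio:.1f}x the 320-dim index)")
```

Under these assumptions a 320-dim page costs about 1.1 MiB, against 7 MiB at 2048-dim and 14 MiB at 4096-dim. MaxSim compute scales by the same factor because each dot product is over `dim` values.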
Intended use
- Visual document retrieval / multimodal RAG over PDFs, scans, slides and reports, including pages with layout, tables, charts and figures.
- Multilingual document collections (en, fr, de, es, it, pt).
- Latency- and footprint-sensitive serving where a 2–4B retriever is too large: the 0.8B tier of the family.
Out of scope: text-only semantic search, where a single-vector dense embedder is cheaper; generative QA (this is a retriever; pair it with a reader/LLM).
Method
Per-token MaxSim scoring captures fine-grained matches against tables, figures, and layout that a single-vector embedder averages away.
- Base: Qwen/Qwen3.5-0.8B (hybrid GatedDeltaNet + full-attention backbone).
- Late-interaction retriever (ColQwen3_5): 320-dim multi-vector embeddings, MaxSim scoring, image + text inputs.
- Size: 0.8B parameters, the generative head dropped for retrieval.
- Matryoshka representation learning during training; the shipped checkpoint is the native 320-dim operating point.
- Model merging: several independently-seeded checkpoints merged per-block into one full-weight checkpoint.
- Trained at up to 1280 visual tokens, evaluated and deployed at 1792.
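As an illustration of MaxSim scoring, the following minimal sketch computes it in plain Python on toy 2-dim vectors. It is not the library implementation (that is `score_multi_vector`); the function names and the vectors are made up for the example, and real embeddings are 320-dim and L2-normalized per token.

```python
from typing import Sequence

Vector = Sequence[float]

def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """For each query token take its best-matching page token, then sum."""
    return sum(max(dot(q, d) for d in page_vecs) for q in query_vecs)

# Toy unit vectors (illustrative values, not real model output).
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # contains an exact match for each query token
page_b = [[0.6, 0.8], [0.8, 0.6]]              # only partial matches

print(maxsim(query, page_a))  # 2.0
print(maxsim(query, page_b))  # 1.6
```

Page A wins because each query token finds its own best page token. A single pooled vector per page would blend those tokens together before comparison.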
Training data
An enhanced, multilingual mixture of public and synthetic visual-document retrieval sources, spanning en, es, de, fr, it and pt, decontaminated against all three ViDoRe suites (V1/V2/V3): 0% measured overlap with the evaluation benchmarks. The training recipe and the assembled training dataset are not distributed in this repository.
Inputs and outputs
- Input: document-page images (RGB) and/or text queries; pages encode at up to 1792 visual tokens.
- Output: multi-vector embeddings, one 320-dim vector per token (not a single pooled vector).
- Scoring: late-interaction MaxSim between query-token and page-token vectors, via score_multi_vector.
Requirements
The Qwen3.5 hybrid (GatedDeltaNet + full-attention) backbone has hard runtime kernel dependencies a vanilla ColQwen / PaliGemma card does not:
pip install "git+https://github.com/illuin-tech/colpali@2e0b927051af727238783af039dcc2c50a4d8c27"
pip install causal-conv1d flash-linear-attention
- causal-conv1d + flash-linear-attention are required (the hybrid layers import them at runtime).
- Attention must be SDPA. Retrieval runs bidirectional attention on the full-attention layers; flash_attention_2 silently ignores the 2-D mask and scores as if causal.
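Because the kernel packages are imported at runtime, a quick preflight check before loading the model can save a failed start. The sketch below assumes the import names are `causal_conv1d` and `fla` (for flash-linear-attention); these are assumptions about the installed packages, so adjust them if your versions differ. `missing_kernels` is a hypothetical helper, not part of any library.

```python
import importlib.util

# Assumed import names (pip package -> module); verify against your install.
REQUIRED_MODULES = {
    "causal-conv1d": "causal_conv1d",
    "flash-linear-attention": "fla",
}

def missing_kernels(required: dict[str, str] = REQUIRED_MODULES) -> list[str]:
    """Return the pip package names whose import module cannot be found."""
    return [pkg for pkg, module in required.items()
            if importlib.util.find_spec(module) is None]

missing = missing_kernels()
if missing:
    print("Missing kernel packages:", ", ".join(missing))
else:
    print("All kernel packages found.")
```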
Usage
import torch
from PIL import Image
from colpali_engine.models import ColQwen3_5, ColQwen3_5Processor

model = ColQwen3_5.from_pretrained(
    "vultr/VultronRetrieverFlash-Qwen3.5-0.8B",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",  # required (see above)
    device_map="cuda:0",
).eval()
processor = ColQwen3_5Processor.from_pretrained(
    "vultr/VultronRetrieverFlash-Qwen3.5-0.8B",
    max_num_visual_tokens=1792,
)

# Document pages (rendered to images) and text queries
images = [Image.open("page_0.png"), Image.open("page_1.png")]
queries = ["What was Q3 revenue?", "Summarize the safety findings."]

with torch.no_grad():
    doc_emb = model(**processor.process_images(images).to(model.device))
    qry_emb = model(**processor.process_queries(queries).to(model.device))

# Late-interaction MaxSim scoring (feed fp32 to match the eval discipline)
scores = processor.score_multi_vector(qry_emb.float(), doc_emb.float())
# scores[i, j] = relevance of query i to page j
print(scores.shape)  # torch.Size([2, 2])
config.json carries dim=320, so custom_text_proj is sized correctly at load, with no manual
config edits needed.
Serving with vLLM
vLLM serves this model natively through its pooling runner (the ColQwen3_5 architecture), returning
the per-token multi-vectors for late-interaction scoring. It requires a vLLM build that includes the
ColQwen3.5 retrieval-correctness fix (vllm-project/vllm#46108,
merged 2026-06-22): build from main, or use a release tagged after that date. The fix runs the
backbone bidirectionally and restores the projection bias, so vLLM reproduces the transformers
reference within run-to-run noise. The server uses the stock chat/image processor, so the ColQwen3.5
prompt contract is applied client-side: wrap each page image in the instruction template, append
the query-augmentation tokens to each query, and set the visual-token budget through
mm-processor-kwargs. Prefix caching and chunked prefill must be off (bidirectional attention and
the GatedDeltaNet hybrid both break the causal-prefix invariant).
import torch
from PIL import Image
from vllm import LLM

MODEL = "vultr/VultronRetrieverFlash-Qwen3.5-0.8B"
MAX_PIXELS = 1792 * 32 * 32  # max_num_visual_tokens * (patch_size 16 * merge_size 2)^2

llm = LLM(
    model=MODEL,
    runner="pooling",
    dtype="bfloat16",
    enable_prefix_caching=False,
    enable_chunked_prefill=False,
    mm_processor_kwargs={"min_pixels": 65536, "max_pixels": MAX_PIXELS},
)

# ColQwen3.5 processor contract, applied client-side:
IMAGE_PROMPT = ("<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
                "Describe the image.<|im_end|><|endoftext|>")

def query_prompt(q):
    return q + "<|endoftext|>" * 10  # query augmentation

images = [Image.open("page_0.png"), Image.open("page_1.png")]
queries = ["What was Q3 revenue?", "Summarize the safety findings."]

doc_out = llm.encode([{"prompt": IMAGE_PROMPT, "multi_modal_data": {"image": im}}
                      for im in images], pooling_task="token_embed")
qry_out = llm.encode([query_prompt(q) for q in queries], pooling_task="token_embed")

def mv(o):  # one [num_tokens, 320] multi-vector per item, L2-normalized per token
    t = torch.as_tensor(o.outputs.data, dtype=torch.float32)
    return torch.nn.functional.normalize(t, p=2, dim=-1)

docs, qrys = [mv(o) for o in doc_out], [mv(o) for o in qry_out]

# late-interaction MaxSim: per query token take the best doc token, then sum
scores = [[(q @ d.T).max(dim=-1).values.sum().item() for d in docs] for q in qrys]
print(scores)  # scores[i][j] = relevance of query i to page j
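Once you have the nested `scores` list, turning it into a ranked result per query is a small sorting step. The sketch below shows one way to do it; `rank_pages` is a hypothetical helper and the numbers are invented for illustration, not real model output.

```python
def rank_pages(scores, top_k=3):
    """For each query, return (page_index, score) pairs sorted by descending score."""
    ranked = []
    for row in scores:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranked.append([(j, row[j]) for j in order[:top_k]])
    return ranked

# Illustrative scores only (2 queries x 3 pages).
example_scores = [[11.2, 14.7, 9.8],
                  [13.1, 10.4, 12.6]]

for i, hits in enumerate(rank_pages(example_scores, top_k=2)):
    print(f"query {i}: {hits}")
```

MaxSim scores grow with the number of query tokens, so they are comparable across pages for one query but not necessarily across queries.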
To serve over HTTP instead:
vllm serve vultr/VultronRetrieverFlash-Qwen3.5-0.8B \
--runner pooling \
--no-enable-prefix-caching --no-enable-chunked-prefill \
--mm-processor-kwargs '{"min_pixels": 65536, "max_pixels": 1835008}'
Apply the same image template and query augmentation in your client requests. See the upstream example
examples/pooling/score/colqwen3_5_rerank_online.py for the full online rerank flow.
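The pixel values passed through mm-processor-kwargs follow from the visual-token budget. This short sketch reproduces the arithmetic, using the patch size 16 and merge size 2 stated in the Python snippet's comment; the helper names are hypothetical.

```python
PATCH_SIZE = 16  # from the comment in the vLLM Python snippet
MERGE_SIZE = 2
PIXELS_PER_TOKEN = (PATCH_SIZE * MERGE_SIZE) ** 2  # 32 x 32 = 1024 pixels per visual token

def tokens_to_pixels(num_visual_tokens: int) -> int:
    """Pixel budget corresponding to a visual-token budget."""
    return num_visual_tokens * PIXELS_PER_TOKEN

def pixels_to_tokens(num_pixels: int) -> int:
    """Visual-token count corresponding to a pixel budget."""
    return num_pixels // PIXELS_PER_TOKEN

print(tokens_to_pixels(1792))   # 1835008, the max_pixels value used here
print(pixels_to_tokens(65536))  # 64, the token floor implied by min_pixels
```

If you change the visual-token budget, recompute max_pixels the same way so the server and the transformers reference see the same page resolution.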
Limitations
- ViDoRe V3 figures cover 8 of the 10 V3 domains; Telecom and Nuclear have not been evaluated yet.
- Tuned for six languages (en, fr, de, es, it, pt); other languages are out of distribution.
- Late-interaction multi-vector indexes are larger than single-vector dense indexes; that is the price of per-token sensitivity to layout, tables, and figures, although at 320-dim this index is small for its class.
- This is the small tier; for maximum ViDoRe V3 accuracy use the 8B flagship VultronRetrieverPrime-Qwen3.5-8B.
License
Apache 2.0, covering the contents of this repository: model weights, config, and evaluation results.
Built on Qwen/Qwen3.5-0.8B (Apache 2.0); the upstream license and attribution are retained. The
training recipe and the assembled training dataset are not distributed in this repository.
Citation
@misc{vultronretrieverflash2026,
title = {VultronRetrieverFlash-Qwen3.5-0.8B: Small-Tier Late-Interaction Visual Document Retrieval at 320 Dimensions},
author = {Georgiou, Athos (athrael-soju)},
year = {2026},
howpublished = {\url{https://huggingface.co/vultr/VultronRetrieverFlash-Qwen3.5-0.8B}}
}