Btoks Qwen3VL 2B Instruct

Btoks is a multimodal embedding model fine-tuned from Qwen/Qwen3-VL-2B-Instruct with the Bottleneck Tokens method.

Model name on MMEB-V2 leaderboard: Btoks
Base model: Qwen/Qwen3-VL-2B-Instruct
Paper: https://arxiv.org/abs/2604.11095
Model repository: https://huggingface.co/siyrus/Btoks-Qwen3VL-2B-Instruct

MMEB-V2

Local validation with the public MMEB-V2 leaderboard scripts gives:

Overall	Image-Overall	Video-Overall	Visdoc-Overall
68.29	71.55	49.12	77.77
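
The Overall score can be reproduced from the three per-modality scores as a task-count-weighted average. The sketch below assumes the MMEB-V2 split of 36 image, 18 video, and 24 visual-document tasks (78 in total). These counts come from the benchmark's public description, not from this model card, and the per-modality scores are already rounded, so treat this as a consistency check rather than the leaderboard's exact computation.

```python
# Assumed MMEB-V2 task counts per modality (not stated in this model card).
TASK_COUNTS = {"image": 36, "video": 18, "visdoc": 24}

# Per-modality scores reported in the table above.
SCORES = {"image": 71.55, "video": 49.12, "visdoc": 77.77}


def task_weighted_overall(scores, task_counts):
    """Average the modality scores, weighting each by its number of tasks."""
    total_tasks = sum(task_counts.values())
    weighted_sum = sum(scores[name] * task_counts[name] for name in task_counts)
    return weighted_sum / total_tasks


overall = task_weighted_overall(SCORES, TASK_COUNTS)
print(f"{overall:.2f}")  # prints 68.29
```

A plain unweighted mean of the three modality scores would give about 66.15, which is why the task weighting matters when comparing against the Overall column.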

These numbers are not expected to match the paper's tables exactly. After the paper experiments, we fixed a small number of data and evaluation bugs, then trained with a larger data mixture and at a larger scale.

Training Summary

The model was trained with a compact BToks/SIEVE setup on top of Qwen3-VL:

BToks / SIEVE tokens: 4
LoRA rank / alpha / dropout: 16 / 32 / 0.05
DoRA: enabled
Generation-loss weight: 0.2
Training steps: 5000
Exported checkpoint: step 4500
Contrastive temperature: 0.02
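
To illustrate how the contrastive temperature and the generation-loss weight listed above might enter the objective, here is a minimal pure-Python sketch. It assumes an in-batch InfoNCE loss over cosine similarities and an additive combination with the generation loss. Neither the exact loss form nor the function names (`info_nce`, `total_loss`) come from this model card; they are illustrative assumptions, not the released training code.

```python
import math

TEMPERATURE = 0.02     # contrastive temperature from the list above
GEN_LOSS_WEIGHT = 0.2  # generation-loss weight from the list above


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(queries, targets, temperature=TEMPERATURE):
    """Mean InfoNCE loss where targets[i] is the positive for queries[i].

    All other targets in the batch act as negatives (an assumption).
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, target) / temperature for target in targets]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_partition = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_partition - logits[i]
    return total / len(queries)


def total_loss(contrastive, generation, weight=GEN_LOSS_WEIGHT):
    """Hypothetical combined objective: contrastive + weight * generation."""
    return contrastive + weight * generation
```

With a temperature of 0.02, cosine similarities in [-1, 1] are stretched to logits in [-50, 50], so even small similarity gaps between the positive and the negatives produce sharply peaked softmax distributions.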

The main training data sources include:

Image classification, VQA, retrieval, and grounding data, including ImageNet, N24News, VOC2007, SUN397, OK-VQA, A-OKVQA, DocVQA, ChartQA, Visual7W, GQA, TextVQA, VizWiz, VisDial, CIRR, VisualNews, MSCOCO, WebQA, FashionIQ, Wiki-SS-NQ, OVEN, EDIS, INFOSEEK, Fashion200K, and RefCOCO.
Visual document retrieval data from the ColPali / ViDoRe family and VisRAG in-domain training data.
Video classification, retrieval, QA, and moment-retrieval data, including Kinetics-700, Something-Something V2, HMDB51, UCF101, MSR-VTT, MSVD, DiDeMo, YouCook2, ActivityNet Captions, VATEX, QVHighlights, Charades-STA, NExTQA, and VideoChat2-IT.

Some source names overlap with MMEB-V2 benchmark dataset names because they come from the same public dataset families. Wherever overlap was possible, the MMEB-V2 benchmark records were removed from the admitted training splits, so no samples evaluated on the leaderboard were used for training.
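
The kind of filtering described above can be sketched as a key-based exclusion. The model card does not say how overlapping records were identified, so the `(dataset, sample_id)` key and the function name `remove_benchmark_overlap` below are hypothetical. A real pipeline might instead match on image hashes, video IDs, or normalized text.

```python
def remove_benchmark_overlap(train_records, benchmark_records, key_fn):
    """Return the training records whose key does not appear in the benchmark."""
    benchmark_keys = {key_fn(record) for record in benchmark_records}
    return [record for record in train_records if key_fn(record) not in benchmark_keys]


def dataset_and_id(record):
    """Hypothetical identity key: dataset family plus sample identifier."""
    return (record["dataset"], record["sample_id"])


# Toy records for illustration only; these are not real dataset entries.
train = [
    {"dataset": "MSCOCO", "sample_id": "a1"},
    {"dataset": "MSCOCO", "sample_id": "a2"},
    {"dataset": "DocVQA", "sample_id": "d7"},
]
benchmark = [
    {"dataset": "MSCOCO", "sample_id": "a2"},
]

clean_train = remove_benchmark_overlap(train, benchmark, dataset_and_id)
print([record["sample_id"] for record in clean_train])  # prints ['a1', 'd7']
```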

Loading

This repository stores a merged inference checkpoint for the VLM2Emb/Btoks embedding wrapper. It is not a plain Qwen3-VL causal language model.

The public inference/evaluation code is being prepared and is expected to be released in about one week. Loading examples will be added once that code release is ready.

Model size: 2B params (Safetensors, tensor type BF16)

Paper title: Bottleneck Tokens for Unified Multimodal Retrieval (arXiv:2604.11095, published Apr 13)