Btoks Qwen3VL 2B Instruct

Btoks is a multimodal embedding model fine-tuned from Qwen/Qwen3-VL-2B-Instruct with the Bottleneck Tokens method.

  • Model name on MMEB-V2 leaderboard: Btoks
  • Base model: Qwen/Qwen3-VL-2B-Instruct
  • Paper: https://arxiv.org/abs/2604.11095
  • Model repository: https://huggingface.co/siyrus/Btoks-Qwen3VL-2B-Instruct

MMEB-V2

Local validation with the public MMEB-V2 leaderboard scripts gives:

Overall   Image-Overall   Video-Overall   Visdoc-Overall
68.29     71.55           49.12           77.77

These numbers are not expected to match the paper tables one-to-one. After the paper experiments, we fixed a small number of data and evaluation bugs and trained with a larger data mixture and at a larger scale.
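The Overall score reported above appears consistent with a task-count-weighted mean of the three modality scores. The sketch below assumes that MMEB-V2 comprises 36 image, 18 video, and 24 visual-document tasks (78 in total). These counts are not stated in this card and should be checked against the benchmark's own documentation. The names `TASK_COUNTS` and `overall_score` are illustrative only.

```python
# Assumed MMEB-V2 task counts per modality (not stated in this model card).
TASK_COUNTS = {"image": 36, "video": 18, "visdoc": 24}

# Per-modality scores from the table above.
MODALITY_SCORES = {"image": 71.55, "video": 49.12, "visdoc": 77.77}


def overall_score(scores, task_counts):
    """Task-count-weighted mean of the per-modality scores."""
    total_tasks = sum(task_counts.values())
    weighted_sum = sum(scores[name] * task_counts[name] for name in task_counts)
    return weighted_sum / total_tasks


if __name__ == "__main__":
    print(round(overall_score(MODALITY_SCORES, TASK_COUNTS), 2))  # 68.29
```

Under these assumed counts, the weighted sum is 5326.44 over 78 tasks, which rounds to the reported 68.29.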

Training Summary

The model was trained with a compact BToks/SIEVE setup on top of Qwen3-VL:

  • BToks / SIEVE tokens: 4
  • LoRA rank / alpha / dropout: 16 / 32 / 0.05
  • DoRA: enabled
  • Generation-loss weight: 0.2
  • Training steps: 5000
  • Exported checkpoint: step 4500
  • Contrastive temperature: 0.02
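To illustrate how the contrastive temperature and the generation-loss weight listed above might interact, here is a minimal sketch in plain Python. It assumes a one-directional in-batch InfoNCE loss over cosine similarities and a simple weighted sum with the generation loss. Neither choice is confirmed by this card, and the function names are hypothetical. The actual training objective is described in the paper.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(queries, targets, temperature=0.02):
    """In-batch InfoNCE (assumed form): targets[i] is the positive for queries[i]."""
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]
    losses = []
    for i, qi in enumerate(q):
        # Similarity of query i to every target, sharpened by the temperature.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


def total_loss(contrastive, generation, generation_weight=0.2):
    """Assumed combination: contrastive loss plus a down-weighted generation loss."""
    return contrastive + generation_weight * generation
```

With a temperature of 0.02, cosine similarities in [-1, 1] are stretched to logits in [-50, 50], so even small similarity gaps between the positive and the in-batch negatives produce a sharply peaked softmax.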

The main training data sources include:

  • Image classification, VQA, retrieval, and grounding data, including ImageNet, N24News, VOC2007, SUN397, OK-VQA, A-OKVQA, DocVQA, ChartQA, Visual7W, GQA, TextVQA, VizWiz, VisDial, CIRR, VisualNews, MSCOCO, WebQA, FashionIQ, Wiki-SS-NQ, OVEN, EDIS, INFOSEEK, Fashion200K, and RefCOCO.
  • Visual document retrieval data from the ColPali / ViDoRe family and VisRAG in-domain training data.
  • Video classification, retrieval, QA, and moment-retrieval data, including Kinetics-700, Something-Something V2, HMDB51, UCF101, MSR-VTT, MSVD, DiDeMo, YouCook2, ActivityNet Captions, VATEX, QVHighlights, Charades-STA, NExTQA, and VideoChat2-IT.

Some source names overlap with MMEB-V2 benchmark dataset names because they come from the same public dataset families. Wherever overlap was possible, the MMEB-V2 benchmark records were removed from the admitted training splits, so the model was not trained on samples that appear in the leaderboard evaluation.
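The card does not describe how overlapping records were identified. As one possible approach, the sketch below drops training records whose content hash matches a benchmark record. The record fields (`text`, `image`), the normalization, and the function names are all hypothetical and do not reflect the authors' actual pipeline.

```python
import hashlib


def record_key(record):
    """Hypothetical overlap key: hash of whitespace/case-normalized text plus image path."""
    text = " ".join(record["text"].lower().split())
    payload = f"{text}|{record['image']}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def drop_benchmark_overlap(train_records, benchmark_records):
    """Return the training records whose key does not appear in the benchmark."""
    blocked = {record_key(r) for r in benchmark_records}
    return [r for r in train_records if record_key(r) not in blocked]
```

Exact-match hashing like this only catches verbatim duplicates after normalization. Re-encoded images or paraphrased text would need a fuzzier comparison.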

Loading

This repository stores a merged inference checkpoint for the VLM2Emb/Btoks embedding wrapper. It is not a plain Qwen3-VL causal language model.

The public inference/evaluation code is being prepared and is expected to be released in about one week. Loading examples will be added once that code release is ready.
