Paper: Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking (arXiv:2601.04720)
An NVFP4 (W4A16) quantized version of Qwen3-VL-Reranker-2B, produced with NVIDIA ModelOpt.
| Item | Value |
|---|---|
| Base model | Qwen/Qwen3-VL-Reranker-2B |
| Quantization tool | NVIDIA ModelOpt v0.44.0 |
| Quantization format | W4A16 NVFP4 (weights in FP4, activations in BF16) |
| Model size | 4.0 GB (BF16) → 2.1 GB (NVFP4) |
| Weight block size | 16 |
| Skipped layers | lm_head, model.visual* (vision encoder) |
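
To make the "FP4 weights, block size 16" rows concrete, here is a minimal sketch of block-wise FP4 rounding in plain Python. It is an illustration under simplifying assumptions, not ModelOpt's implementation: the function names are hypothetical, and the per-block scale is kept in full precision, whereas the real NVFP4 format stores it in FP8 (E4M3) alongside a per-tensor scale.

```python
# Illustrative only: a simplified model of NVFP4 rounding, not ModelOpt code.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 (E2M1) magnitudes
BLOCK_SIZE = 16  # matches "Weight block size" in the table above


def fake_quantize_block(block):
    """Round one block to the FP4 grid using a single shared scale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    scale = amax / E2M1_GRID[-1]  # map the largest magnitude onto 6.0
    out = []
    for x in block:
        # Pick the nearest representable magnitude, then restore the sign.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append((mag if x >= 0 else -mag) * scale)
    return out


def fake_quantize(weights):
    """Quantize a flat list of weights block by block."""
    out = []
    for i in range(0, len(weights), BLOCK_SIZE):
        out.extend(fake_quantize_block(weights[i:i + BLOCK_SIZE]))
    return out


def bits_per_weight(block_size=BLOCK_SIZE, scale_bits=8):
    """Storage cost per weight: 4 data bits plus the amortized block scale."""
    return 4 + scale_bits / block_size
```

With an 8-bit scale per 16 weights, each quantized weight costs about 4.5 bits instead of 16. The checkpoint shrinks by less than that ratio suggests, at least in part because the skipped layers listed in the table stay in BF16.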
Supports NVIDIA Ampere and later GPUs via the Marlin FP4 kernel. Blackwell GPUs provide additional performance benefits.
```bash
vllm serve jeffpeng3/Qwen3-VL-Reranker-2B-NVFP4 \
  --runner pooling \
  --hf_overrides '{"architectures": ["Qwen3VLForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}' \
  --quantization modelopt \
  --dtype bfloat16 \
  --max-model-len 1024
```
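
Once the server is running, it can be queried over HTTP. The sketch below uses only the Python standard library and assumes the server listens on vLLM's default address (`http://localhost:8000`) and exposes the Jina-style `/v1/rerank` endpoint with `query`/`documents` request fields and `results[].relevance_score` in the response. The helper names are hypothetical, the documents are text-only, and the field names should be checked against the vLLM version you have installed.

```python
import json
import urllib.request

# Assumption: default vLLM host and port. Adjust if you passed --host/--port.
BASE_URL = "http://localhost:8000"
MODEL = "jeffpeng3/Qwen3-VL-Reranker-2B-NVFP4"


def build_rerank_request(query, documents, base_url=BASE_URL, model=MODEL):
    """Build a POST request for the rerank endpoint (text documents only)."""
    payload = {"model": model, "query": query, "documents": documents}
    return urllib.request.Request(
        f"{base_url}/v1/rerank",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rerank(query, documents, timeout=30.0):
    """Send the request and return (document index, relevance score) pairs."""
    req = build_rerank_request(query, documents)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return [(r["index"], r["relevance_score"]) for r in body["results"]]
```

Calling `rerank("A woman playing with her dog on a beach at sunset.", ["...", "..."])` against a live server should return one pair per document. Nothing is sent until `rerank` is called.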
```python
from vllm import LLM

llm = LLM(
    model="jeffpeng3/Qwen3-VL-Reranker-2B-NVFP4",
    runner="pooling",
    dtype="bfloat16",
    max_model_len=1024,
    hf_overrides={
        "architectures": ["Qwen3VLForSequenceClassification"],
        "classifier_from_token": ["no", "yes"],
        "is_original_qwen3_reranker": True,
    },
)

query = "A woman playing with her dog on a beach at sunset."
documents = [
    {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach."},
    {"text": "Mars is known as the Red Planet."},
]

for doc in documents:
    outputs = llm.score(
        query,
        {"content": [{"type": "text", "text": doc["text"]}]},
        chat_template="additional_chat_templates/reranker.jinja",
    )
    print(f"Score: {outputs[0].outputs.score}")
```
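
The loop above prints one score per document. For actual reranking, the scores are usually collected and the documents sorted best-first. A minimal sketch follows, where `rank_documents` is a hypothetical helper and the scores in the usage lines are placeholders, not real model output.

```python
def rank_documents(documents, scores):
    """Pair each document with its score and sort by score, highest first."""
    if len(documents) != len(scores):
        raise ValueError("documents and scores must have the same length")
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)


# Placeholder scores for illustration only; in practice, collect
# outputs[0].outputs.score for each document inside the loop.
texts = ["Mars is known as the Red Planet.", "A woman and her dog on a beach."]
placeholder_scores = [0.1, 0.9]
ranked = rank_documents(texts, placeholder_scores)
for text, score in ranked:
    print(f"{score:.2f}  {text}")
```

Sorting on the client side keeps the scoring call unchanged, so the same helper works for scores returned by the HTTP server.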
See the base model card for detailed usage and benchmarks.
```bibtex
@article{qwen3vlembedding,
  title={Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking},
  author={Li, Mingxin and Zhang, Yanzhao and Long, Dingkun and Chen, Keqin and Song, Sibo and Bai, Shuai and Yang, Zhibo and Xie, Pengjun and Yang, An and Liu, Dayiheng and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2601.04720},
  year={2026}
}
```