Qwen3-VL-Reranker-2B-NVFP4

An NVFP4 (W4A16) quantized version of Qwen3-VL-Reranker-2B, produced with NVIDIA ModelOpt.

Quantization Details

| Item | Value |
|---|---|
| Base model | Qwen/Qwen3-VL-Reranker-2B |
| Quantization tool | NVIDIA ModelOpt v0.44.0 |
| Quantization format | W4A16 NVFP4 (weights in FP4, activations in BF16) |
| Model size | 4.0 GB (BF16) → 2.1 GB |
| Weight block size | 16 |
| Skipped layers | lm_head, model.visual* (vision encoder) |
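
To make the block size and format entries more concrete, here is a minimal sketch of block-wise FP4 (E2M1) weight quantization in plain Python. It is an illustration under stated assumptions, not ModelOpt's implementation: the per-block scale is kept as an ordinary float, whereas NVFP4 stores it in FP8 (E4M3) together with a per-tensor scale, and the function names and toy weights are hypothetical.

```python
# Illustrative sketch of block-wise FP4 (E2M1) weight quantization.
# Simplification (assumption): the per-block scale stays a Python float here;
# NVFP4 stores it in FP8 (E4M3) and adds a per-tensor scale on top.

E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # representable |values|
BLOCK_SIZE = 16  # matches the "Weight block size" entry above


def quantize_block(block):
    """Return (scale, codes), where codes are signed E2M1 grid values."""
    amax = max(abs(w) for w in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # largest magnitude maps to 6.0
    codes = []
    for w in block:
        target = abs(w) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda g: abs(g - target))
        codes.append(nearest if w >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate weights from a scale and its E2M1 codes."""
    return [scale * c for c in codes]


def bits_per_weight(block_size=BLOCK_SIZE, weight_bits=4, scale_bits=8):
    """Storage cost per quantized weight, ignoring the per-tensor scale."""
    return weight_bits + scale_bits / block_size


weights = [0.01 * (i - 8) for i in range(BLOCK_SIZE)]  # toy block, not real weights
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.5f} max_error={max_error:.5f} bits/weight={bits_per_weight()}")
```

Under these assumptions a quantized weight costs 4.5 bits instead of 16. The checkpoint shrinks by less than that ratio, presumably because the skipped layers listed above stay in BF16.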

Hardware Requirements

Supports NVIDIA Ampere and later GPUs via the Marlin FP4 kernel. Blackwell GPUs provide additional performance benefits.
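
A quick way to check a machine against this requirement is to look at the GPU's CUDA compute capability. The sketch below is an assumption-laden helper, not something vLLM provides: `fp4_support` and `detect_capability` are hypothetical names, and the thresholds (Ampere starts at 8.0, Blackwell at 10.0) come from NVIDIA's published compute capabilities rather than from vLLM's own kernel checks.

```python
# Sketch: classify a CUDA compute capability against the requirement above.
# The thresholds are assumptions taken from NVIDIA's compute capability table
# (Ampere starts at 8.0, Blackwell at 10.0); they are not read from vLLM.

def fp4_support(capability):
    """Map a (major, minor) capability tuple to a coarse support label."""
    major, minor = capability
    if (major, minor) < (8, 0):
        return "unsupported"          # older than Ampere
    if major >= 10:
        return "blackwell-or-newer"   # Blackwell generation and later
    return "marlin"                   # Ampere, Ada, Hopper


def detect_capability():
    """Return the capability of GPU 0, or None if PyTorch/CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


cap = detect_capability()
print("no CUDA GPU detected" if cap is None else f"{cap}: {fp4_support(cap)}")
```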

Usage (vLLM)

Start Reranker Server

vllm serve jeffpeng3/Qwen3-VL-Reranker-2B-NVFP4 \
  --runner pooling \
  --hf_overrides '{"architectures": ["Qwen3VLForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}' \
  --quantization modelopt \
  --dtype bfloat16 \
  --max-model-len 1024
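
Once the server is running, it can be queried over HTTP. The following is a minimal client sketch using only the standard library. It assumes the server listens on vLLM's default port 8000 and exposes the Jina-style `/rerank` endpoint with `query`/`documents` request fields and `results[].relevance_score` in the response. It has not been verified against this checkpoint, and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumptions: default vLLM port 8000 and the Jina-style /rerank endpoint.
# Adjust BASE_URL if the server was started with a different host or port.
BASE_URL = "http://localhost:8000"
MODEL = "jeffpeng3/Qwen3-VL-Reranker-2B-NVFP4"


def build_rerank_payload(query, documents, model=MODEL):
    """Build the JSON body for a rerank request."""
    return {"model": model, "query": query, "documents": list(documents)}


def extract_scores(response_body):
    """Return (document index, relevance score) pairs from a rerank response."""
    return [(r["index"], r["relevance_score"]) for r in response_body["results"]]


def rerank(query, documents, base_url=BASE_URL):
    """POST a rerank request and return (index, score) pairs."""
    request = urllib.request.Request(
        f"{base_url}/rerank",
        data=json.dumps(build_rerank_payload(query, documents)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_scores(json.load(response))


# Example call (requires the server to be running):
# rerank("A woman playing with her dog on a beach at sunset.",
#        ["A woman with her golden retriever on a beach.",
#         "Mars is known as the Red Planet."])
```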

Score API Example

from vllm import LLM

llm = LLM(
    model="jeffpeng3/Qwen3-VL-Reranker-2B-NVFP4",
    runner="pooling",
    dtype="bfloat16",
    max_model_len=1024,
    hf_overrides={
        "architectures": ["Qwen3VLForSequenceClassification"],
        "classifier_from_token": ["no", "yes"],
        "is_original_qwen3_reranker": True,
    },
)

query = "A woman playing with her dog on a beach at sunset."
documents = [
    {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach."},
    {"text": "Mars is known as the Red Planet."},
]

# Score each document against the query; a higher score means more relevant.
for doc in documents:
    outputs = llm.score(
        query,
        {"content": [{"type": "text", "text": doc["text"]}]},
        chat_template="additional_chat_templates/reranker.jinja",
    )
    print(f"Score: {outputs[0].outputs.score}")
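
The loop above prints one score per document, while a reranker is typically used to order candidates. A minimal sketch of that last step follows. `rank_documents` is a hypothetical helper, and `overlap_score` is a crude word-overlap stand-in so the snippet runs without a GPU; in practice the scoring function would wrap the `llm.score` call shown above.

```python
def rank_documents(query, documents, score_fn):
    """Return (score, document) pairs sorted from most to least relevant."""
    scored = [(score_fn(query, doc), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def overlap_score(query, doc):
    """Stand-in scorer (NOT the model): fraction of query words found in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc["text"].lower().split())
    return len(query_words & doc_words) / len(query_words)


query = "A woman playing with her dog on a beach at sunset."
documents = [
    {"text": "Mars is known as the Red Planet."},
    {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach."},
]

# To use the model instead, pass a function that calls llm.score as in the
# example above and returns outputs[0].outputs.score.
ranked = rank_documents(query, documents, overlap_score)
for score, doc in ranked:
    print(f"{score:.2f}  {doc['text']}")
```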

See the base model card for detailed usage and benchmarks.

Citation

@article{qwen3vlembedding,
  title={Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking},
  author={Li, Mingxin and Zhang, Yanzhao and Long, Dingkun and Chen, Keqin and Song, Sibo and Bai, Shuai and Yang, Zhibo and Xie, Pengjun and Yang, An and Liu, Dayiheng and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2601.04720},
  year={2026}
}