Qwen3-VL-Reranker-2B - GGUF (llama.cpp)

GGUF builds of Qwen/Qwen3-VL-Reranker-2B for reranking with llama.cpp.

This is a multimodal reranker: both the query and the candidate documents can be text, images, or interleaved text+image (the included mmproj provides the vision tower). The HF pipeline_tag is set to text-ranking only because that is the closest value in HF's fixed pipeline vocabulary (there is no dedicated multimodal-ranking tag), and it matches the upstream model's tag.

A reranker scores how relevant each candidate document is to a query. Qwen3 / Qwen3-VL rerankers are generative cross-encoders that express relevance as the probability of the token yes vs no at the final position. The official convert_hf_to_gguf.py bakes that behaviour into a 2-class rank-pooling head, so at inference llama.cpp softmaxes the [yes, no] logits and returns relevance_score = P("yes") ∈ [0, 1].
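As a minimal sketch of the scoring step described above (an illustration, not llama.cpp's actual implementation), a two-way softmax over the yes and no logits can be written as follows. The function name and the example logit values are made up for illustration.

```python
import math

def relevance_score(yes_logit: float, no_logit: float) -> float:
    """Two-class softmax over [yes, no]; returns P("yes")."""
    # Subtract the larger logit before exponentiating for numerical stability.
    m = max(yes_logit, no_logit)
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)

print(relevance_score(2.0, 2.0))  # equal logits give 0.5
```

With only two classes this is the same as a sigmoid of the difference between the two logits, so the score depends only on how far the yes logit sits above or below the no logit.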

⚠️ Why this conversion is different

Several community GGUFs of this model were exported as plain generative Qwen3-VL models: they are missing the reranker head (cls.output.weight), the pooling_type = RANK metadata, and the baked rerank chat template. Loaded with --pooling rank they don't error; they silently emit meaningless, near-constant scores.

These files are a correct rank-pooling conversion and were verified to contain:

| Marker | Value |
| --- | --- |
| qwen3vl.pooling_type | 4 (RANK) |
| qwen3vl.classifier.output_labels | ["yes", "no"] |
| cls.output.weight | tensor shape (2048, 2) |
| tokenizer.chat_template.rerank | baked {query}/{document} template |
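The same check can be scripted against any GGUF you are unsure about. The sketch below assumes the file's metadata has already been decoded into a plain dict and its tensor names into a set (for example with the gguf Python package or a dump tool; that loading step is not shown). The function and constant names are hypothetical; the expected values come from the table above.

```python
# Expected markers, taken from the table above.
EXPECTED_METADATA = {
    "qwen3vl.pooling_type": 4,  # RANK
    "qwen3vl.classifier.output_labels": ["yes", "no"],
}
REQUIRED_METADATA_KEYS = ["tokenizer.chat_template.rerank"]
REQUIRED_TENSORS = ["cls.output.weight"]

def missing_rank_markers(metadata: dict, tensor_names: set) -> list:
    """Return human-readable problems; an empty list means all markers are present."""
    problems = []
    for key, want in EXPECTED_METADATA.items():
        got = metadata.get(key)
        if got != want:
            problems.append(f"{key}: expected {want!r}, got {got!r}")
    for key in REQUIRED_METADATA_KEYS:
        if not metadata.get(key):
            problems.append(f"{key}: missing or empty")
    for name in REQUIRED_TENSORS:
        if name not in tensor_names:
            problems.append(f"tensor {name}: missing")
    return problems

# A plain generative export has none of the markers.
print(len(missing_rank_markers({}, {"output.weight"})))  # 4
```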

Files

| File | Size | Purpose |
| --- | --- | --- |
| Qwen3-VL-Reranker-2B.f16.gguf | 3.4 GB | Language model + rank head (f16). Required. |
| Qwen3-VL-Reranker-2B.mmproj-f16.gguf | 822 MB | Vision projector (f16). Optional; only needed for image / multimodal document reranking. |

Text-only reranking needs just the LM file.

Usage (llama.cpp)

Start llama-server in reranking mode and call the /v1/rerank endpoint:

llama-server -m Qwen3-VL-Reranker-2B.f16.gguf \
  --reranking --pooling rank -ngl 99 --port 8080

# In another terminal, once the server is up:
curl http://localhost:8080/v1/rerank -H 'Content-Type: application/json' -d '{
  "query": "What is the capital of France?",
  "documents": [
    "Paris is the capital and most populous city of France.",
    "Bananas are a yellow tropical fruit rich in potassium.",
    "The Eiffel Tower is a famous landmark located in Paris, France."
  ]
}'

The response lists documents sorted by relevance_score (descending). All three of --reranking, --pooling rank, and a GPU offload (-ngl) are recommended; without --reranking the server replies "This server does not support reranking".
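For callers that prefer Python over curl, the request above can be sketched with the standard library alone. This assumes the server from the command above is listening on localhost:8080 and that the response carries a "results" list whose items have "index" (position in the request's documents) and "relevance_score"; check those field names against the response your llama.cpp build returns. Both function names are hypothetical.

```python
import json
import urllib.request

def rerank(query, documents, url="http://localhost:8080/v1/rerank"):
    """POST the query and documents to a running llama-server; return the parsed JSON."""
    body = json.dumps({"query": query, "documents": documents}).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def ranked_documents(documents, response):
    """Pair each score with its document text, highest score first.

    Assumes response["results"] items carry "index" and "relevance_score".
    """
    results = sorted(
        response["results"], key=lambda r: r["relevance_score"], reverse=True
    )
    return [(r["relevance_score"], documents[r["index"]]) for r in results]
```

Calling ranked_documents(docs, rerank(query, docs)) then yields (score, text) pairs. Sorting on the client side keeps the result ordered even if a server build returns results in request order.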

Multimodal note: the mmproj file enables reranking documents that contain images, via mtmd-aware runtimes. llama.cpp's stock /v1/rerank endpoint is text-only today; image-document reranking requires an mtmd pipeline that feeds the projector and reads the rank score from the pooled output.

Prompt / instruction

The rerank template bakes a default instruction:

Given a web search query, retrieve relevant passages that answer the query

A task-specific instruction (passed as the system message in mtmd pipelines, or substituted into the template) typically improves accuracy by a few points.
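One way to picture the substitution route is a plain string replacement on the template text. The sketch below uses a simplified, hypothetical template; the real one is stored under tokenizer.chat_template.rerank and is not reproduced here. The helper name and the example instruction are also made up.

```python
DEFAULT_INSTRUCTION = (
    "Given a web search query, retrieve relevant passages that answer the query"
)

def with_instruction(template: str, instruction: str) -> str:
    """Swap the baked default instruction for a task-specific one."""
    if DEFAULT_INSTRUCTION not in template:
        raise ValueError("default instruction not found in template")
    return template.replace(DEFAULT_INSTRUCTION, instruction)

# Hypothetical, simplified template with {query}/{document} placeholders.
toy_template = (
    f"Instruct: {DEFAULT_INSTRUCTION}\nQuery: {{query}}\nDocument: {{document}}"
)
custom = with_instruction(
    toy_template, "Given a Python question, retrieve code snippets that solve it"
)
print(custom.format(query="reverse a string", document="s[::-1]"))
```

Failing loudly when the default instruction is absent avoids silently scoring with an unchanged prompt.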

Verification

Smoke-tested on an RTX 3090 via llama-server --reranking --pooling rank:

Query: "What is the capital of France?"
  0.733  Paris is the capital ... of France      (relevant)
  0.514  The Eiffel Tower ... in Paris, France    (related)
  0.270  Bananas are a yellow ... fruit           (irrelevant)

Query: "How do I reverse a string in Python?"
  0.706  ... my_string[::-1] ... reverse order    (relevant)
  0.691  Use slicing with a step of -1 ...         (relevant)
  0.233  The mitochondria is the powerhouse ...    (irrelevant)

Clean separation between relevant and irrelevant documents: the expected behaviour of a correctly converted rank head.
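A rough way to turn this eyeball check into a script is to compare score groups directly. The sketch below reuses the scores reported above; the helper names are hypothetical, and what counts as a "near-constant" spread is left to the reader because no threshold is given here.

```python
def score_spread(scores):
    """Gap between the highest and lowest score in one response."""
    return max(scores) - min(scores)

def separates(relevant, irrelevant):
    """True if every relevant score is above every irrelevant score."""
    return min(relevant) > max(irrelevant)

# Scores from the two smoke-test queries above.
france_scores = [0.733, 0.514, 0.270]
python_relevant, python_irrelevant = [0.706, 0.691], [0.233]

print(round(score_spread(france_scores), 3))          # 0.463
print(separates(python_relevant, python_irrelevant))  # True
```

A conversion that lacks the rank head would be expected to show a spread close to zero on the same inputs.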

Conversion

Converted from the official safetensors with llama.cpp's convert_hf_to_gguf.py, which auto-detects the Qwen3-VL reranker and extracts the yes/no rows of the LM head into cls.output.weight:

python convert_hf_to_gguf.py Qwen3-VL-Reranker-2B --outtype f16 \
  --outfile Qwen3-VL-Reranker-2B.f16.gguf
python convert_hf_to_gguf.py Qwen3-VL-Reranker-2B --mmproj \
  --outfile Qwen3-VL-Reranker-2B.mmproj-f16.gguf
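To illustrate what "extracting the yes/no rows" means, here is a toy sketch in pure Python. It is not the converter's code: the 4-token vocabulary, the hidden size of 3, and the token ids are all hypothetical, and the real head has the shape listed in the marker table. It only shows that keeping two rows of the LM head reproduces the full head's logits for those two tokens.

```python
def extract_rank_head(lm_head, yes_id, no_id):
    """Keep only the LM-head rows for the 'yes' and 'no' tokens, in that order."""
    return [lm_head[yes_id], lm_head[no_id]]

def logits(weight_rows, hidden):
    """One dot product of the hidden state with each weight row."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight_rows]

# Toy LM head: 4 vocabulary rows, hidden size 3.
lm_head = [
    [0.1, 0.2, 0.3],
    [1.0, 0.0, -1.0],
    [0.5, 0.5, 0.5],
    [-0.2, 0.4, 0.0],
]
hidden = [1.0, 2.0, 3.0]
YES_ID, NO_ID = 1, 3  # hypothetical token ids

full = logits(lm_head, hidden)
rank_head = extract_rank_head(lm_head, YES_ID, NO_ID)
# The two-row head yields the same logits as the full head at those ids.
assert logits(rank_head, hidden) == [full[YES_ID], full[NO_ID]]
```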

License & attribution

Apache-2.0, inherited from the base model Qwen/Qwen3-VL-Reranker-2B (built on Qwen3-VL-2B-Instruct). All credit for the model itself goes to the Qwen team; these are format conversions only.
