Qwen3-VL-Reranker-2B: GGUF (llama.cpp)
GGUF builds of Qwen/Qwen3-VL-Reranker-2B for reranking with llama.cpp.
This is a multimodal reranker: both the query and the candidate documents can
be text, images, or interleaved text+image (the included mmproj provides
the vision tower). The HF pipeline_tag is set to text-ranking only because
that is the closest value in HF's fixed pipeline vocabulary (there is no
dedicated multimodal-ranking tag), and it matches the upstream model's tag.
A reranker scores how relevant each candidate document is to a query. Qwen3 /
Qwen3-VL rerankers are generative cross-encoders that express relevance as the
probability of the token yes vs no at the final position. The
official convert_hf_to_gguf.py bakes that behaviour into a 2-class
rank-pooling head, so at inference llama.cpp softmaxes the [yes, no] logits
and returns relevance_score = P("yes") ∈ [0, 1].
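To make the scoring step concrete, here is a minimal sketch of a two-class softmax over the [yes, no] logits. The function name `relevance_score` and the example logit values are illustrative assumptions, not code or numbers taken from llama.cpp.

```python
import math


def relevance_score(yes_logit: float, no_logit: float) -> float:
    """Two-class softmax: return P("yes") given the [yes, no] logits."""
    # Subtract the larger logit before exponentiating for numerical stability.
    m = max(yes_logit, no_logit)
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)


# Equal logits give 0.5; a larger "yes" logit pushes the score towards 1.
# (Hypothetical logit values, chosen only for illustration.)
print(relevance_score(0.0, 0.0))
print(relevance_score(2.0, -1.0))
```

With only two classes this is the same as a sigmoid of the logit difference, so the score depends only on how far the yes logit sits above the no logit.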
⚠️ Why this conversion is different
Several community GGUFs of this model were exported as plain generative
Qwen3-VL models: they are missing the reranker head (cls.output.weight),
the pooling_type = RANK metadata, and the baked rerank chat template. Loaded
with --pooling rank they don't error; they silently emit meaningless,
near-constant scores.
These files are a correct rank-pooling conversion and were verified to contain:
| Marker | Value |
|---|---|
| qwen3vl.pooling_type | 4 (RANK) |
| qwen3vl.classifier.output_labels | ["yes", "no"] |
| cls.output.weight tensor | shape (2048, 2) |
| tokenizer.chat_template.rerank | baked {query}/{document} template |
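The same markers can be checked programmatically. The sketch below assumes the GGUF metadata has already been read into a plain dict and the tensor names into a list (the reading step, for example with a GGUF reader library, is not shown). The helper `missing_rank_markers` and the constant names are hypothetical.

```python
from typing import List, Mapping, Sequence

# Expected values copied from the marker table above.
EXPECTED_METADATA = {
    "qwen3vl.pooling_type": 4,  # 4 = RANK
    "qwen3vl.classifier.output_labels": ["yes", "no"],
}
# Keys that only need to be present and non-empty.
REQUIRED_METADATA_KEYS = ("tokenizer.chat_template.rerank",)
REQUIRED_TENSORS = ("cls.output.weight",)


def missing_rank_markers(metadata: Mapping[str, object],
                         tensor_names: Sequence[str]) -> List[str]:
    """Return a list of rank-pooling markers that are absent or have the wrong value."""
    problems = []
    for key, expected in EXPECTED_METADATA.items():
        if key not in metadata:
            problems.append(f"missing metadata key: {key}")
        elif metadata[key] != expected:
            problems.append(f"{key} is {metadata[key]!r}, expected {expected!r}")
    for key in REQUIRED_METADATA_KEYS:
        if not metadata.get(key):
            problems.append(f"missing metadata key: {key}")
    for name in REQUIRED_TENSORS:
        if name not in tensor_names:
            problems.append(f"missing tensor: {name}")
    return problems
```

An empty list means all four markers are in place; a plain generative export of the kind described above would be expected to report several problems.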
Files
| File | Size | Purpose |
|---|---|---|
| Qwen3-VL-Reranker-2B.f16.gguf | 3.4 GB | Language model + rank head (f16). Required. |
| Qwen3-VL-Reranker-2B.mmproj-f16.gguf | 822 MB | Vision projector (f16). Optional: only for image / multimodal document reranking. |
Text-only reranking needs just the LM file.
Usage (llama.cpp)
Start llama-server in reranking mode and call the /v1/rerank endpoint:
llama-server -m Qwen3-VL-Reranker-2B.f16.gguf \
--reranking --pooling rank -ngl 99 --port 8080
curl http://localhost:8080/v1/rerank -H 'Content-Type: application/json' -d '{
"query": "What is the capital of France?",
"documents": [
"Paris is the capital and most populous city of France.",
"Bananas are a yellow tropical fruit rich in potassium.",
"The Eiffel Tower is a famous landmark located in Paris, France."
]
}'
The response lists documents sorted by relevance_score (descending). All three
of --reranking, --pooling rank, and a GPU offload (-ngl) are recommended;
without --reranking the server replies "This server does not support
reranking".
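For callers that prefer Python over curl, a minimal standard-library sketch of the same request follows. It assumes the server is listening on localhost:8080 as in the command above, and that the JSON response carries a `results` list whose entries have `index` and `relevance_score` fields; the function names are hypothetical.

```python
import json
import urllib.request

DEFAULT_URL = "http://localhost:8080/v1/rerank"  # assumed local server address


def build_rerank_request(query, documents, url=DEFAULT_URL):
    """Build the POST request with the same JSON body as the curl example."""
    body = json.dumps({"query": query, "documents": documents}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def pair_results(response, documents):
    """Pair each score with its document text, highest score first."""
    # Assumption: every entry has "index" (position in `documents`)
    # and "relevance_score".
    results = sorted(
        response["results"], key=lambda r: r["relevance_score"], reverse=True
    )
    return [(r["relevance_score"], documents[r["index"]]) for r in results]


def rerank(query, documents, url=DEFAULT_URL):
    """Send the request to a running llama-server and return (score, document) pairs."""
    with urllib.request.urlopen(build_rerank_request(query, documents, url)) as resp:
        return pair_results(json.load(resp), documents)
```

Calling `rerank("What is the capital of France?", docs)` with the three documents from the curl example should return them as (score, text) pairs, provided the server was started with --reranking.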
Multimodal note: the mmproj file enables reranking documents that contain images, via mtmd-aware runtimes. llama.cpp's stock /v1/rerank endpoint is text-only today; image-document reranking requires an mtmd pipeline that feeds the projector and reads the rank score from the pooled output.
Prompt / instruction
The rerank template bakes a default instruction:
Given a web search query, retrieve relevant passages that answer the query
A task-specific instruction (passed as the system message in mtmd pipelines, or substituted into the template) typically improves accuracy by a few points.
Verification
Smoke-tested on an RTX 3090 via llama-server --reranking --pooling rank:
Query: "What is the capital of France?"
0.733 Paris is the capital ... of France (relevant)
0.514 The Eiffel Tower ... in Paris, France (related)
0.270 Bananas are a yellow ... fruit (irrelevant)
Query: "How do I reverse a string in Python?"
0.706 ... my_string[::-1] ... reverse order (relevant)
0.691 Use slicing with a step of -1 ... (relevant)
0.233 The mitochondria is the powerhouse ... (irrelevant)
The scores separate relevant from irrelevant documents cleanly, which is the expected behaviour of a correctly converted rank head.
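One way to quantify that separation is sketched below, using the smoke-test scores listed above. The helper names are hypothetical, and the "related" Eiffel Tower document is left out of both groups because it is neither clearly relevant nor irrelevant.

```python
def score_spread(scores):
    """Max minus min; a value near zero suggests near-constant (broken) scores."""
    return max(scores) - min(scores)


def separation_margin(relevant, irrelevant):
    """Lowest relevant score minus highest irrelevant score; positive means separated."""
    return min(relevant) - max(irrelevant)


# Scores copied from the two smoke-test queries above.
capital_margin = separation_margin([0.733], [0.270])
python_margin = separation_margin([0.706, 0.691], [0.233])
print(round(capital_margin, 3), round(python_margin, 3))  # prints: 0.463 0.458
```

Both margins are well above zero here. A conversion without the rank head would be expected to show a spread and margin close to zero, matching the near-constant scores described earlier.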
Conversion
Converted from the official safetensors with llama.cpp's convert_hf_to_gguf.py,
which auto-detects the Qwen3-VL reranker and extracts the yes/no rows of the
LM head into cls.output.weight:
python convert_hf_to_gguf.py Qwen3-VL-Reranker-2B --outtype f16 \
--outfile Qwen3-VL-Reranker-2B.f16.gguf
python convert_hf_to_gguf.py Qwen3-VL-Reranker-2B --mmproj \
--outfile Qwen3-VL-Reranker-2B.mmproj-f16.gguf
License & attribution
Apache-2.0, inherited from the base model Qwen/Qwen3-VL-Reranker-2B (built on Qwen3-VL-2B-Instruct). All credit for the model itself goes to the Qwen team; these are format conversions only.