Qwen3-VL-Reranker-2B — GGUF (llama.cpp)

GGUF builds of Qwen/Qwen3-VL-Reranker-2B for reranking with llama.cpp.

This is a multimodal reranker: both the query and the candidate documents can be text, images, or interleaved text+image (the included mmproj provides the vision tower). The HF `pipeline_tag` is set to `text-ranking` because HF's fixed pipeline vocabulary has no dedicated multimodal-ranking tag, so this is the closest available value; it also matches the upstream model's tag.

A reranker scores how relevant each candidate document is to a query. Qwen3 / Qwen3-VL rerankers are generative cross-encoders that express relevance as the probability of the token yes vs no at the final position. The official convert_hf_to_gguf.py bakes that behaviour into a 2-class rank-pooling head, so at inference llama.cpp softmaxes the [yes, no] logits and returns relevance_score = P("yes") ∈ [0, 1].
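
The scoring step described above can be sketched in a few lines. This is an illustration of a two-class softmax only, not llama.cpp's actual code; the function name `relevance_score` and the example logits are made up for the sketch.

```python
import math


def relevance_score(yes_logit: float, no_logit: float) -> float:
    """Return P("yes") from a softmax over the [yes, no] logits."""
    # Subtract the larger logit first so exp() cannot overflow.
    largest = max(yes_logit, no_logit)
    exp_yes = math.exp(yes_logit - largest)
    exp_no = math.exp(no_logit - largest)
    return exp_yes / (exp_yes + exp_no)


# Hypothetical logits: a higher "yes" logit pushes the score towards 1.
print(relevance_score(2.0, 1.0))
print(relevance_score(0.0, 0.0))  # equal logits give 0.5
```

With only two classes the softmax reduces to a sigmoid of the logit difference, which is why the score always lies in [0, 1].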

⚠️ Why this conversion is different

Several community GGUFs of this model were exported as plain generative Qwen3-VL models — they are missing the reranker head (cls.output.weight), the pooling_type = RANK metadata, and the baked rerank chat template. Loaded with --pooling rank they don't error; they silently emit meaningless, near-constant scores.

These files are a correct rank-pooling conversion and were verified to contain:

| Marker | Value |
| --- | --- |
| `qwen3vl.pooling_type` | `4` (RANK) |
| `qwen3vl.classifier.output_labels` | `["yes", "no"]` |
| `cls.output.weight` tensor | shape `(2048, 2)` |
| `tokenizer.chat_template.rerank` | baked `{query}`/`{document}` template |
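
A check for these markers might look like the sketch below. It assumes the GGUF metadata and tensor names have already been read into a plain `dict` and `set` by whatever GGUF reader you use; the function `missing_rank_markers` is hypothetical and not part of llama.cpp.

```python
# Expected markers, copied from the table above.
EXPECTED_METADATA = {
    "qwen3vl.pooling_type": 4,  # RANK
    "qwen3vl.classifier.output_labels": ["yes", "no"],
}
REQUIRED_KEYS = {"tokenizer.chat_template.rerank"}
REQUIRED_TENSORS = {"cls.output.weight"}


def missing_rank_markers(metadata: dict, tensor_names: set) -> list:
    """Return a list of human-readable problems; empty means all markers are present."""
    problems = []
    for key, expected in EXPECTED_METADATA.items():
        found = metadata.get(key)
        if found != expected:
            problems.append(f"{key}: expected {expected!r}, found {found!r}")
    for key in sorted(REQUIRED_KEYS):
        if not metadata.get(key):
            problems.append(f"{key}: missing or empty")
    for name in sorted(REQUIRED_TENSORS - set(tensor_names)):
        problems.append(f"tensor {name}: missing")
    return problems


# A plain generative export (no rank head, no rank metadata) fails every check.
for problem in missing_rank_markers({}, {"output.weight"}):
    print(problem)
```

An empty result does not prove that scores are meaningful, only that the file carries the markers listed above.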

Files

| File | Size | Purpose |
| --- | --- | --- |
| `Qwen3-VL-Reranker-2B.f16.gguf` | 3.4 GB | Language model + rank head (f16). Required. |
| `Qwen3-VL-Reranker-2B.mmproj-f16.gguf` | 822 MB | Vision projector (f16). Optional; only needed for image / multimodal document reranking. |

Text-only reranking needs just the LM file.

Usage (llama.cpp)

Start llama-server in reranking mode and call the /v1/rerank endpoint:

```
llama-server -m Qwen3-VL-Reranker-2B.f16.gguf \
  --reranking --pooling rank -ngl 99 --port 8080
```

```
curl http://localhost:8080/v1/rerank -H 'Content-Type: application/json' -d '{
  "query": "What is the capital of France?",
  "documents": [
    "Paris is the capital and most populous city of France.",
    "Bananas are a yellow tropical fruit rich in potassium.",
    "The Eiffel Tower is a famous landmark located in Paris, France."
  ]
}'
```

The response lists documents sorted by `relevance_score` (descending). `--reranking`, `--pooling rank`, and GPU offload (`-ngl`) are all recommended. Without `--reranking`, the server replies "This server does not support reranking".
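
The same request can be made from Python with only the standard library. This is a minimal sketch under one assumption: that the response body has a `results` list whose entries carry `index` and `relevance_score`. Check that shape against your llama.cpp build. The helper names are made up for illustration.

```python
import json
import urllib.request


def build_rerank_request(query, documents, base_url="http://localhost:8080"):
    """Build a POST request for the /v1/rerank endpoint."""
    payload = json.dumps({"query": query, "documents": documents}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/rerank",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ranked_documents(response_body, documents):
    """Pair each score with its document text, highest score first.

    Assumed response shape:
    {"results": [{"index": int, "relevance_score": float}, ...]}
    """
    results = sorted(
        response_body["results"],
        key=lambda item: item["relevance_score"],
        reverse=True,
    )
    return [(item["relevance_score"], documents[item["index"]]) for item in results]


def rerank(query, documents, base_url="http://localhost:8080"):
    """Send the request to a running llama-server and return ranked documents."""
    request = build_rerank_request(query, documents, base_url)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return ranked_documents(body, documents)


# Usage, with llama-server running as shown above:
# for score, text in rerank("What is the capital of France?", ["Paris ...", "Bananas ..."]):
#     print(f"{score:.3f}  {text}")
```

Sorting on the client is redundant if the server already returns sorted results, but it keeps the helper correct either way.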

Multimodal note: the mmproj file enables reranking documents that contain images, via mtmd-aware runtimes. llama.cpp's stock /v1/rerank endpoint is text-only today; image-document reranking requires an mtmd pipeline that feeds the projector and reads the rank score from the pooled output.

Prompt / instruction

The rerank template bakes a default instruction:

Given a web search query, retrieve relevant passages that answer the query

A task-specific instruction (passed as the system message in mtmd pipelines, or substituted into the template) typically improves accuracy by a few points.
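
Substituting an instruction into the template could be done as sketched below. The `TEMPLATE` string is a simplified, hypothetical stand-in: the real text lives in `tokenizer.chat_template.rerank` and will differ. Only the default instruction and the `{query}`/`{document}` placeholders come from this card.

```python
import re

DEFAULT_INSTRUCTION = (
    "Given a web search query, retrieve relevant passages that answer the query"
)

# Hypothetical stand-in for the baked rerank template.
TEMPLATE = (
    "<Instruct>: " + DEFAULT_INSTRUCTION + "\n"
    "<Query>: {query}\n"
    "<Document>: {document}"
)


def fill_rerank_template(template, query, document, instruction=None):
    """Swap in a task-specific instruction, then fill the two placeholders."""
    if instruction is not None:
        if DEFAULT_INSTRUCTION not in template:
            raise ValueError("default instruction not found in template")
        template = template.replace(DEFAULT_INSTRUCTION, instruction)
    values = {"query": query, "document": document}
    # Single-pass substitution, so placeholder-like text inside the query
    # or document is never expanded a second time.
    return re.sub(r"\{(query|document)\}", lambda m: values[m.group(1)], template)


print(fill_rerank_template(
    TEMPLATE,
    query="How do I reverse a string in Python?",
    document="Use slicing with a step of -1.",
    instruction="Given a programming question, retrieve passages that answer it",
))
```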

Verification

Smoke-tested on an RTX 3090 via llama-server --reranking --pooling rank:

```
Query: "What is the capital of France?"
  0.733  Paris is the capital ... of France      (relevant)
  0.514  The Eiffel Tower ... in Paris, France    (related)
  0.270  Bananas are a yellow ... fruit           (irrelevant)

Query: "How do I reverse a string in Python?"
  0.706  ... my_string[::-1] ... reverse order    (relevant)
  0.691  Use slicing with a step of -1 ...         (relevant)
  0.233  The mitochondria is the powerhouse ...    (irrelevant)
```

Relevant and irrelevant documents separate cleanly, which is the expected behaviour of a correctly converted rank head.
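
One way to turn this separation into a repeatable check is to compute the gap between the lowest-scoring relevant document and the highest-scoring irrelevant one. The sketch below reuses the scores from the smoke test; the helper name and the idea of a margin check are illustrative additions, not part of the original verification.

```python
def separation_margin(scored):
    """Lowest relevant score minus highest irrelevant score.

    `scored` is a list of (score, label) pairs; other labels are ignored.
    """
    relevant = [score for score, label in scored if label == "relevant"]
    irrelevant = [score for score, label in scored if label == "irrelevant"]
    return min(relevant) - max(irrelevant)


# Scores copied from the smoke test above.
capital_query = [(0.733, "relevant"), (0.514, "related"), (0.270, "irrelevant")]
python_query = [(0.706, "relevant"), (0.691, "relevant"), (0.233, "irrelevant")]

for name, scored in [("capital", capital_query), ("python", python_query)]:
    print(name, round(separation_margin(scored), 3))
```

A conversion missing the rank head would be expected to show a margin near zero, since its scores are near-constant.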

Conversion

Converted from the official safetensors with llama.cpp's convert_hf_to_gguf.py, which auto-detects the Qwen3-VL reranker and extracts the yes/no rows of the LM head into cls.output.weight:

```
python convert_hf_to_gguf.py Qwen3-VL-Reranker-2B --outtype f16 \
  --outfile Qwen3-VL-Reranker-2B.f16.gguf
python convert_hf_to_gguf.py Qwen3-VL-Reranker-2B --mmproj \
  --outfile Qwen3-VL-Reranker-2B.mmproj-f16.gguf
```
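
The row extraction can be illustrated with a toy LM head. This is a conceptual sketch, not the converter's code: the matrix, token ids, and function names are invented, and the real head has a hidden size of 2048 rather than 3.

```python
def extract_rank_head(lm_head, yes_id, no_id):
    """Keep only the rows of the LM head that produce the yes/no logits."""
    return [list(lm_head[yes_id]), list(lm_head[no_id])]


def project(weight_rows, hidden):
    """Compute one logit per weight row (a plain matrix-vector product)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight_rows]


# Toy LM head: 4 vocabulary entries, hidden size 3 (hypothetical values).
lm_head = [
    [0.1, 0.2, 0.3],
    [1.0, 0.0, -1.0],   # pretend this row belongs to the token "yes"
    [0.5, 0.5, 0.5],
    [-1.0, 2.0, 0.0],   # pretend this row belongs to the token "no"
]
YES_ID, NO_ID = 1, 3    # hypothetical token ids
hidden = [2.0, 1.0, 0.5]

rank_head = extract_rank_head(lm_head, YES_ID, NO_ID)
full_logits = project(lm_head, hidden)

# The 2-row head reproduces exactly the yes/no entries of the full logits.
print(project(rank_head, hidden))
print([full_logits[YES_ID], full_logits[NO_ID]])
```

Because the two rows are copied rather than retrained, the rank head yields the same yes/no logits as the full vocabulary projection while skipping every other token.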

License & attribution

Apache-2.0, inherited from the base model Qwen/Qwen3-VL-Reranker-2B (built on Qwen3-VL-2B-Instruct). All credit for the model itself goes to the Qwen team; these are format conversions only.
