gte-reranker-modernbert-base-GGUF

GGUF conversions of Alibaba-NLP/gte-reranker-modernbert-base, a ModernBERT-based cross-encoder reranker, for use with llama.cpp.

All files were converted with convert_hf_to_gguf.py from upstream llama.cpp, which supports the ModernBertForSequenceClassification architecture and its reranker classification head. Each file was then validated against the original model's published reference scores (see the table in the Validation section below).

Files

| File | Quant | Size | Notes |
|---|---|---|---|
| `gte-reranker-modernbert-base-f16.gguf` | F16 | 287 MB | Reference precision, numerically faithful to the original. |
| `gte-reranker-modernbert-base-Q8_0.gguf` | Q8_0 | 153 MB | Near-lossless. Recommended if you want a smaller file. |
| `gte-reranker-modernbert-base-Q6_K.gguf` | Q6_K | 123 MB | Small additional drift. |
| `gte-reranker-modernbert-base-Q4_K_M.gguf` | Q4_K_M | 101 MB | Smallest; measurable score drift but ranking preserved on test cases. |

At only ~150M parameters, the absolute size savings from aggressive quantization are modest (Q4_K_M is 186 MB smaller than F16), and quantization erodes the fine-grained score discrimination that is the point of a reranker. F16 or Q8_0 is recommended unless you are tightly memory-constrained.
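To put the size trade-off in numbers, the sketch below computes how much each quant saves relative to F16, using only the file sizes listed in the table above. The helper name `savings_vs_f16` is illustrative and not part of any tool.

```python
# File sizes in MB, copied from the Files table above.
SIZES_MB = {"F16": 287, "Q8_0": 153, "Q6_K": 123, "Q4_K_M": 101}


def savings_vs_f16(sizes_mb):
    """Return {quant: (MB saved, percent saved)} relative to the F16 file."""
    reference = sizes_mb["F16"]
    return {
        quant: (reference - size, round(100 * (reference - size) / reference, 1))
        for quant, size in sizes_mb.items()
        if quant != "F16"
    }


if __name__ == "__main__":
    for quant, (saved_mb, saved_pct) in savings_vs_f16(SIZES_MB).items():
        print(f"{quant}: saves {saved_mb} MB ({saved_pct}%)")
```

Even the most aggressive quant saves less than 190 MB in absolute terms, which is why the higher-precision files are the default recommendation.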

Usage

# Build llama.cpp for your backend (CUDA: -DGGML_CUDA=ON, Metal is on by default on macOS)
llama-server -m gte-reranker-modernbert-base-f16.gguf --reranking

# Then POST to /rerank (or /v1/rerank):
curl -X POST http://127.0.0.1:8080/rerank -H "Content-Type: application/json" -d '{
  "query": "what is the capital of China?",
  "documents": ["Beijing", "Shanghai"]
}'
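
The same request can be sent from Python. This is a minimal sketch using only the standard library; the function names are illustrative. It assumes the response contains a `results` list whose items carry `index` and `relevance_score` keys, so verify that against the output of your llama.cpp build before relying on it.

```python
import json
import urllib.request


def build_rerank_payload(query, documents):
    """Build the same JSON body as the curl example above."""
    return json.dumps({"query": query, "documents": list(documents)}).encode("utf-8")


def rank_documents(response, documents):
    """Pair each document with its score, best first.

    ASSUMPTION: the response has a "results" list whose items carry
    "index" and "relevance_score" keys. Check your server's actual output.
    """
    results = sorted(
        response["results"], key=lambda r: r["relevance_score"], reverse=True
    )
    return [(documents[r["index"]], r["relevance_score"]) for r in results]


def rerank(query, documents, url="http://127.0.0.1:8080/rerank"):
    """POST to a running llama-server and return documents ranked by score."""
    request = urllib.request.Request(
        url,
        data=build_rerank_payload(query, documents),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return rank_documents(json.load(reply), documents)
```

With the server from the command above running, `rerank("what is the capital of China?", ["Beijing", "Shanghai"])` would return the documents ordered from most to least relevant.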

Note on score scale: llama.cpp's /rerank returns the raw logit, not the 0–1 score that sentence-transformers' CrossEncoder.predict() returns. To match the original model's 0–1 scores, apply a sigmoid: score = 1 / (1 + exp(-logit)). Ranking order is identical either way (sigmoid is monotonic).
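
The conversion can be sketched in Python as follows. The helper names are illustrative. The two-branch form is mathematically the same as the formula above; it is used here only because a direct `math.exp(-logit)` raises `OverflowError` for very negative logits.

```python
import math


def sigmoid(logit):
    """Map a raw logit to a 0-1 score without overflowing for large |logit|."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    # For negative inputs, exp(logit) is at most 1, so it cannot overflow.
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


def to_probabilities(logits):
    """Apply the sigmoid to every logit, keeping the input order."""
    return [sigmoid(x) for x in logits]
```

Because the function is strictly increasing over the range of realistic logits, sorting by the converted scores gives the same order as sorting by the raw logits.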

Validation

Each quant was checked against the original model card's two published reference pairs (after applying sigmoid to the GGUF logits) and against a 4-document ranking whose ground-truth order from transformers is [0, 2, 1, 3].

| Quant | Pair 1 (capital of China / Beijing) | Pair 2 (quick sort / Introduction) | 4-doc order | Max diff vs published |
|---|---|---|---|---|
| published | 0.894566 | 0.921359 | `[0,2,1,3]` | n/a |
| F16 | 0.894557 | 0.921488 | `[0,2,1,3]` | 1.3e-04 |
| Q8_0 | 0.895888 | 0.921445 | `[0,2,1,3]` | 1.3e-03 |
| Q6_K | 0.891452 | 0.925220 | `[0,2,1,3]` | 3.9e-03 |
| Q4_K_M | 0.879339 | 0.920432 | `[0,2,1,3]` | 1.5e-02 |

All quants preserve the correct ranking on the test cases; score fidelity degrades with quantization as expected.
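
The "max diff vs published" column can be recomputed from the two pair scores in the table. The sketch below does that and also shows one way to derive a ranking order from a list of scores. The function names are illustrative, and the per-document scores behind the 4-document test are not published here, so `ranking_order` is shown without data.

```python
# Published reference scores for (pair 1, pair 2), from the table above.
PUBLISHED = (0.894566, 0.921359)

# (pair 1, pair 2) scores after sigmoid, copied from the table above.
MEASURED = {
    "F16": (0.894557, 0.921488),
    "Q8_0": (0.895888, 0.921445),
    "Q6_K": (0.891452, 0.925220),
    "Q4_K_M": (0.879339, 0.920432),
}


def max_abs_diff(scores, reference=PUBLISHED):
    """Largest absolute deviation from the published scores."""
    return max(abs(s - r) for s, r in zip(scores, reference))


def ranking_order(scores):
    """Indices of the scores, sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    for quant, scores in MEASURED.items():
        print(f"{quant}: {max_abs_diff(scores):.1e}")
```

Running the script prints 1.3e-04, 1.3e-03, 3.9e-03, and 1.5e-02 for F16, Q8_0, Q6_K, and Q4_K_M respectively, matching the last column of the table.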

License & attribution

Apache-2.0, inherited from the base model Alibaba-NLP/gte-reranker-modernbert-base. All credit for the model itself goes to the GTE team at Alibaba-NLP (arXiv:2308.03281). These are format conversions only.

Paper

Towards General Text Embeddings with Multi-stage Contrastive Learning. arXiv:2308.03281, published Aug 7, 2023.