gte-reranker-modernbert-base-GGUF

GGUF conversions of Alibaba-NLP/gte-reranker-modernbert-base, a ModernBERT-based cross-encoder reranker, for use with llama.cpp.

All files were converted with convert_hf_to_gguf.py from upstream llama.cpp (which supports the ModernBertForSequenceClassification architecture and its reranker classification head) and validated against the original model's published reference scores; see the table under Validation.

Files

| File | Quant | Size | Notes |
|---|---|---|---|
| gte-reranker-modernbert-base-f16.gguf | F16 | 287 MB | Reference precision, numerically faithful to the original. |
| gte-reranker-modernbert-base-Q8_0.gguf | Q8_0 | 153 MB | Near-lossless. Recommended if you want a smaller file. |
| gte-reranker-modernbert-base-Q6_K.gguf | Q6_K | 123 MB | Small additional drift. |
| gte-reranker-modernbert-base-Q4_K_M.gguf | Q4_K_M | 101 MB | Smallest; measurable score drift but ranking preserved on test cases. |

At only ~150M parameters the absolute size savings from aggressive quantization are modest, and quantization erodes the fine-grained score discrimination that is the point of a reranker. F16 or Q8_0 is recommended unless you are tightly memory-constrained.
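
To put "modest" in numbers, the following sketch computes the absolute and relative savings of each quant against F16, using only the file sizes listed above. The names SIZES_MB and savings_vs_f16 are illustrative.

```python
# File sizes in MB, copied from the Files table.
SIZES_MB = {"F16": 287, "Q8_0": 153, "Q6_K": 123, "Q4_K_M": 101}


def savings_vs_f16(sizes=SIZES_MB):
    """Return {quant: (MB saved, percent saved)} relative to the F16 file."""
    base = sizes["F16"]
    return {
        quant: (base - mb, round(100 * (base - mb) / base, 1))
        for quant, mb in sizes.items()
        if quant != "F16"
    }


for quant, (saved_mb, saved_pct) in savings_vs_f16().items():
    print(f"{quant:7s} saves {saved_mb} MB ({saved_pct}%) vs F16")
```

Going from Q8_0 to Q4_K_M saves a further 52 MB, which is the trade being weighed against the score drift reported under Validation.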

Usage

# Build llama.cpp for your backend (CUDA: -DGGML_CUDA=ON, Metal is on by default on macOS)
llama-server -m gte-reranker-modernbert-base-f16.gguf --reranking

# Then POST to /rerank (or /v1/rerank):
curl -X POST http://127.0.0.1:8080/rerank -H "Content-Type: application/json" -d '{
  "query": "what is the capital of China?",
  "documents": ["Beijing", "Shanghai"]
}'
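
The same request can be sent from Python. This is a minimal sketch using only the standard library, assuming the server started above is listening on 127.0.0.1:8080. The helper names build_rerank_request and rerank are hypothetical, not part of llama.cpp, and the response is returned as parsed JSON without assuming its exact schema.

```python
import json
import urllib.request


def build_rerank_request(query, documents, base_url="http://127.0.0.1:8080"):
    """Build a POST request for /rerank with the same JSON body as the curl call."""
    payload = json.dumps({"query": query, "documents": documents}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/rerank",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rerank(query, documents, base_url="http://127.0.0.1:8080", timeout=30):
    """Send the request and return the decoded JSON response (needs a running server)."""
    request = build_rerank_request(query, documents, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example, with llama-server running:
#   rerank("what is the capital of China?", ["Beijing", "Shanghai"])
```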

Note on score scale: llama.cpp's /rerank returns the raw logit, not the 0 to 1 score that sentence-transformers' CrossEncoder.predict() returns. To match the original model's 0 to 1 scores, apply a sigmoid: score = 1 / (1 + exp(-logit)). Ranking order is identical either way (sigmoid is monotonic).
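
The conversion can be sketched in Python as follows. The sigmoid itself is exactly the formula above, written in a numerically stable form. The result field names "index" and "relevance_score" are an assumption about the response shape and should be checked against the output of your llama.cpp build; the logits in the example are made up for illustration.

```python
import math


def sigmoid(logit):
    """Map a raw logit to (0, 1) without overflowing for large |logit|."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def to_scores(results):
    """Return (index, score) pairs sorted by descending score.

    Assumes each result is a dict with "index" and "relevance_score" keys,
    where "relevance_score" holds the raw logit (assumed field names).
    """
    scored = [(r["index"], sigmoid(r["relevance_score"])) for r in results]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative logits only, not real model output.
example = [
    {"index": 0, "relevance_score": 2.0},
    {"index": 1, "relevance_score": -1.0},
]
for index, score in to_scores(example):
    print(index, round(score, 4))
```

Sorting by the raw logit would give the same order, so the sigmoid is only needed when the absolute score matters, for example when applying a fixed relevance threshold.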

Validation

Each quant was checked against the original model card's two published reference pairs (after applying sigmoid to the GGUF logits) and against a 4-document ranking whose ground-truth order from transformers is [0, 2, 1, 3].

| Quant | pair 1 (capital of China / Beijing) | pair 2 (quick sort / Introduction) | 4-doc order | max diff vs published |
|---|---|---|---|---|
| published | 0.894566 | 0.921359 | [0,2,1,3] | n/a |
| F16 | 0.894557 | 0.921488 | [0,2,1,3] | 1.3e-04 |
| Q8_0 | 0.895888 | 0.921445 | [0,2,1,3] | 1.3e-03 |
| Q6_K | 0.891452 | 0.925220 | [0,2,1,3] | 3.9e-03 |
| Q4_K_M | 0.879339 | 0.920432 | [0,2,1,3] | 1.5e-02 |

All quants preserve the correct ranking on the test cases; score fidelity degrades with quantization as expected.
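
The "max diff vs published" column can be reproduced from the two pair scores alone. The sketch below does that with the numbers from the table and adds a small ranking check; the function names are illustrative, and the 4-document scores are not listed in this card, so same_ranking is shown only as a helper.

```python
# Post-sigmoid scores for pair 1 and pair 2, copied from the Validation table.
PUBLISHED = (0.894566, 0.921359)
QUANT_SCORES = {
    "F16": (0.894557, 0.921488),
    "Q8_0": (0.895888, 0.921445),
    "Q6_K": (0.891452, 0.925220),
    "Q4_K_M": (0.879339, 0.920432),
}


def max_abs_diff(scores, reference=PUBLISHED):
    """Largest absolute deviation from the published reference scores."""
    return max(abs(s - r) for s, r in zip(scores, reference))


def same_ranking(scores, expected_order):
    """True if sorting document indices by descending score gives expected_order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order == list(expected_order)


for quant, scores in QUANT_SCORES.items():
    print(f"{quant:7s} max diff = {max_abs_diff(scores):.1e}")
```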

License & attribution

Apache-2.0, inherited from the base model Alibaba-NLP/gte-reranker-modernbert-base. All credit for the model itself goes to the GTE team at Alibaba-NLP (arXiv:2308.03281). These are format conversions only.
