gte-reranker-modernbert-base-GGUF
GGUF conversions of Alibaba-NLP/gte-reranker-modernbert-base, a ModernBERT-based cross-encoder reranker, for use with llama.cpp.
All files were converted with convert_hf_to_gguf.py from upstream llama.cpp
(which supports the ModernBertForSequenceClassification architecture and its
reranker classification head) and validated against the original model's
published reference scores (see the Validation table below).
Files
| File | Quant | Size | Notes |
|---|---|---|---|
| gte-reranker-modernbert-base-f16.gguf | F16 | 287 MB | Reference precision, numerically faithful to the original. |
| gte-reranker-modernbert-base-Q8_0.gguf | Q8_0 | 153 MB | Near-lossless. Recommended if you want a smaller file. |
| gte-reranker-modernbert-base-Q6_K.gguf | Q6_K | 123 MB | Small additional drift. |
| gte-reranker-modernbert-base-Q4_K_M.gguf | Q4_K_M | 101 MB | Smallest; measurable score drift but ranking preserved on test cases. |
At only ~150M parameters, the absolute size savings from aggressive quantization are modest, and quantization erodes the fine-grained score discrimination that is the point of a reranker. F16 or Q8_0 is recommended unless you are tightly memory-constrained.
Usage
# Build llama.cpp for your backend (CUDA: -DGGML_CUDA=ON, Metal is on by default on macOS)
llama-server -m gte-reranker-modernbert-base-f16.gguf --reranking
# Then POST to /rerank (or /v1/rerank):
curl -X POST http://127.0.0.1:8080/rerank -H "Content-Type: application/json" -d '{
"query": "what is the capital of China?",
"documents": ["Beijing", "Shanghai"]
}'
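The same request can be sketched from Python using only the standard library. This is a minimal sketch, not part of the original card: the helper names are hypothetical, and it assumes the server replies with a `results` list whose entries carry `index` and `relevance_score` fields, so check the response of your llama.cpp build before relying on those keys.

```python
import json
import urllib.request

# Assumed default address, taken from the curl example above.
DEFAULT_URL = "http://127.0.0.1:8080/rerank"


def build_rerank_request(query, documents, url=DEFAULT_URL):
    """Build the POST request carrying the same JSON body as the curl call."""
    body = json.dumps({"query": query, "documents": documents}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rank_documents(response, documents):
    """Pair each document with its score, best first.

    Assumes response["results"] is a list of dicts with "index" and
    "relevance_score" keys (an assumption about the server's reply shape).
    """
    results = sorted(
        response["results"], key=lambda r: r["relevance_score"], reverse=True
    )
    return [(documents[r["index"]], r["relevance_score"]) for r in results]


def rerank(query, documents, url=DEFAULT_URL):
    """Send the request to a running llama-server and return ranked pairs."""
    request = build_rerank_request(query, documents, url)
    with urllib.request.urlopen(request, timeout=30) as reply:
        return rank_documents(json.load(reply), documents)
```

With the server from the command above running, `rerank("what is the capital of China?", ["Beijing", "Shanghai"])` would return the two documents ordered by score.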
Note on score scale: llama.cpp's /rerank returns the raw logit, not the
0 to 1 score that sentence-transformers' CrossEncoder.predict() returns. To match
the original model's 0 to 1 scores, apply a sigmoid: score = 1 / (1 + exp(-logit)).
Ranking order is identical either way (sigmoid is monotonic).
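The conversion can be illustrated with a short sketch. The logit values below are hypothetical and chosen only for illustration; the function is a numerically stable form of the sigmoid formula given above, and the two sorted index lists show that ordering by logit and ordering by score agree.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw logit to a 0 to 1 score: 1 / (1 + exp(-logit)).

    The branch on sign avoids overflow in exp() for large negative logits.
    """
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


# Hypothetical raw logits for four documents (not real model output).
raw_logits = [2.1, -0.7, 0.4, -3.2]
scores = [sigmoid(x) for x in raw_logits]

# Document indices sorted best-first, once by logit and once by score.
order_by_logit = sorted(
    range(len(raw_logits)), key=lambda i: raw_logits[i], reverse=True
)
order_by_score = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

print(order_by_logit == order_by_score)
```

If you only need the ranking, the sigmoid step can be skipped; it matters when comparing against thresholds or scores produced by the original model.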
Validation
Each quant was checked against the original model card's two published reference
pairs (after applying sigmoid to the GGUF logits) and against a 4-document ranking
whose ground-truth order from transformers is [0, 2, 1, 3].
| Quant | pair 1 (capital of China / Beijing) | pair 2 (quick sort / Introduction) | 4-doc order | max diff vs published |
|---|---|---|---|---|
| published | 0.894566 | 0.921359 | [0,2,1,3] | n/a |
| F16 | 0.894557 | 0.921488 | [0,2,1,3] | 1.3e-04 |
| Q8_0 | 0.895888 | 0.921445 | [0,2,1,3] | 1.3e-03 |
| Q6_K | 0.891452 | 0.925220 | [0,2,1,3] | 3.9e-03 |
| Q4_K_M | 0.879339 | 0.920432 | [0,2,1,3] | 1.5e-02 |
All quants preserve the correct ranking on the test cases; score fidelity degrades with quantization as expected.
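The "max diff vs published" column can be recomputed from the two score columns. The sketch below assumes the column is the larger of the two absolute differences from the published pair scores, which is consistent with the figures in the table; the variable and function names are illustrative.

```python
# Published reference scores for pair 1 and pair 2, copied from the table.
published = (0.894566, 0.921359)

# Post-sigmoid GGUF scores per quant, copied from the table.
measured = {
    "F16": (0.894557, 0.921488),
    "Q8_0": (0.895888, 0.921445),
    "Q6_K": (0.891452, 0.925220),
    "Q4_K_M": (0.879339, 0.920432),
}


def max_abs_diff(scores, reference):
    """Largest absolute per-pair difference between two score tuples."""
    return max(abs(s - r) for s, r in zip(scores, reference))


drift = {quant: max_abs_diff(s, published) for quant, s in measured.items()}

for quant, diff in drift.items():
    print(f"{quant:7s} {diff:.1e}")
# prints:
# F16     1.3e-04
# Q8_0    1.3e-03
# Q6_K    3.9e-03
# Q4_K_M  1.5e-02
```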
License & attribution
Apache-2.0, inherited from the base model Alibaba-NLP/gte-reranker-modernbert-base. All credit for the model itself goes to the GTE team at Alibaba-NLP (arXiv:2308.03281). These are format conversions only.