GroundCheck (ModernBERT-base)
A small encoder for hallucination / grounding detection in RAG. Given a source text and
an answer, it predicts whether the answer is supported by the source (grounded) or not
(hallucinated), with a confidence score. ~150M parameters, runs on CPU.
- Labels: 0 = grounded, 1 = hallucinated; P(grounded) = softmax(logits)[0]
- Code, benchmark, and demo: https://github.com/Pranshurs/groundcheck
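As a minimal sketch of the label convention above, the snippet below turns a pair of logits into P(grounded) and a label string. It uses only the standard library; the helper names `p_grounded` and `label` and the example logits are illustrative assumptions, not part of the GroundCheck code.

```python
import math

def p_grounded(logits):
    """Return softmax(logits)[0], the probability of label 0 (grounded)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[0] / sum(exps)

def label(logits, threshold=0.5):
    """Map logits to a label string using a cutoff on P(grounded)."""
    return "grounded" if p_grounded(logits) >= threshold else "hallucinated"

example_logits = [2.0, -1.0]  # made-up values, not real model output
print(label(example_logits), round(p_grounded(example_logits), 3))
```

Because there are only two classes, P(hallucinated) is simply 1 - P(grounded).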
v2 (current). Retrained on a mixed corpus to fix v1's main weakness: v1 was trained on long RAG documents only and tended to call short or fact-edited inputs "grounded". v2 matches v1 on RAGTruth (F1 0.682 vs 0.688, within the 95% confidence interval) while generalizing far better outside that distribution; see the second table.
Results
RAGTruth test set (n=2,500), example-level detection. F1 is for the positive (hallucinated) class; interval is a 2,000-resample bootstrap.
| Model | Params | F1 | Accuracy |
|---|---|---|---|
| GPT-4-turbo, zero-shot judge¹ | API | 0.634 | — |
| GroundCheck v2 (ModernBERT-base) | 150M | 0.682 (95% CI 0.66–0.71) | 0.747 |
| GroundCheck v1, for reference | 150M | 0.688 (95% CI 0.66–0.71) | 0.774 |
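For readers who want to reproduce this kind of interval, here is a rough sketch of F1 for the positive (hallucinated) class with a bootstrap confidence interval. It assumes a simple percentile bootstrap; the card does not say which bootstrap variant was used, and the function names and toy labels are illustrative only.

```python
import random

def f1_positive(y_true, y_pred, positive=1):
    """F1 for the positive class (1 = hallucinated)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_f1_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1 (assumed variant)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        # Resample example indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_positive([y_true[i] for i in idx],
                                  [y_pred[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_resamples)]
    hi = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi

# Toy labels for illustration; these are not RAGTruth predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0] * 25
y_pred = [1, 0, 0, 1, 0, 1, 1, 0] * 25
print(f1_positive(y_true, y_pred), bootstrap_f1_ci(y_true, y_pred))
```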
Generalization beyond RAGTruth (v1 had no signal here):
| Suite | v2 |
|---|---|
| VitaminC test (short claim–evidence pairs, n=2,000), accuracy | 0.850 |
| Single-fact flips in grounded answers caught (n=500) | 80.4% |
| Same answers unflipped, correctly kept grounded (n=500) | 76.2% |
¹ Published prompt-judge baseline (LettuceDetect, 2025). On RAGTruth, GroundCheck v2 scores a higher F1 than the zero-shot GPT-4-turbo judge (0.682 vs 0.634) while running on CPU.
Usage
With transformers:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
name = "Pranshurs/groundcheck-modernbert"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()
source = "France's capital and largest city is Paris."
answer = "Paris is the capital of France."
enc = tok(source, answer, truncation="only_first", max_length=512, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**enc).logits, dim=-1)[0]
grounded = probs[0].item() # P(answer is supported)
print("grounded" if grounded >= 0.5 else "hallucinated", round(grounded, 3))
With the GroundCheck library:
from groundcheck import GroundCheck
gc = GroundCheck()
print(gc.check(source="...", answer="..."))
Training
Fine-tuned answerdotai/ModernBERT-base as a sequence-pair classifier (premise = optional
question + source, hypothesis = answer). v2 trains on 28,500 examples:
- 10k RAGTruth (long RAG documents; hallucinated if any span was annotated unsupported)
- 16k VitaminC (short claim–evidence pairs; SUPPORTS → grounded, REFUTES and NOT-ENOUGH-INFO → hallucinated)
- 2.5k programmatic hard negatives: grounded RAGTruth answers with exactly one fact flipped (a number, a date, a direction word such as increased→decreased, or a named entity)
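The hard-negative idea can be sketched as follows. This is a simplified illustration, not the project's actual generator: it handles only direction words and integers, the swap table beyond increased→decreased is an assumption, and `flip_one_fact` is a hypothetical helper name.

```python
import re

# Assumed swap table; the card names only increased -> decreased.
DIRECTION_SWAPS = {"increased": "decreased", "decreased": "increased"}

def flip_one_fact(answer):
    """Return `answer` with exactly one fact changed, or None if nothing to flip."""
    # Prefer a direction word if one is present.
    for word, opposite in DIRECTION_SWAPS.items():
        pattern = re.compile(rf"\b{word}\b")
        if pattern.search(answer):
            return pattern.sub(opposite, answer, count=1)
    # Otherwise perturb the first integer in the text.
    match = re.search(r"\d+", answer)
    if match:
        flipped = str(int(match.group()) + 1)
        return answer[:match.start()] + flipped + answer[match.end():]
    return None

grounded_answer = "Revenue increased in 2019."  # made-up example
negative = flip_one_fact(grounded_answer)
# The flipped answer is paired with the original source and labeled 1 (hallucinated).
print(negative)
```

Dates and named entities would need their own rules (for example, swapping in a different entity of the same type), which this sketch leaves out.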
The question is dropped on half the RAGTruth rows, so inference without a question matches training. 3 epochs, sequence length 512, batch 16, fp16 on a single P100.
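A minimal sketch of how training pairs might be assembled under these rules is shown below. The newline separator between question and source, the exact label strings, and the helper names are assumptions made for illustration; the repository may do this differently.

```python
import random

# Label mapping as described in the list above (0 = grounded, 1 = hallucinated).
# The exact label strings are an assumption.
VITAMINC_LABELS = {"SUPPORTS": 0, "REFUTES": 1, "NOT-ENOUGH-INFO": 1}

def build_premise(source, question=None, drop_question_prob=0.5, rng=random):
    """Build the premise, dropping the question with the given probability."""
    if question is None or rng.random() < drop_question_prob:
        return source
    return f"{question}\n{source}"  # separator is an assumption

def make_pair(source, answer, label, question=None, rng=random):
    """Return (premise, hypothesis, label) for sequence-pair classification."""
    return build_premise(source, question, rng=rng), answer, label

rng = random.Random(0)
pair = make_pair(
    source="France's capital and largest city is Paris.",
    answer="Paris is the capital of France.",
    label=VITAMINC_LABELS["SUPPORTS"],
    question="What is the capital of France?",
    rng=rng,
)
print(pair)
```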
Intended use & limitations
Designed for research and non-commercial use. It checks support against the provided source only — it is not a world-knowledge fact-checker. v2 trades a little precision for recall on long fact-dense documents: it flags more hallucinations than v1, at the cost of some extra false alarms on correct answers (RAGTruth accuracy 0.747 vs 0.774).
License & credits
MIT. Base model answerdotai/ModernBERT-base (Apache-2.0). Trained on
RAGTruth (MIT; built from MS MARCO, the Yelp
Open Dataset, and CNN/DailyMail) and VitaminC
(CC BY-SA 3.0) — see those sources for their terms before any commercial use.