GroundCheck (ModernBERT-base)

A small encoder for hallucination / grounding detection in RAG. Given a source text and an answer, it predicts whether the answer is supported by the source (grounded) or not (hallucinated), with a confidence score. ~150M parameters, runs on CPU.

v2 (current). Retrained on a mixed corpus to fix v1's main weakness: v1 was trained on long RAG documents only and tended to call short or fact-edited inputs "grounded". v2 keeps the same RAGTruth score as v1 (0.682 vs 0.688, within the confidence interval) while generalizing far better outside that distribution — see the second table.

Results

RAGTruth test set (n=2,500), example-level detection. F1 is for the positive (hallucinated) class; interval is a 2,000-resample bootstrap.

Model Params F1 Accuracy
GPT-4-turbo, zero-shot judge¹ API 0.634
GroundCheck v2 (ModernBERT-base) 150M 0.682 (95% CI 0.66–0.71) 0.747
GroundCheck v1, for reference 150M 0.688 (95% CI 0.66–0.71) 0.774

Generalization beyond RAGTruth (v1 had no signal here):

Suite v2
VitaminC test (short claim–evidence pairs, n=2,000), accuracy 0.850
Single-fact flips in grounded answers caught (n=500) 80.4%
Same answers unflipped, correctly kept grounded (n=500) 76.2%

¹ Published prompt-judge baseline (LettuceDetect, 2025). GroundCheck outperforms prompt-based GPT-4 judging while running on CPU.

Usage

With transformers:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

name = "Pranshurs/groundcheck-modernbert"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

source = "France's capital and largest city is Paris."
answer = "Paris is the capital of France."
enc = tok(source, answer, truncation="only_first", max_length=512, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**enc).logits, dim=-1)[0]
grounded = probs[0].item()                      # P(answer is supported)
print("grounded" if grounded >= 0.5 else "hallucinated", round(grounded, 3))

With the GroundCheck library:

from groundcheck import GroundCheck
gc = GroundCheck()
print(gc.check(source="...", answer="..."))

Training

Fine-tuned answerdotai/ModernBERT-base as a sequence-pair classifier (premise = optional question + source, hypothesis = answer). v2 trains on 28,500 examples:

  • 10k RAGTruth (long RAG documents; hallucinated if any span was annotated unsupported)
  • 16k VitaminC (short claim–evidence pairs; SUPPORTS → grounded, REFUTES and NOT-ENOUGH-INFO → hallucinated)
  • 2.5k programmatic hard negatives: grounded RAGTruth answers with exactly one fact flipped (a number, a date, a direction word such as increased→decreased, or a named entity)

The question is dropped on half the RAGTruth rows, so inference without a question matches training. 3 epochs, sequence length 512, batch 16, fp16 on a single P100.

Intended use & limitations

Designed for research and non-commercial use. It checks support against the provided source only — it is not a world-knowledge fact-checker. v2 trades a little precision for recall on long fact-dense documents: it flags more hallucinations than v1, at the cost of some extra false alarms on correct answers (RAGTruth accuracy 0.747 vs 0.774).

License & credits

MIT. Base model answerdotai/ModernBERT-base (Apache-2.0). Trained on RAGTruth (MIT; built from MS MARCO, the Yelp Open Dataset, and CNN/DailyMail) and VitaminC (CC BY-SA 3.0) — see those sources for their terms before any commercial use.

Downloads last month
18
Safetensors
Model size
0.1B params
Tensor type
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Pranshurs/groundcheck-modernbert

Finetuned
(1324)
this model

Datasets used to train Pranshurs/groundcheck-modernbert