GroundCheck (ModernBERT-base)

A small encoder for hallucination / grounding detection in RAG. Given a source text and an answer, it predicts whether the answer is supported by the source (grounded) or not (hallucinated), with a confidence score. ~150M parameters, runs on CPU.

Labels: 0 = grounded, 1 = hallucinated
P(grounded) = softmax(logits)[0]
Code, benchmark, and demo: https://github.com/Pranshurs/groundcheck

v2 (current). Retrained on a mixed corpus to fix v1's main weakness: v1 was trained on long RAG documents only and tended to call short or fact-edited inputs "grounded". v2 keeps the same RAGTruth score as v1 (0.682 vs 0.688, within the confidence interval) while generalizing far better outside that distribution — see the second table.

Results

RAGTruth test set (n=2,500), example-level detection. F1 is for the positive (hallucinated) class; interval is a 2,000-resample bootstrap.

Model	Params	F1	Accuracy
GPT-4-turbo, zero-shot judge¹	API	0.634	—
GroundCheck v2 (ModernBERT-base)	150M	0.682 (95% CI 0.66–0.71)	0.747
GroundCheck v1, for reference	150M	0.688 (95% CI 0.66–0.71)	0.774

Generalization beyond RAGTruth (v1 had no signal here):

Suite	v2
VitaminC test (short claim–evidence pairs, n=2,000), accuracy	0.850
Single-fact flips in grounded answers caught (n=500)	80.4%
Same answers unflipped, correctly kept grounded (n=500)	76.2%

¹ Published prompt-judge baseline (LettuceDetect, 2025). GroundCheck outperforms prompt-based GPT-4 judging while running on CPU.

Usage

With transformers:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

name = "Pranshurs/groundcheck-modernbert"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

source = "France's capital and largest city is Paris."
answer = "Paris is the capital of France."
enc = tok(source, answer, truncation="only_first", max_length=512, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**enc).logits, dim=-1)[0]
grounded = probs[0].item()                      # P(answer is supported)
print("grounded" if grounded >= 0.5 else "hallucinated", round(grounded, 3))

With the GroundCheck library:

from groundcheck import GroundCheck
gc = GroundCheck()
print(gc.check(source="...", answer="..."))

Training

Fine-tuned answerdotai/ModernBERT-base as a sequence-pair classifier (premise = optional question + source, hypothesis = answer). v2 trains on 28,500 examples:

10k RAGTruth (long RAG documents; hallucinated if any span was annotated unsupported)
16k VitaminC (short claim–evidence pairs; SUPPORTS → grounded, REFUTES and NOT-ENOUGH-INFO → hallucinated)
2.5k programmatic hard negatives: grounded RAGTruth answers with exactly one fact flipped (a number, a date, a direction word such as increased→decreased, or a named entity)

The question is dropped on half the RAGTruth rows, so inference without a question matches training. 3 epochs, sequence length 512, batch 16, fp16 on a single P100.

Intended use & limitations

Designed for research and non-commercial use. It checks support against the provided source only — it is not a world-knowledge fact-checker. v2 trades a little precision for recall on long fact-dense documents: it flags more hallucinations than v1, at the cost of some extra false alarms on correct answers (RAGTruth accuracy 0.747 vs 0.774).

License & credits

MIT. Base model answerdotai/ModernBERT-base (Apache-2.0). Trained on RAGTruth (MIT; built from MS MARCO, the Yelp Open Dataset, and CNN/DailyMail) and VitaminC (CC BY-SA 3.0) — see those sources for their terms before any commercial use.

Downloads last month: 18

Safetensors

Model size

0.1B params

Tensor type

F32

Model tree for Pranshurs/groundcheck-modernbert

Base model

answerdotai/ModernBERT-base

Finetuned

(1324)

this model

Pranshurs
/

groundcheck-modernbert