attest-grounding-large

A 0.4B-parameter NLI model fine-tuned to detect ungrounded claims in RAG answers, i.e. sentences an LLM stated that the retrieved sources don't actually support. On the RAGTruth benchmark it comes within 0.01 F1 of a Claude Opus LLM-as-judge (0.75 vs 0.76) and beats it on precision (0.73 vs 0.64), at $0 vs ~$12.73 per 1,000 checks.

Grounding is framed as Natural Language Inference: a claim is supported if a source entails it. The model keeps the base 3-class NLI head (entailment / neutral / contradiction); read the entailment probability as the grounding score.

Full project, benchmark harness, and methodology: https://github.com/Metry630/attest

Results — RAGTruth (500 held-out test examples, zero train/test source overlap)

System	Size	Acc	Precision	Recall	F1	Cost / 1k
base DeBERTa-MNLI	0.18B	0.60	0.48	0.89	0.63	$0
Vectara HHEM-2.1-open	0.1B	0.72	0.59	0.88	0.71	$0
off-the-shelf DeBERTa-large-MNLI	0.4B	0.60	0.49	0.92	0.64	$0
this model (fine-tuned)	0.4B	0.81	0.73	0.78	0.75	$0
Claude Opus 4.8 (LLM judge)	—	0.78	0.64	0.92	0.76	$12.73

The gain comes from fine-tuning, not size: the same 0.4B architecture off-the-shelf scores 0.64 F1, essentially the same as the 0.18B base (0.63). These numbers are consistent with published work (prompt-based GPT-4-turbo ≈ 0.63, LettuceDetect-large ≈ 0.79).
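For readers comparing rows, F1 is the harmonic mean of precision and recall. The snippet below is a small sketch that recomputes it from the rounded table values; the function name `f1_score` is illustrative and not part of the attest project. Because the table entries are rounded to two decimals, a recomputed value can land 0.01 away from the reported figure.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision/recall values copied from the results table.
rows = {
    "this model (fine-tuned)": (0.73, 0.78),
    "Vectara HHEM-2.1-open": (0.59, 0.88),
}

for name, (p, r) in rows.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}")
```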

Usage

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "Metry63/attest-grounding-large"
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

# Look up the entailment class index from the config instead of hard-coding it.
ent_idx = next(i for i, l in model.config.id2label.items() if "entail" in l.lower())

source = "The Eiffel Tower was completed in 1889 and stands 330 metres tall in Paris."
claim  = "The Eiffel Tower is the tallest building in the world."

# Premise = source, hypothesis = claim; inputs longer than 512 tokens are truncated.
with torch.inference_mode():
    logits = model(**tok(source, claim, return_tensors="pt", truncation=True, max_length=512)).logits
supported = logits.softmax(-1)[0][ent_idx].item()
print(f"grounded (entailment) prob: {supported:.2f}")   # ~0.0 here -> not supported

For the full response-level pipeline (sentence splitting, chunk retrieval, and aggregation), use the attest library.
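To make the shape of such a pipeline concrete, here is a minimal sketch of sentence splitting and aggregation. It is not the attest library's API: `split_sentences`, `check_response`, `response_is_grounded`, the 0.5 threshold, and the max-over-chunks / all-sentences aggregation are all assumptions made for illustration, and the retrieval step is skipped by scoring every sentence against every chunk. The `toy_score` function is a word-overlap stand-in so the example runs without downloading weights; in practice the scorer would return the model's entailment probability.

```python
import re
from typing import Callable, Dict, List

# Hypothetical scorer signature: (source_chunk, claim) -> entailment probability.
Scorer = Callable[[str, str], float]


def split_sentences(text: str) -> List[str]:
    """Naive splitter on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def check_response(response: str, chunks: List[str], score: Scorer,
                   threshold: float = 0.5) -> List[Dict]:
    """Score each sentence against every chunk and keep the best chunk score."""
    results = []
    for sentence in split_sentences(response):
        best = max((score(chunk, sentence) for chunk in chunks), default=0.0)
        results.append({"sentence": sentence, "score": best,
                        "supported": best >= threshold})
    return results


def response_is_grounded(results: List[Dict]) -> bool:
    """Assumed aggregation: grounded only if every sentence is supported."""
    return all(r["supported"] for r in results)


def toy_score(chunk: str, claim: str) -> float:
    """Stand-in scorer: fraction of claim words that also appear in the chunk."""
    claim_words = set(re.findall(r"\w+", claim.lower()))
    chunk_words = set(re.findall(r"\w+", chunk.lower()))
    return len(claim_words & chunk_words) / len(claim_words) if claim_words else 0.0


results = check_response(
    "Paris is in France. The moon is cheese.",
    ["Paris is in France."],
    toy_score,
)
for r in results:
    print(r)
print("grounded:", response_is_grounded(results))
```

Taking the maximum over chunks reflects the framing above that a claim is supported if some source entails it; other aggregation rules are possible.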

Training

Base: MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli (3-class NLI).
Data: ~112k sentence-level examples derived from RAGTruth's character-level hallucination spans. A sentence overlapping an evident-conflict span is labeled contradiction, one overlapping a baseless-info span is labeled neutral, and every other sentence is labeled entailment.
Setup: class-weighted loss (grounded sentences dominate), early stopping.
Evaluated on the RAGTruth test split, which shares zero source passages with train.
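The labeling rule and class weighting can be sketched as follows. This is an illustration under stated assumptions, not the project's training code: the span type strings `"conflict"` and `"baseless"` are placeholders rather than RAGTruth's actual field values, contradiction is assumed to take precedence when a sentence overlaps both span types, and inverse-frequency weighting is one common choice since the exact weighting scheme is not stated here.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (start, end, span_type) in character offsets, end exclusive.
# The span_type strings are placeholders, not RAGTruth's real field values.
Span = Tuple[int, int, str]


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if the half-open intervals [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end


def label_sentence(sent_start: int, sent_end: int, spans: List[Span]) -> str:
    """Map a sentence's character range to an NLI label from overlapping spans."""
    types = {t for (s, e, t) in spans if overlaps(sent_start, sent_end, s, e)}
    if "conflict" in types:      # assumed precedence over baseless info
        return "contradiction"
    if "baseless" in types:
        return "neutral"
    return "entailment"


def inverse_frequency_weights(labels: List[str]) -> Dict[str, float]:
    """Weight each class by total / (num_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / (len(counts) * n) for label, n in counts.items()}


spans = [(5, 8, "conflict"), (40, 55, "baseless")]
labels = [
    label_sentence(0, 10, spans),    # overlaps the conflict span
    label_sentence(11, 39, spans),   # overlaps nothing
    label_sentence(39, 60, spans),   # overlaps the baseless span
]
print(labels)
print(inverse_frequency_weights(labels + ["entailment"] * 7))
```

With grounded sentences dominating, the rarer neutral and contradiction classes receive larger weights, which is the intent of a class-weighted loss.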

Limitations

The LLM judge has higher recall (0.92 vs 0.78): it catches more hallucinations, at the cost of more false positives (precision 0.64 vs 0.73). This model is the more precise detector, not the most sensitive one.
Not SOTA: the purpose-built LettuceDetect-large reports a higher F1 (≈ 0.79).
English only; evaluated on RAGTruth (news summary, QA, data-to-text). Behavior on other domains is untested.

Credit

Builds on the NLI-as-factual-consistency line (TRUE, MiniCheck, AlignScore, LettuceDetect). Benchmark: RAGTruth.

Weights: Safetensors, 0.4B params, F32.

Paper: RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models (arXiv:2401.00396, published Dec 31, 2023)