attest-grounding-large

A 0.4B-parameter NLI model fine-tuned to detect ungrounded claims in RAG answers, i.e. sentences an LLM stated that the retrieved sources don't actually support. On the RAGTruth benchmark it comes within 0.01 F1 of a Claude Opus LLM-as-judge (0.75 vs 0.76) and beats it on precision (0.73 vs 0.64), at $0 vs ~$12.73 per 1,000 checks.

Grounding is framed as Natural Language Inference: a claim is supported if a source entails it. The model keeps the base 3-class NLI head (entailment / neutral / contradiction) — read the entailment probability as the grounding score.

Full project, benchmark harness, and methodology: https://github.com/Metry630/attest

Results — RAGTruth (500 held-out test examples, zero train/test source overlap)

System                            Size   Acc   Precision  Recall  F1    Cost / 1k
base DeBERTa-MNLI                 0.18B  0.60  0.48       0.89    0.63  $0
Vectara HHEM-2.1-open             0.1B   0.72  0.59       0.88    0.71  $0
off-the-shelf DeBERTa-large-MNLI  0.4B   0.60  0.49       0.92    0.64  $0
this model (fine-tuned)           0.4B   0.81  0.73       0.78    0.75  $0
Claude Opus 4.8 (LLM judge)       -      0.78  0.64       0.92    0.76  $12.73

The gain comes from fine-tuning, not size: the same 0.4B architecture off-the-shelf scores 0.64 F1, essentially the same as the 0.18B base (0.63). These results are consistent with published work (prompt-based GPT-4-turbo ≈ 0.63, LettuceDetect-large ≈ 0.79).
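
As a quick sanity check on the table, F1 can be recomputed from the precision and recall columns. The sketch below does this in plain Python. The `rows` dictionary only transcribes the table above, and the 0.01 tolerance is an assumption to absorb rounding, since the published values carry two decimals.

```python
# Recompute F1 = 2 * P * R / (P + R) from the rounded table values.
# Each entry is (precision, recall, reported F1), transcribed from the table.
rows = {
    "base DeBERTa-MNLI": (0.48, 0.89, 0.63),
    "Vectara HHEM-2.1-open": (0.59, 0.88, 0.71),
    "off-the-shelf DeBERTa-large-MNLI": (0.49, 0.92, 0.64),
    "this model (fine-tuned)": (0.73, 0.78, 0.75),
    "Claude Opus 4.8 (LLM judge)": (0.64, 0.92, 0.76),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


recomputed = {name: f1_score(p, r) for name, (p, r, _) in rows.items()}

for name, (_, _, reported) in rows.items():
    print(f"{name:34s} reported={reported:.2f} recomputed={recomputed[name]:.3f}")
```

Because the inputs are already rounded, a recomputed value can differ from the reported F1 in the second decimal.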

Usage

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and the 3-class NLI model; eval() disables dropout.
tok = AutoTokenizer.from_pretrained("Metry63/attest-grounding-large")
model = AutoModelForSequenceClassification.from_pretrained("Metry63/attest-grounding-large").eval()
# Look up the entailment class index from the config instead of hard-coding it.
ent_idx = next(i for i, l in model.config.id2label.items() if "entail" in l.lower())

source = "The Eiffel Tower was completed in 1889 and stands 330 metres tall in Paris."
claim  = "The Eiffel Tower is the tallest building in the world."

# Encode the pair as (premise=source, hypothesis=claim), truncated to 512 tokens.
with torch.inference_mode():
    logits = model(**tok(source, claim, return_tensors="pt", truncation=True, max_length=512)).logits
# Softmax over the 3 classes; the entailment probability is the grounding score.
supported = logits.softmax(-1)[0][ent_idx].item()
print(f"grounded (entailment) prob: {supported:.2f}")   # ~0.0 here -> not supported

For the full response-level pipeline (sentence splitting, chunk retrieval, and aggregation), use the attest library.
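
A response-level check can be sketched on top of the sentence-level score. The example below is a minimal illustration, not the attest library's API. It assumes a claim counts as supported if at least one source chunk entails it (max over chunks), and it uses a hypothetical threshold of 0.5. `score_fn` is a placeholder for any function that returns an entailment probability, such as the model call above. A toy word-overlap scorer stands in for it so the sketch runs without downloading weights.

```python
from typing import Callable, List, Tuple


def grounding_report(
    claims: List[str],
    chunks: List[str],
    score_fn: Callable[[str, str], float],
    threshold: float = 0.5,  # assumed cut-off, not a value from the model card
) -> List[Tuple[str, float, bool]]:
    """Score each claim against every chunk and keep the best-supporting chunk."""
    report = []
    for claim in claims:
        best = max((score_fn(chunk, claim) for chunk in chunks), default=0.0)
        report.append((claim, best, best >= threshold))
    return report


def toy_score(source: str, claim: str) -> float:
    """Stand-in scorer: fraction of claim words found in the source. Not an NLI model."""
    source_words = set(source.lower().split())
    claim_words = claim.lower().split()
    if not claim_words:
        return 0.0
    return sum(w in source_words for w in claim_words) / len(claim_words)


chunks = [
    "the tower was completed in 1889",
    "it stands 330 metres tall in paris",
]
claims = [
    "the tower was completed in 1889",
    "it is the tallest building worldwide",
]

report = grounding_report(claims, chunks, toy_score)
ungrounded = [claim for claim, _, ok in report if not ok]
print(ungrounded)
```

Taking the maximum over chunks reflects the "a source entails it" framing. Other aggregation rules, such as scoring against the concatenated sources, are equally plausible.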

Training

  • Base: MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli (3-class NLI).
  • Data: ~112k sentence-level examples derived from RAGTruth's character-level hallucination spans: a sentence overlapping an evident-conflict span is labeled contradiction, one overlapping a baseless-info span is labeled neutral, and every other sentence is labeled entailment.
  • Setup: class-weighted loss (grounded sentences dominate), early stopping.
  • Evaluated on the RAGTruth test split, which shares zero source passages with train.
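
The labeling rule and class weighting described above can be sketched as follows. This is an illustration under stated assumptions, not the project's training code. The span `kind` strings (`"conflict"`, `"baseless"`) are hypothetical stand-ins for RAGTruth's annotation types. Contradiction is assumed to take precedence when a sentence overlaps both kinds of span. Inverse-frequency weights are one common choice for a class-weighted loss; the exact scheme used here is not stated.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (start, end, kind) in character offsets, end exclusive.
# The kind strings are hypothetical names chosen for this sketch.
Span = Tuple[int, int, str]


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if the half-open ranges [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end


def label_sentence(sent_start: int, sent_end: int, spans: List[Span]) -> str:
    """Map a sentence to an NLI label based on the hallucination spans it overlaps."""
    kinds = {kind for s, e, kind in spans if overlaps(sent_start, sent_end, s, e)}
    if "conflict" in kinds:   # assumed precedence: conflict wins over baseless
        return "contradiction"
    if "baseless" in kinds:
        return "neutral"
    return "entailment"


def inverse_frequency_weights(labels: List[str]) -> Dict[str, float]:
    """Weight each class by total / (num_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / (len(counts) * n) for label, n in counts.items()}


# Four sentences given as character ranges, with two annotated spans.
sentences = [(0, 40), (41, 90), (91, 130), (131, 170)]
spans = [(50, 60, "conflict"), (100, 110, "baseless")]

labels = [label_sentence(start, end, spans) for start, end in sentences]
weights = inverse_frequency_weights(labels)
print(labels)
print(weights)
```

Weights like these could be passed to a weighted cross-entropy loss, for example through the `weight` argument of `torch.nn.CrossEntropyLoss`, ordered by the model's label ids.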

Limitations

  • The LLM judge has higher recall (0.92 vs 0.78): it catches more hallucinations, but with more false positives (precision 0.64 vs 0.73). This model is the more precise detector, not the most sensitive one.
  • Not SOTA: the purpose-built LettuceDetect-large (F1 ≈ 0.79) scores higher.
  • English only; evaluated on RAGTruth (news summary, QA, data-to-text). Behavior on other domains is untested.

Credit

Builds on the NLI-as-factual-consistency line (TRUE, MiniCheck, AlignScore, LettuceDetect). Benchmark: RAGTruth.
