attest-grounding-large
A 0.4B NLI model fine-tuned to detect ungrounded claims in RAG answers — i.e. sentences an LLM stated that the retrieved sources don't actually support. On the RAGTruth benchmark it matches a Claude Opus LLM-as-judge on F1 (0.75 vs 0.76) and beats it on precision, at $0 vs ~$12.73 per 1,000 checks.
Grounding is framed as Natural Language Inference: a claim is supported if a source entails it. The model keeps the base 3-class NLI head (entailment / neutral / contradiction) — read the entailment probability as the grounding score.
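As a minimal sketch of that framing, the snippet below turns 3-class NLI logits into a grounding score and treats a claim as supported by its best-matching source. It uses only the standard library. The label order, the example logits, and the helper names (`softmax`, `entailment_index`, `grounding_score`) are illustrative assumptions, not part of the model or the attest library.

```python
import math
from typing import Dict, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entailment_index(id2label: Dict[int, str]) -> int:
    """Find the entailment class by name instead of hard-coding its position."""
    return next(i for i, name in id2label.items() if "entail" in name.lower())


def grounding_score(per_source_logits: Sequence[Sequence[float]],
                    id2label: Dict[int, str]) -> float:
    """A claim is supported if *a* source entails it, so take the max
    entailment probability across all (source, claim) pairs."""
    idx = entailment_index(id2label)
    return max(softmax(row)[idx] for row in per_source_logits)


# Hypothetical label order and logits, for illustration only.
ID2LABEL = {0: "entailment", 1: "neutral", 2: "contradiction"}
logits = [
    [-2.0, 3.0, 0.5],   # source 1: mostly neutral toward the claim
    [4.0, 0.0, -1.0],   # source 2: strongly entails the claim
]
score = grounding_score(logits, ID2LABEL)
print(f"grounding score: {score:.2f}")
```

Taking the maximum is one possible aggregation; it matches "supported if a source entails it" but ignores contradicting sources.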
Full project, benchmark harness, and methodology: https://github.com/Metry630/attest
Results — RAGTruth (500 held-out test examples, zero train/test source overlap)
| System | Size | Acc | Precision | Recall | F1 | Cost / 1k |
|---|---|---|---|---|---|---|
| base DeBERTa-MNLI | 0.18B | 0.60 | 0.48 | 0.89 | 0.63 | $0 |
| Vectara HHEM-2.1-open | 0.1B | 0.72 | 0.59 | 0.88 | 0.71 | $0 |
| off-the-shelf DeBERTa-large-MNLI | 0.4B | 0.60 | 0.49 | 0.92 | 0.64 | $0 |
| this model (fine-tuned) | 0.4B | 0.81 | 0.73 | 0.78 | 0.75 | $0 |
| Claude Opus 4.8 (LLM judge) | — | 0.78 | 0.64 | 0.92 | 0.76 | $12.73 |
The gain comes from fine-tuning, not size: the same 0.4B architecture off-the-shelf scores 0.64 F1, essentially the same as the 0.18B base (0.63). These results are consistent with published work (prompt-based GPT-4-turbo ≈ 0.63, LettuceDetect-large ≈ 0.79).
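For readers who want to sanity-check the table, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values reported above. Because those inputs are already rounded to two decimals, the recomputed value can differ from the reported F1 by up to about 0.01. The `ROWS` dictionary and the `f1` helper are illustrative names.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results table.
ROWS = {
    "base DeBERTa-MNLI": (0.48, 0.89, 0.63),
    "Vectara HHEM-2.1-open": (0.59, 0.88, 0.71),
    "off-the-shelf DeBERTa-large-MNLI": (0.49, 0.92, 0.64),
    "this model (fine-tuned)": (0.73, 0.78, 0.75),
    "Claude Opus 4.8 (LLM judge)": (0.64, 0.92, 0.76),
}

for name, (p, r, reported) in ROWS.items():
    # Inputs are rounded, so expect agreement only to within ~0.01.
    print(f"{name}: recomputed {f1(p, r):.3f}, reported {reported:.2f}")
```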
Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("Metry63/attest-grounding-large")
model = AutoModelForSequenceClassification.from_pretrained("Metry63/attest-grounding-large").eval()

# Look up the entailment class by name rather than hard-coding its index.
ent_idx = next(i for i, l in model.config.id2label.items() if "entail" in l.lower())

source = "The Eiffel Tower was completed in 1889 and stands 330 metres tall in Paris."
claim = "The Eiffel Tower is the tallest building in the world."

# The source is the NLI premise and the claim is the hypothesis.
with torch.inference_mode():
    logits = model(**tok(source, claim, return_tensors="pt", truncation=True, max_length=512)).logits

supported = logits.softmax(-1)[0][ent_idx].item()
print(f"grounded (entailment) prob: {supported:.2f}")  # ~0.0 here -> not supported
```
For the full response-level pipeline (sentence splitting, chunk retrieval, and
aggregation), use the attest library.
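To illustrate what response-level checking involves, here is a rough sketch of sentence splitting and aggregation. It is not the attest library's API: `split_sentences`, `check_response`, `response_is_grounded`, the regex splitter, the 0.5 threshold, and the `toy_score` stand-in are all assumptions made for this example. In practice `score_fn` would wrap the model call shown above, and chunk retrieval is omitted here.

```python
import re
from typing import Callable, List, NamedTuple, Sequence


class SentenceResult(NamedTuple):
    sentence: str
    score: float
    grounded: bool


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def check_response(
    sources: Sequence[str],
    response: str,
    score_fn: Callable[[Sequence[str], str], float],
    threshold: float = 0.5,  # assumed cut-off, tune on your own data
) -> List[SentenceResult]:
    """Score every sentence of the response against the sources."""
    results = []
    for sentence in split_sentences(response):
        score = score_fn(sources, sentence)
        results.append(SentenceResult(sentence, score, score >= threshold))
    return results


def response_is_grounded(results: Sequence[SentenceResult]) -> bool:
    """Strict aggregation: one ungrounded sentence fails the whole response."""
    return all(r.grounded for r in results)


def toy_score(sources: Sequence[str], sentence: str) -> float:
    """Stand-in scorer (exact substring match) so the sketch runs offline."""
    return 1.0 if any(sentence in src for src in sources) else 0.0


results = check_response(
    ["Paris is in France. It is large."],
    "Paris is in France. It has no rivers.",
    toy_score,
)
for r in results:
    print(r)
```

The strict "all sentences must pass" rule favours recall on hallucinations; a softer rule, such as the fraction of grounded sentences, is an equally valid design choice.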
Training
- Base: MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli (3-class NLI).
- Data: ~112k sentence-level examples derived from RAGTruth's character-level hallucination spans: a sentence overlapping an evident-conflict span is labeled contradiction, a baseless-info span neutral, otherwise entailment.
- Setup: class-weighted loss (grounded sentences dominate), early stopping.
- Evaluated on the RAGTruth test split, which shares zero source passages with train.
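The labeling rule and class weighting described above can be sketched as follows. This is an illustration under stated assumptions, not the project's training code: the span kinds `"conflict"` and `"baseless"` are simplified stand-ins for RAGTruth's span types, conflict is assumed to take precedence when a sentence overlaps both kinds, and inverse-frequency weighting is one common scheme (the source only says the loss is class-weighted).

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

# (start, end, kind) with half-open character offsets; kinds are hypothetical names.
Span = Tuple[int, int, str]


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end


def label_sentence(sent_start: int, sent_end: int, spans: Sequence[Span]) -> str:
    """Map a sentence's character range to an NLI label from overlapping spans."""
    kinds = {kind for s, e, kind in spans if overlaps(sent_start, sent_end, s, e)}
    if "conflict" in kinds:      # assumption: conflict wins over baseless info
        return "contradiction"
    if "baseless" in kinds:
        return "neutral"
    return "entailment"


def inverse_frequency_weights(labels: Iterable[str]) -> Dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = sum(counts.values()), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}


spans = [(10, 20, "conflict"), (40, 50, "baseless")]
print(label_sentence(0, 15, spans))    # overlaps the conflict span
print(label_sentence(35, 45, spans))   # overlaps the baseless span
print(label_sentence(20, 40, spans))   # touches neither span

# Grounded sentences dominate, so minority classes get larger weights.
weights = inverse_frequency_weights(["entailment"] * 8 + ["neutral", "contradiction"])
print(weights)
```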
Limitations
- The LLM judge has higher recall (0.92) — it catches more hallucinations, with more false positives. This model is the more precise detector, not the most sensitive one.
- Not SOTA — purpose-built LettuceDetect-large (0.79) is higher.
- English only; evaluated on RAGTruth (news summary, QA, data-to-text). Behavior on other domains is untested.
Credit
Builds on the NLI-as-factual-consistency line (TRUE, MiniCheck, AlignScore, LettuceDetect). Benchmark: RAGTruth.