LettuceDetect goes agentic — span-level hallucination detection across RAG, code, and tool output.

lettucedect-v2-qwen-2b: Generative Hallucination Span Detection

TL;DR — results

A 2B model that localizes and types hallucinated spans across code, tool output, and prose in one pass:

Unified test set (10,698): span-F1 0.689, example-F1 0.921, IoU 0.758.
Code-agent answers: span-F1 0.602 / example-F1 0.835 — where it beats every alternative we tested: our own 8B sibling (LFM-8B, 0.507), a self-hosted 120B judge (0.21) and a 550B judge (0.22), and off-the-shelf detectors (HHEM / Lynx-8B / Granite-Guardian / MiniCheck, all ≈ chance). Large general judges over-flag generated code; this model doesn't.
Established prose benchmarks: RAGTruth example-F1 0.818 (> LettuceDetect-large 0.792); PsiloQA strong across 14 languages.

One small, fast model — no separate judge, no giant LLM — for span-level hallucination detection from RAG to agentic coding.

Overview

lettucedect-v2-qwen-2b is a generative hallucination detector for Retrieval-Augmented Generation (RAG) and coding-agent settings. Given a user request, the supporting context, and an answer, it emits the exact spans of the answer that are not supported by the context, each tagged with a hallucination category and subcategory. Unlike the encoder models in the LettuceDetect family (token classifiers), this is an instruction-tuned generative model that returns structured JSON, so it localizes and types hallucinations in a single pass — across prose and code.

It is the first LettuceDetect model trained on a unified benchmark spanning code (SWE-bench-derived coding-agent traces) and prose (RAGTruth, PsiloQA, and synthetic ACL / README / tool-output / Wikipedia sources), in 14 languages.

Model Details

Base model: Qwen3.5-2B (hybrid Gated DeltaNet + attention)
Training: LoRA supervised fine-tuning (full bf16), merged to a standalone model
Task: generative span detection — outputs {"hallucinated_spans": [{"text", "category", "subcategory"}]}
Taxonomy: 3 categories (contradiction, fabricated_reference, unsupported_addition) × 13 subcategories
Context: long-context (handles the full request + retrieved context + answer)
Languages: English plus 13 more (via the multilingual PsiloQA portion)

How It Works

The model is prompted with a fixed, data-agnostic system prompt that defines a hallucination ("a substring of the answer not supported by the context") and enumerates the taxonomy, followed by the user request, the context, and the answer to verify. It replies with a JSON object listing each hallucinated span verbatim with its category and subcategory; a fully supported answer returns an empty list. Span offsets are recovered by matching each returned substring back into the answer.

Usage

The model is a generative span detector: serve it with vLLM (OpenAI-compatible) and send the LettuceDetect detection prompt, which defines a hallucination and enumerates the taxonomy. It replies with JSON; match each returned text back into the answer to get character offsets.

vllm serve KRLabsOrg/lettucedect-v2-qwen-2b --served-model-name lettucedect-v2-qwen-2b

The model expects the exact detection prompt it was trained on (below). Send the user request + context as the user turn, ending with the answer to verify.

import json
from openai import OpenAI

SYSTEM = """You are an expert annotator who identifies hallucinated spans in a generated answer with respect to a given context (the only trusted evidence). A hallucinated span is a substring of the answer that is not supported by the context. Spans consistent with the context are not hallucinations.

Quote each hallucinated span verbatim from the answer and classify it into exactly one category and one subcategory.

Categories (the kinds of unsupported span):
- contradiction: conflicts with the context (a wrong value, number, date, name, or relationship)
- fabricated_reference: an entity, name, identifier, or section that is absent from the context
- unsupported_addition: a claim, detail, or behavior the context never states

Subcategories:
- entity: a wrong or invented name, entity, or object
- temporal: an incorrect date, time, duration, or ordering
- numerical: an incorrect number, quantity, or amount
- value: a wrong value, setting, or attribute value
- relational: an incorrect relationship or association between things
- identifier: an invented identifier or name not found in the context
- section: a reference to a section, part, or location that does not exist
- attribute: an invented or incorrect attribute or property
- claim: an added factual claim the context does not support
- behavior: an added or changed action or behavior the context never states
- elaboration: extra detail or elaboration beyond what the context supports
- subjective: an unsupported subjective or evaluative statement
- unspecified: unsupported, with no more specific subtype

Reply with ONLY a JSON object (no markdown, no code fences): {"hallucinated_spans": [{"text": "...", "category": "...", "subcategory": "..."}]}. If nothing is unsupported, reply {"hallucinated_spans": []}."""

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
context = "France is a country in Europe. Its capital is Paris."
question = "What is the capital of France? What is its population?"
answer = "The capital of France is Paris. Its population is 2 million."
user = f"User request: {question}\n\n{context}\n\nAnswer to verify:\n{answer}"

resp = client.chat.completions.create(
    model="lettucedect-v2-qwen-2b", temperature=0.0,
    messages=[{"role": "system", "content": SYSTEM}, {"role": "user", "content": user}],
)
spans = json.loads(resp.choices[0].message.content)["hallucinated_spans"]
for s in spans:  # recover character offsets in the answer
    s["start"] = answer.find(s["text"])
    s["end"] = s["start"] + len(s["text"])
# [{"text": "Its population is 2 million.", "category": "unsupported_addition",
#   "subcategory": "numerical", "start": 32, "end": 60}]

Or via the LettuceDetect package: serve the model with vLLM, point OPENAI_API_BASE at it, and call

from lettucedetect.models.inference import HallucinationDetector
det = HallucinationDetector(method="llm", model="KRLabsOrg/lettucedect-v2-qwen-2b")
spans = det.predict(context=[context], question=question, answer=answer, output_format="spans")

It auto-routes this model to its native training prompt and hallucinated_spans output (typed category/subcategory). Pass native=True if served under a different name, and include_reasoning=True to request per-span explanations.

Plain `transformers` (no vLLM, no lettucedetect)

import json, torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "KRLabsOrg/lettucedect-v2-qwen-2b"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

SYSTEM = "..."  # the detection prompt shown above (verbatim)
context = "France is a country in Europe. Its capital is Paris."
question = "What is the capital of France? What is its population?"
answer = "The capital of France is Paris. Its population is 2 million."
user = f"User request: {question}\n\n{context}\n\nAnswer to verify:\n{answer}"

inputs = tok.apply_chat_template(
    [{"role": "system", "content": SYSTEM}, {"role": "user", "content": user}],
    add_generation_prompt=True, enable_thinking=False, return_tensors="pt", return_dict=True,
).to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
reply = tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# raw_decode reads the first JSON object and ignores any trailing text
spans = json.JSONDecoder().raw_decode(reply[reply.index("{"):])[0]["hallucinated_spans"]
for s in spans:  # recover character offsets in the answer
    s["start"] = answer.find(s["text"]); s["end"] = s["start"] + len(s["text"])
print(spans)
# -> [{'text': '2 million', 'category': 'unsupported_addition', 'subcategory': 'claim', 'start': 50, 'end': 59}]

Performance

Char-level span-F1 / example-F1 / IoU on the unified test set (10,698 samples). The model beats the mmBERT-base encoder on every source, span- and example-level.

dataset	n	span-F1	example-F1	IoU
ALL	10698	0.689	0.921	0.758
acl	440	0.749	0.942	0.811
code-agent	2015	0.602	0.835	0.707
readme	641	0.866	0.984	0.894
tool-output	617	0.719	0.907	0.793
wikipedia	1388	0.817	0.974	0.871
psiloqa (14 langs)	2897	0.732	0.966	0.687
ragtruth	2700	0.574	0.818	0.765

Its strength is unified cross-domain coverage — one model handling code, prose, and 14 languages — rather than a single-benchmark record.

Prose benchmarks: RAGTruth & PsiloQA

On the established prose benchmarks the same model is competitive with or better than specialized methods, so the code/tool sources extend rather than trade off against prose.

RAGTruth (official test set, n=2700) — example-level F1, on the standard leaderboard:

Method	Example-level F1
RAG-HAT (fine-tuned Llama-3-8B)	83.9
lettucedect-v2-qwen-2b (ours)	81.8
LettuceDetect-large (v1)	79.2
Fine-tuned Llama-2-13B (RAGTruth)	78.7
Luna	65.4
GPT-4	63.4

Second only to a fine-tuned 8B, and above the v1 detector, the fine-tuned 13B, Luna, and GPT-4. (Our span-level on RAGTruth: P 0.601 / R 0.548 / F1 0.574 / IoU 0.765.)

PsiloQA (multilingual) — span IoU. Our generative 2B matches PsiloQA's best fine-tuned encoder and is far above the strongest LLM judge:

Method (English)	IoU
lettucedect-v2-qwen-2b (ours)	0.724
mmBERT-base (PsiloQA paper, fine-tuned encoder)	0.707
Qwen2.5-32B-it, 3-shot (PsiloQA paper, LLM judge)	0.400

Across all 14 languages our IoU is 0.689. Numbers for the other methods are from the PsiloQA paper (arXiv 2510.04849); our IoU uses the unified char-overlap scorer, so cross-paper rows are indicative rather than an identical protocol. Full per-language results:

PsiloQA — per language (n=2897, 14 languages):

lang	n	span-F1	example-F1	IoU
ALL	2897	0.733	0.966	0.689
en	1098	0.785	0.987	0.724
es	100	0.817	0.973	0.685
ca	100	0.721	0.959	0.704
fi	200	0.683	0.963	0.727
it	99	0.680	0.984	0.724
cs	100	0.672	0.974	0.604
fa	100	0.650	0.977	0.789
hi	300	0.649	0.933	0.637
fr	100	0.634	0.944	0.661
zh	300	0.597	0.952	0.650
ar	100	0.594	0.966	0.606
de	100	0.587	0.901	0.650
sv	100	0.512	0.959	0.707
eu	100	0.511	0.890	0.569

Example-F1 stays high across all 14 languages (answer-level detection is reliable); span-F1 is strongest for higher-resource languages, as expected.

Baselines on code-agent

Code-agent (2,015 samples) is where off-the-shelf detectors and even frontier LLM judges collapse, and where this model's value is clearest:

detector	span-F1	example-F1	notes
lettucedect-v2-qwen-2b (this model)	0.602	0.835	balanced (span P 0.60 / R 0.61)
lettucedect-v2-lfm-8b (our 8B sibling)	0.507	0.811	scaling our own recipe to 8B does not help
lettucedect-v2-mmbert-base	0.508	0.770	encoder counterpart
lettucedect-large (v1, EN/RAGTruth span model)	0.172	0.684	trained on prose RAG
gpt-oss-120b (zero-shot judge, task-aware)	0.212	0.691	self-hosted, BAcc ≈ 0.50
Nemotron-3-Ultra-550B (zero-shot judge, naive prompt)	0.186	0.655	BAcc ≈ 0.50
Nemotron-3-Ultra-550B (task-aware prompt)	0.216	0.700	BAcc ≈ 0.50
HHEM-2.1 (sliding-window)	—	0.632	BAcc 0.497
Lynx-8B	—	0.609	BAcc 0.526
Granite-Guardian-4.1-8B	—	0.663	BAcc 0.538
MiniCheck-7B (claim-level)	—	0.670	flags every answer, BAcc 0.500

The off-the-shelf detectors and the 550B judge over-flag on code-agent (balanced accuracy near chance) because a generated code patch is not literally present in the context.

Citing

@article{Kovacs2025LettuceDetect,
  title={LettuceDetect: A Hallucination Detection Framework for RAG Applications},
  author={Kovács, Ádám and Recski, Gábor},
  journal={arXiv preprint arXiv:2502.17125},
  year={2025}
}

A dedicated paper for the unified code+prose benchmark and the v2 models is in preparation; this card will be updated with its citation on release.

Downloads last month: 11

Safetensors

Model size

2B params

Tensor type

F32

BF16

Model tree for KRLabsOrg/lettucedect-v2-qwen-2b

Base model

Qwen/Qwen3.5-2B-Base

Finetuned

Qwen/Qwen3.5-2B

Finetuned

(209)

this model

Quantizations

1 model

Datasets used to train KRLabsOrg/lettucedect-v2-qwen-2b

Paper for KRLabsOrg/lettucedect-v2-qwen-2b

LettuceDetect: A Hallucination Detection Framework for RAG Applications

Paper • 2502.17125 • Published Feb 24, 2025 • 14