ModernBERT-large — Tool-Calling Hallucination Span Detector

A fine-tuned ModernBERT-large for token-level hallucination span detection in LLM tool-calling answers. It flags which spans of an answer are hallucinated, covering three hallucination categories: answer_mismatch, missing_tool, and overgeneration.

Results

Span-level IoU F1 (greedy matching, IoU ≥ 0.5) on the s-nlp/toolace-unified-hallucinations test split:

Model	Span F1	answer_mismatch	missing_tool	overgeneration
Published baseline (`s-nlp/tool-calling-hallucination-modernbert-base-unified-final`)	0.9176	0.8432	0.9895	0.9373
This model	0.9407	0.8978	0.9933	0.9393

Overall span F1 improves by 2.31 points over the published ModernBERT-base baseline (0.9176 → 0.9407, +2.5% relative). The bottleneck class, answer_mismatch, improves from 0.8432 to 0.8978 (+5.46 points, +6.5% relative).
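For readers who want to reproduce a metric of this shape, here is a minimal sketch of span-level F1 with greedy matching at IoU ≥ 0.5. The card does not spell out the evaluation protocol, so several details are assumptions: spans are half-open character ranges `(start, end)`, candidate pairs are matched in order of decreasing IoU, each span is matched at most once, and counts are pooled across examples before computing F1. The function names are illustrative, not taken from the original evaluation code.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open character range (start, end); an assumption


def span_iou(a: Span, b: Span) -> float:
    """Intersection over union of two character ranges."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(pred: List[Span], gold: List[Span], threshold: float = 0.5) -> int:
    """Count one-to-one matches, taking the highest-IoU pairs first."""
    candidates = sorted(
        ((span_iou(p, g), i, j) for i, p in enumerate(pred) for j, g in enumerate(gold)),
        reverse=True,
    )
    used_pred, used_gold = set(), set()
    matches = 0
    for iou, i, j in candidates:
        if iou < threshold:
            break  # remaining pairs are all below the threshold
        if i in used_pred or j in used_gold:
            continue  # each span may be matched only once
        used_pred.add(i)
        used_gold.add(j)
        matches += 1
    return matches


def span_f1(examples: List[Tuple[List[Span], List[Span]]], threshold: float = 0.5) -> float:
    """Pooled F1 over (predicted spans, gold spans) pairs; returns 0.0 when nothing matches."""
    tp = n_pred = n_gold = 0
    for pred, gold in examples:
        tp += greedy_match(pred, gold, threshold)
        n_pred += len(pred)
        n_gold += len(gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Toy data: one matched span, one spurious prediction, two missed gold spans.
toy = [
    ([(0, 10), (20, 30)], [(2, 10), (50, 60)]),
    ([], [(5, 9)]),
]
toy_f1 = span_f1(toy)
```

On the toy data, one of two predictions matches and one of three gold spans is recovered, so precision is 1/2 and recall is 1/3.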

Usage

Tokenize the prompt and the answer as two segments, then predict a label for each answer token:

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tok = AutoTokenizer.from_pretrained("etomoscow/tool-calling-hallucination-modernbert-large")
model = AutoModelForTokenClassification.from_pretrained("etomoscow/tool-calling-hallucination-modernbert-large").eval()

prompt = "<conversation context>"  # System: ... \n\n User: ... \n\n Tool: ...
answer = "<final answer>"
# truncation="only_first" trims the prompt, never the answer, when the pair exceeds max_length
enc = tok(
    prompt, answer,
    truncation="only_first", max_length=4096,
    return_offsets_mapping=True, return_tensors="pt",
)
offsets = enc.pop("offset_mapping")  # remove before the forward pass; the model does not accept it
with torch.no_grad():
    preds = torch.argmax(model(**enc).logits, dim=-1)[0]
# Tokens with preds == 1 are predicted as hallucinated; map their offsets back to answer characters.
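The last step, mapping flagged tokens back to answer characters, could look roughly like the sketch below. It is an illustration rather than the author's post-processing: `merge_flagged_tokens` is a hypothetical helper, it assumes a fast tokenizer whose offsets for the second segment are relative to the answer string, and it merges runs of consecutive flagged answer tokens into a single span. Toy offsets stand in for real tokenizer output so the block runs without downloading the model.

```python
from typing import List, Optional, Sequence, Tuple


def merge_flagged_tokens(
    preds: Sequence[int],
    offsets: Sequence[Sequence[int]],
    sequence_ids: Sequence[Optional[int]],
) -> List[Tuple[int, int]]:
    """Merge consecutive answer tokens labelled 1 into (start, end) character spans."""
    spans: List[Tuple[int, int]] = []
    open_span: Optional[List[int]] = None
    for label, (start, end), seq_id in zip(preds, offsets, sequence_ids):
        # Keep only real answer tokens: segment 1 and a non-empty character range.
        is_answer_token = seq_id == 1 and end > start
        if is_answer_token and label == 1:
            if open_span is None:
                open_span = [start, end]
            else:
                open_span[1] = end  # extend the current run
        elif open_span is not None:
            spans.append((open_span[0], open_span[1]))
            open_span = None
    if open_span is not None:
        spans.append((open_span[0], open_span[1]))
    return spans


# Hand-written toy encoding (not real tokenizer output).
answer = "The flight costs $420 today."
toy_offsets = [(0, 0), (0, 4), (0, 0),                      # special, prompt token, separator
               (0, 3), (3, 10), (10, 16), (16, 18), (18, 21), (21, 27), (27, 28),
               (0, 0)]                                       # final separator
toy_sequence_ids = [None, 0, None, 1, 1, 1, 1, 1, 1, 1, None]
toy_preds = [0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0]                # the flagged prompt token is ignored

spans = merge_flagged_tokens(toy_preds, toy_offsets, toy_sequence_ids)
flagged_text = [answer[s:e] for s, e in spans]
```

With the variables from the usage snippet, the call would be along the lines of `merge_flagged_tokens(preds.tolist(), offsets[0].tolist(), enc.sequence_ids(0))`, where `sequence_ids` is the `BatchEncoding` method that marks which segment each token belongs to.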

Training

Backbone: answerdotai/ModernBERT-large, fine-tuned directly from the general-purpose pretrained checkpoint.
Data: the full s-nlp/toolace-unified-hallucinations train split, with answer_mismatch examples oversampled 3× to target the bottleneck class.
Optimization: learning rate 5e-5 with a cosine schedule, effective batch size 16, bf16 precision, FlashAttention-2, 8 epochs, maximum sequence length 4096 tokens.
Labels: one per answer token, 1 if the token lies inside a gold hallucination span and 0 otherwise.
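The labelling and oversampling steps listed above can be sketched as follows. This is not the original training code, and it rests on assumptions the card does not confirm: a token counts as "inside" a gold span when its character range overlaps the span, prompt and special tokens receive the label -100 (the default `ignore_index` of PyTorch's cross-entropy loss), and each training example carries a hypothetical `category` field. Both function names are illustrative.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def token_labels(
    offsets: Sequence[Tuple[int, int]],
    sequence_ids: Sequence[Optional[int]],
    gold_spans: Sequence[Tuple[int, int]],
    ignore_index: int = -100,
) -> List[int]:
    """Label answer tokens 1/0 by overlap with gold spans; mask everything else."""
    labels = []
    for (start, end), seq_id in zip(offsets, sequence_ids):
        if seq_id != 1 or end <= start:
            labels.append(ignore_index)  # prompt or special token (assumed masked)
            continue
        overlaps = any(start < g_end and end > g_start for g_start, g_end in gold_spans)
        labels.append(1 if overlaps else 0)
    return labels


def oversample(examples: List[Dict], category: str, factor: int = 3) -> List[Dict]:
    """Repeat every example of the given category `factor` times; keep the rest once."""
    out: List[Dict] = []
    for ex in examples:
        out.extend([ex] * (factor if ex["category"] == category else 1))
    return out


# Toy answer "The flight costs $420 today." with the gold span covering "$420".
toy_offsets = [(0, 0), (0, 4), (0, 0),
               (0, 3), (3, 10), (10, 16), (16, 18), (18, 21), (21, 27), (27, 28),
               (0, 0)]
toy_sequence_ids = [None, 0, None, 1, 1, 1, 1, 1, 1, 1, None]
labels = token_labels(toy_offsets, toy_sequence_ids, gold_spans=[(17, 21)])

toy_train = [{"category": "answer_mismatch"}, {"category": "missing_tool"}]
resampled = oversample(toy_train, "answer_mismatch", factor=3)
```

Overlap-based labelling marks a token as hallucinated even when a gold span starts mid-token; a stricter "fully contained" rule is an equally plausible reading of the card.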

License & attribution

MIT. Trained on s-nlp/toolace-unified-hallucinations. Built to improve on the s-nlp/tool-calling-hallucination-modernbert-base-unified-final baseline.
