Tool-Calling Hallucination Span Detector — Large + CRF Ensemble (0.9478)

A weighted probability ensemble of 4 ModernBERT span models (3 large CE + 1 base CRF) with span post-processing. Reproduces the best result from the experiment campaign: 0.9478 span-level IoU F1 on the s-nlp/toolace-unified-hallucinations test split.

Result

Span-level IoU F1 (greedy matching, IoU ≥ 0.5):

System	Span F1	answer_mismatch	missing_tool	overgeneration
Published baseline	0.9176	0.8432	0.9895	0.9373
This ensemble	0.9478	0.9237	0.9951	0.9426

+3.02 points over the published base checkpoint (+3.3% relative); the answer_mismatch bottleneck improves 0.8432 → 0.9237 (+9.6% rel).

Members

Member	Backbone / head	Span F1	weight
`members/run_13`	ModernBERT-large, CE	0.9350	1.5
`members/run_20`	ModernBERT-large + AM3×, CE	0.9407	1.5
`members/run_27`	ModernBERT-large + AM3× (seed2), CE	0.9387	1.5
`members/run_22`	ModernBERT-base + linear-chain CRF	0.9301	1.0

Recipe

Per answer token, each CE member outputs P(hallucination) via softmax; the CRF member outputs a hard {0,1} via Viterbi.
Weighted average across members → ensemble P.
Threshold at 0.55 (tuned) → binary per-token prediction.
Post-processing: drop spans < 2 tokens; greedy IoU matching @ 0.5.

Usage

python inference_ensemble.py \
  --checkpoints members/run_13/final_model,members/run_20/final_model,members/run_27/final_model \
  --crf_checkpoint members/run_22 \
  --weights 1.5,1.5,1.5,1 --threshold 0.55 --min_span 2 --gap 0

inference_ensemble.py accepts --input_json (a list of {system, conversations}) and emits predicted spans per item.

Why this works

Capacity (ModernBERT-large) is what moves the answer_mismatch bottleneck.
Diversity beats count: correlated fine-tunes dilute; the CRF member (different architecture) and AM-oversampled large members are complementary.
Post-processing (threshold + min-span) cuts false positives on empty-span classes (clean / undergeneration).

Notes / future work

members/run_22 (CRF) uses a custom architecture loaded by CRFM in inference_ensemble.py (final_model/crf_model.pt + encoder/).
A Large+CRF (CRF head initialized from a tuned large encoder, span F1 0.9406, precision 0.966) was trained but not yet integrated into this ensemble — a promising next member (a GPU fault interrupted the integration test).

License & attribution

MIT. Trained on s-nlp/toolace-unified-hallucinations. Builds on the s-nlp/tool-calling-hallucination-modernbert-base-unified-final baseline and answerdotai/ModernBERT-large.

Downloads last month: -; Downloads are not tracked for this model. How to track