Tool-Calling Hallucination Span Detector: Large + CRF Ensemble (0.9478)

A weighted probability ensemble of four ModernBERT span models (three large models trained with cross-entropy, CE, plus one base model with a CRF head), followed by span post-processing. It reproduces the best result from the experiment campaign: 0.9478 span-level IoU F1 on the s-nlp/toolace-unified-hallucinations test split.

Result

Span-level IoU F1 (greedy matching, IoU ≥ 0.5):

System              Span F1  answer_mismatch  missing_tool  overgeneration
Published baseline  0.9176   0.8432           0.9895        0.9373
This ensemble       0.9478   0.9237           0.9951        0.9426

+3.02 points over the published base checkpoint (+3.3% relative); the answer_mismatch bottleneck improves 0.8432 → 0.9237 (+9.6% relative).
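The metric can be sketched as follows. This is a minimal illustration, not the evaluation script behind the numbers above: it assumes spans are half-open (start, end) offsets, that pairs are matched one-to-one in descending IoU order, and that F1 is computed from aggregated match counts. The function names are hypothetical.

```python
def span_iou(a, b):
    """IoU of two half-open (start, end) spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def count_matches(pred, gold, iou_threshold=0.5):
    """Greedy one-to-one matching: take the highest-IoU pairs first."""
    pairs = sorted(
        ((span_iou(p, g), i, j)
         for i, p in enumerate(pred)
         for j, g in enumerate(gold)),
        reverse=True,
    )
    used_pred, used_gold, matches = set(), set(), 0
    for iou, i, j in pairs:
        if iou < iou_threshold:
            break  # remaining pairs overlap even less
        if i in used_pred or j in used_gold:
            continue  # each span can be matched at most once
        used_pred.add(i)
        used_gold.add(j)
        matches += 1
    return matches


def span_f1(matches, n_pred, n_gold):
    """F1 from match counts (assumed aggregation, not confirmed by the source)."""
    precision = matches / n_pred if n_pred else 0.0
    recall = matches / n_gold if n_gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


pred = [(1, 3), (10, 12)]
gold = [(1, 4), (20, 25)]
matches = count_matches(pred, gold)
f1 = span_f1(matches, len(pred), len(gold))
```

In this toy case only (1, 3) and (1, 4) overlap enough (IoU 2/3), so one of two predictions and one of two gold spans are matched.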

Members

Member          Backbone / head                      Span F1  Weight
members/run_13  ModernBERT-large, CE                 0.9350   1.5
members/run_20  ModernBERT-large + AM3×, CE          0.9407   1.5
members/run_27  ModernBERT-large + AM3× (seed2), CE  0.9387   1.5
members/run_22  ModernBERT-base + linear-chain CRF   0.9301   1.0

Recipe

  1. Per answer token, each CE member outputs P(hallucination) via softmax; the CRF member outputs a hard {0,1} via Viterbi.
  2. Weighted average across members → ensemble P.
  3. Threshold at 0.55 (tuned) → binary per-token prediction.
  4. Post-processing: drop spans < 2 tokens; greedy IoU matching @ 0.5.
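Steps 1 to 4 can be sketched in a few lines of NumPy. This is an illustrative sketch, not the code in inference_ensemble.py: the function names are hypothetical, the per-token scores are assumed to be already aligned across members, and the threshold comparison is assumed to be inclusive.

```python
import numpy as np


def ensemble_token_probs(ce_probs, crf_labels, weights):
    """Weighted average of per-token scores; CRF hard labels count as 0/1 scores."""
    rows = [np.asarray(p, dtype=float) for p in ce_probs]
    rows.append(np.asarray(crf_labels, dtype=float))
    scores = np.vstack(rows)                      # shape: (members, tokens)
    w = np.asarray(weights, dtype=float)
    return (w[:, None] * scores).sum(axis=0) / w.sum()


def mask_to_spans(mask):
    """Convert a binary token mask into half-open (start, end) token spans."""
    spans, start = [], None
    for i, flag in enumerate(mask):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(mask)))
    return spans


def predict_spans(ce_probs, crf_labels, weights, threshold=0.55, min_span=2):
    """Threshold the ensemble probability, then drop spans shorter than min_span."""
    probs = ensemble_token_probs(ce_probs, crf_labels, weights)
    mask = probs >= threshold  # assumption: inclusive comparison
    return [(s, e) for s, e in mask_to_spans(mask) if e - s >= min_span]


# Toy example: three CE members and one CRF member over six answer tokens.
ce = [
    [0.1, 0.9, 0.8, 0.2, 0.7, 0.1],
    [0.2, 0.8, 0.9, 0.1, 0.6, 0.2],
    [0.1, 0.7, 0.9, 0.3, 0.8, 0.1],
]
crf = [0, 1, 1, 0, 0, 0]
spans = predict_spans(ce, crf, weights=[1.5, 1.5, 1.5, 1.0])
```

In the toy example, token 4 clears the threshold on the CE members alone but forms a one-token span, so the min-span filter removes it and only tokens 1 to 2 survive.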

Usage

python inference_ensemble.py \
  --checkpoints members/run_13/final_model,members/run_20/final_model,members/run_27/final_model \
  --crf_checkpoint members/run_22 \
  --weights 1.5,1.5,1.5,1 --threshold 0.55 --min_span 2 --gap 0

inference_ensemble.py accepts --input_json (a list of {system, conversations}) and emits predicted spans per item.

Why this works

  • Capacity (ModernBERT-large) is what moves the answer_mismatch bottleneck.
  • Diversity beats count: correlated fine-tunes dilute; the CRF member (different architecture) and AM-oversampled large members are complementary.
  • Post-processing (threshold + min-span) cuts false positives on empty-span classes (clean / undergeneration).
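To see how the CRF member's hard vote interacts with the three CE members, the decision rule can be worked out from the weights (1.5, 1.5, 1.5, 1.0) and the 0.55 threshold given above. The notation is introduced here for illustration: p_1 to p_3 are the CE probabilities for a token, c is the CRF label, and the comparison is assumed to be inclusive.

```latex
\[
P_{\text{ens}} = \frac{1.5\,(p_1 + p_2 + p_3) + 1.0\,c}{5.5},
\qquad c \in \{0, 1\}
\]
\[
P_{\text{ens}} \ge 0.55
\;\Longleftrightarrow\;
4.5\,\bar{p} + c \ge 3.025,
\qquad \bar{p} = \tfrac{1}{3}\,(p_1 + p_2 + p_3)
\]
\[
c = 1:\ \bar{p} \ge \frac{2.025}{4.5} = 0.45,
\qquad
c = 0:\ \bar{p} \ge \frac{3.025}{4.5} \approx 0.672
\]
```

Under these assumptions the CRF vote moves the bar for the large models from about 0.67 to 0.45 on average, but it cannot flag a token alone, since its maximum contribution is 1/5.5 ≈ 0.18.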

Notes / future work

  • members/run_22 (CRF) uses a custom architecture loaded by CRFM in inference_ensemble.py (final_model/crf_model.pt + encoder/).
  • A Large+CRF model (CRF head initialized from a tuned large encoder, span F1 0.9406, precision 0.966) was trained but is not yet integrated into this ensemble. It is a promising next member; a GPU fault interrupted the integration test.

License & attribution

MIT. Trained on s-nlp/toolace-unified-hallucinations. Builds on the s-nlp/tool-calling-hallucination-modernbert-base-unified-final baseline and answerdotai/ModernBERT-large.
