Tool-Calling Hallucination Span Detector β Large + CRF Ensemble (0.9478)
A weighted probability ensemble of 4 ModernBERT span models (3 large CE +
1 base CRF) with span post-processing. Reproduces the best result from the
experiment campaign: 0.9478 span-level IoU F1 on the
s-nlp/toolace-unified-hallucinations test split.
Result
Span-level IoU F1 (greedy matching, IoU β₯ 0.5):
| System | Span F1 | answer_mismatch | missing_tool | overgeneration |
|---|---|---|---|---|
| Published baseline | 0.9176 | 0.8432 | 0.9895 | 0.9373 |
| This ensemble | 0.9478 | 0.9237 | 0.9951 | 0.9426 |
+3.02 points over the published base checkpoint (+3.3% relative); the
answer_mismatch bottleneck improves 0.8432 β 0.9237 (+9.6% rel).
Members
| Member | Backbone / head | Span F1 | weight |
|---|---|---|---|
members/run_13 |
ModernBERT-large, CE | 0.9350 | 1.5 |
members/run_20 |
ModernBERT-large + AM3Γ, CE | 0.9407 | 1.5 |
members/run_27 |
ModernBERT-large + AM3Γ (seed2), CE | 0.9387 | 1.5 |
members/run_22 |
ModernBERT-base + linear-chain CRF | 0.9301 | 1.0 |
Recipe
- Per answer token, each CE member outputs P(hallucination) via softmax; the CRF member outputs a hard {0,1} via Viterbi.
- Weighted average across members β ensemble P.
- Threshold at 0.55 (tuned) β binary per-token prediction.
- Post-processing: drop spans < 2 tokens; greedy IoU matching @ 0.5.
Usage
python inference_ensemble.py \
--checkpoints members/run_13/final_model,members/run_20/final_model,members/run_27/final_model \
--crf_checkpoint members/run_22 \
--weights 1.5,1.5,1.5,1 --threshold 0.55 --min_span 2 --gap 0
inference_ensemble.py accepts --input_json (a list of {system, conversations})
and emits predicted spans per item.
Why this works
- Capacity (ModernBERT-large) is what moves the
answer_mismatchbottleneck. - Diversity beats count: correlated fine-tunes dilute; the CRF member (different architecture) and AM-oversampled large members are complementary.
- Post-processing (threshold + min-span) cuts false positives on empty-span classes (clean / undergeneration).
Notes / future work
members/run_22(CRF) uses a custom architecture loaded byCRFMininference_ensemble.py(final_model/crf_model.pt+encoder/).- A Large+CRF (CRF head initialized from a tuned large encoder, span F1 0.9406, precision 0.966) was trained but not yet integrated into this ensemble β a promising next member (a GPU fault interrupted the integration test).
License & attribution
MIT. Trained on s-nlp/toolace-unified-hallucinations. Builds on the
s-nlp/tool-calling-hallucination-modernbert-base-unified-final baseline and
answerdotai/ModernBERT-large.