Metrics for hallucination detection in summarization

#6
by rohitsaxena - opened
hallucinations-leaderboard org

The metric for summarization currently reports aggregated ROUGE scores, which measure the overlap between the generated and the reference summaries. However, lexical overlap does not necessarily capture hallucinations.
More recent work has used QA- and NLI-based methods to detect hallucinations in abstractive summarization, and we could also report scores using these approaches:

  1. QAFactEval (https://aclanthology.org/2022.naacl-main.187/)
  2. TrueTeacher (https://aclanthology.org/2023.emnlp-main.127/)

TrueTeacher is available on the Hub as google/t5_11b_trueteacher_and_anli.
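
For reference, scoring with it is a standard seq2seq call. A minimal sketch, assuming the `premise: ... hypothesis: ...` input format and the '1'/'0' output convention described on the model card (the `consistency_label` helper is just illustrative):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "google/t5_11b_trueteacher_and_anli"
tokenizer = T5Tokenizer.from_pretrained(model_name)
# 11B parameters: load in bf16 and shard across available GPUs (needs `accelerate`).
model = T5ForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

def consistency_label(source_document: str, summary: str) -> str:
    """Returns '1' if the summary is judged consistent with the source, else '0'."""
    prompt = f"premise: {source_document} hypothesis: {summary}"
    input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids
    outputs = model.generate(input_ids.to(model.device), max_new_tokens=2)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(consistency_label("The sun is shining.", "The sun is out."))  # expect '1'
```

One could also read a soft score off the logits of the '1' token instead of decoding a hard label.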

hallucinations-leaderboard org

Yeah, one problem is that I'm not sure the harness unloads the model before starting evaluation (I can check), so the 11B model might not fit into memory -- let's see!
Is there anything smaller we can use @zorik @rohitsaxena ?
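
If it turns out the harness does keep the evaluated model resident, one workaround is to free it explicitly before loading the judge. A minimal sketch, assuming a CUDA setup and that `model` would be the harness's handle to the just-evaluated model:

```python
import gc

import torch

def free_gpu_model(model) -> None:
    """Move a finished model off the GPU so the 11B judge can be loaded.

    Hypothetical helper; `model` is whatever handle the harness keeps
    to the just-evaluated model.
    """
    model.to("cpu")           # move the weights out of GPU memory
    gc.collect()              # collect any now-unreferenced tensors
    torch.cuda.empty_cache()  # return cached CUDA blocks to the driver
```

Whether that's enough depends on how the harness holds references, of course.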

hallucinations-leaderboard org
•
edited Feb 5

Another work, SCALE (EMNLP 2023), supports relatively smaller models; FLAN-T5 Large is one of the recommended models for best results.
https://github.com/asappresearch/scale-score
Note: empirically I found its performance is not on par with TrueTeacher.
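
The core of a SCALE-style score can be sketched in a few lines: ask FLAN-T5 whether the source entails the claim and read the probability of "Yes" off the first decoder step. This is only a rough sketch; the prompt wording and the absence of source chunking are my simplifications, so use the repo above for real numbers:

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "google/flan-t5-large"  # one of the recommended backbones
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def entailment_prob(source: str, claim: str) -> float:
    """P('Yes') for a yes/no entailment prompt, from one decoder step."""
    prompt = f'{source} Question: Does this imply "{claim}"? Yes or No?'
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    # Run a single decoder step and compare the logits of "Yes" vs "No".
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, -1]
    yes_id = tokenizer("Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("No", add_special_tokens=False).input_ids[0]
    return torch.softmax(logits[[yes_id, no_id]], dim=0)[0].item()

print(entailment_prob("The sun is shining.", "It is sunny."))
```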

We could also evaluate this metric: https://pypi.org/project/infuz/?
