Metrics for hallucination detection in summarization

#6
by rohitsaxena - opened
hallucinations-leaderboard org

The metric for summarization currently reports aggregated ROUGE scores, which measure the overlap between the generated and the reference summaries. However, lexical overlap does not necessarily capture hallucinations.
More recent work has used QA- and NLI-based methods to detect hallucinations in abstractive summarization, and we could also report scores using these approaches:

  1. QAFactEval (https://aclanthology.org/2022.naacl-main.187/)
  2. TrueTeacher (https://aclanthology.org/2023.emnlp-main.127/)

TrueTeacher is available on the Hub as google/t5_11b_trueteacher_and_anli.
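
For reference, scoring with it is a standard seq2seq call. A minimal sketch, assuming the `premise: ... hypothesis: ...` input format and the '1'/'0' output convention described on the model card (the `consistency_label` helper is just illustrative):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "google/t5_11b_trueteacher_and_anli"
tokenizer = T5Tokenizer.from_pretrained(model_name)
# 11B parameters: load in bf16 and shard across available GPUs (needs `accelerate`).
model = T5ForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

def consistency_label(source_document: str, summary: str) -> str:
    """Returns '1' if the summary is judged consistent with the source, else '0'."""
    prompt = f"premise: {source_document} hypothesis: {summary}"
    input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids
    outputs = model.generate(input_ids.to(model.device), max_new_tokens=2)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(consistency_label("The sun is shining.", "The sun is out."))  # expect '1'
```

One could also read a soft score off the logits of the '1' token instead of decoding a hard label.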

hallucinations-leaderboard org

Yeah, one problem is that I'm not sure the harness unloads the model before starting evaluation (I can check), so the 11B model might not fit into memory -- let's see!
Is there anything smaller we can use @zorik @rohitsaxena ?
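
If it turns out the harness does keep the evaluated model resident, one workaround is to free it explicitly before loading the judge. A minimal sketch, assuming a CUDA setup and that `model` would be the harness's handle to the just-evaluated model:

```python
import gc

import torch

def free_gpu_model(model) -> None:
    """Move a finished model off the GPU so the 11B judge can be loaded.

    Hypothetical helper; `model` is whatever handle the harness keeps
    to the just-evaluated model.
    """
    model.to("cpu")           # move the weights out of GPU memory
    gc.collect()              # collect any now-unreferenced tensors
    torch.cuda.empty_cache()  # return cached CUDA blocks to the driver
```

Whether that's enough depends on how the harness holds references, of course.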

hallucinations-leaderboard org
•
edited Feb 5

Another work, SCALE (EMNLP 2023), supports relatively smaller models; FLAN-T5 Large is one of the recommended models for best results.
https://github.com/asappresearch/scale-score
Note: empirically I found its performance is not on par with TrueTeacher.
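
The core of a SCALE-style score can be sketched in a few lines: ask FLAN-T5 whether the source entails the claim and read the probability of "Yes" off the first decoder step. This is only a rough sketch; the prompt wording and the absence of source chunking are my simplifications, so use the repo above for real numbers:

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "google/flan-t5-large"  # one of the recommended backbones
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def entailment_prob(source: str, claim: str) -> float:
    """P('Yes') for a yes/no entailment prompt, from one decoder step."""
    prompt = f'{source} Question: Does this imply "{claim}"? Yes or No?'
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    # Run a single decoder step and compare the logits of "Yes" vs "No".
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, -1]
    yes_id = tokenizer("Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("No", add_special_tokens=False).input_ids[0]
    return torch.softmax(logits[[yes_id, no_id]], dim=0)[0].item()

print(entailment_prob("The sun is shining.", "It is sunny."))
```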

We could also evaluate this metric: https://pypi.org/project/infuz/?
