Add evaluation results on the squad_v2 config of the squad_v2 dataset

#1 - opened by autoevaluator (HF staff)

Beep boop, I am a bot from Hugging Face's automatic model evaluator πŸ‘‹!
Your model has been evaluated on the squad_v2 config of the squad_v2 dataset by @sjrlee, using the predictions stored here.
Accept this pull request to see the results displayed on the Hub leaderboard.
Evaluate your model on more datasets here.

I noticed there is a somewhat large discrepancy between the reported performance of the model on the model card (EM: 79.8; F1: 83.9) and the metrics calculated by HF (EM: 75.2; F1: 78.3). It is probably worth investigating the cause of these large differences.

deepset org

Hey @sjrlee, as far as I can see the reported performance on the model card is:
"exact": 74.06721131980123%
"f1": 76.39919553344667%

  • We noticed something similar with @julianrisch on another model earlier, and we found the difference small enough not to be a concern. Wdyt about this one?
deepset org

@Tuana ahh right, so the model card was actually incorrect before, and I believe @MichelBartels updated it to the values you see now. So in short, go ahead and merge these results now; I'd say the difference is small enough not to worry about.

deepset org

The differences are something we should investigate further so we can adjust our reporting of the numbers. I'd suggest you create an issue in Haystack to check the performance of the model. It might have to do with how we handle no-answers?

deepset org

Yes, it's possible, but I think that @MichelBartels knew about this and turned off using confidence scores to calculate these numbers, so I don't think this difference comes from the no-answers. But yes, the difference should be investigated in general. We would also need to find out how the autoevaluator works.

Hi deepsetters!

In case it's helpful for your comparisons, we compute the EM / F1 scores using the same logic as in the question answering scripts from the transformers repo: https://github.com/huggingface/transformers/blob/main/examples/pytorch/question-answering/run_qa.py
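For anyone who wants to reproduce the comparison locally, here is a minimal sketch that scores SQuAD v2-style predictions with the squad_v2 metric from the `evaluate` library. This is not the autoevaluator's exact pipeline, just the same underlying metric logic, and the example id and answers below are made up:

```python
# Hedged sketch: score SQuAD v2-style predictions with the squad_v2 metric.
import evaluate

squad_v2_metric = evaluate.load("squad_v2")

# Made-up example; a real evaluation uses the full squad_v2 validation split.
predictions = [
    {
        "id": "56ddde6b9a695914005b9628",
        "prediction_text": "Normandy",
        "no_answer_probability": 0.0,  # 0.0 means the model commits to a span answer
    }
]
references = [
    {
        "id": "56ddde6b9a695914005b9628",
        "answers": {"text": ["Normandy"], "answer_start": [159]},
    }
]

results = squad_v2_metric.compute(predictions=predictions, references=references)
print(results["exact"], results["f1"])  # EM and F1 on this toy example: 100.0 100.0
```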

In particular, we use the default value of 0 for the threshold to select a null answer: https://github.com/huggingface/transformers/blob/fd9aa82b07d9b844a21f18f1622de5ca104f25bd/examples/pytorch/question-answering/run_qa.py#L169
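To make the threshold's effect concrete, here is a hedged sketch of the no-answer decision rule used by the transformers QA post-processing when `version_2_with_negative` is set. The function name and scores below are illustrative, not the actual `utils_qa.py` code:

```python
# Illustrative sketch: the empty answer wins when the null (CLS) score beats the best
# span score by more than null_score_diff_threshold. With the default threshold of 0.0,
# any advantage for the null score makes the model abstain.

def select_answer(best_span_text: str,
                  best_span_score: float,  # start_logit + end_logit of the best non-null span
                  null_score: float,       # start_logit + end_logit of the CLS position
                  null_score_diff_threshold: float = 0.0) -> str:
    score_diff = null_score - best_span_score
    return "" if score_diff > null_score_diff_threshold else best_span_text

# Made-up logits: a higher threshold makes the model less eager to abstain, which is one
# knob that can shift EM/F1 between evaluation setups.
print(select_answer("Normandy", best_span_score=9.1, null_score=9.4))  # -> "" (abstain)
print(select_answer("Normandy", best_span_score=9.1, null_score=9.4,
                    null_score_diff_threshold=0.5))                    # -> "Normandy"
```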

deepset org

Thanks @lewtun
@julianrisch and @sjrlee - I'll create an issue on Haystack about this anyway, just in case we want to investigate further. I'll post the links from Lewis in there too. If you conclude we don't need to investigate further, we can just close the issue there and merge this PR too πŸ‘

Ready to merge
This branch is ready to get merged automatically.
