Add evaluation results on the squad_v2 config of the squad_v2 dataset

#1 - opened by autoevaluator (HF staff)

Beep boop, I am a bot from Hugging Face's automatic model evaluator πŸ‘‹!
Your model has been evaluated on the squad_v2 config of the squad_v2 dataset by @sjrlee, using the predictions stored here.
Accept this pull request to see the results displayed on the Hub leaderboard.
Evaluate your model on more datasets here.

I noticed there is a somewhat large discrepancy between the reported performance of the model on the model card (EM: 79.8; F1: 83.9) and the metrics calculated by HF (EM: 75.2; F1: 78.3). It is probably worth investigating the cause of these large differences.

deepset org

Hey @sjrlee, as far as I can see the reported performance on the model card is:
"exact": 74.06721131980123%
"f1": 76.39919553344667%

  • We noticed something similar with @julianrisch on another model earlier, and we found the difference small enough not to be a concern. Wdyt about this one?
deepset org

@Tuana ahh right, so the model card was actually incorrect before, and I believe @MichelBartels updated it to the values you see now. So in short, go ahead and merge these results now; I'd say the difference is small enough not to worry about.

deepset org

The differences are something we should investigate further so we can adjust our reporting of the numbers. I'd suggest you create an issue in Haystack to check the performance of the model. It might have to do with how we handle no-answers?

deepset org

Yes, it's possible, but I think that @MichelBartels knew about this and turned off using confidence scores to calculate these numbers, so I don't think this difference comes from the no-answers. But yes, the difference should be investigated in general. We would also need to find out how the autoevaluator works.

Hi deepsetters!

In case it's helpful for your comparisons, we compute the EM / F1 scores using the same logic as in the question answering scripts from the transformers repo: https://github.com/huggingface/transformers/blob/main/examples/pytorch/question-answering/run_qa.py
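For anyone who wants to reproduce the comparison locally, here is a minimal sketch that scores SQuAD v2-style predictions with the squad_v2 metric from the `evaluate` library. This is not the autoevaluator's exact pipeline, just the same underlying metric logic, and the example id and answers below are made up:

```python
# Hedged sketch: score SQuAD v2-style predictions with the squad_v2 metric.
import evaluate

squad_v2_metric = evaluate.load("squad_v2")

# Made-up example; a real evaluation uses the full squad_v2 validation split.
predictions = [
    {
        "id": "56ddde6b9a695914005b9628",
        "prediction_text": "Normandy",
        "no_answer_probability": 0.0,  # 0.0 means the model commits to a span answer
    }
]
references = [
    {
        "id": "56ddde6b9a695914005b9628",
        "answers": {"text": ["Normandy"], "answer_start": [159]},
    }
]

results = squad_v2_metric.compute(predictions=predictions, references=references)
print(results["exact"], results["f1"])  # EM and F1 on this toy example: 100.0 100.0
```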

In particular, we use the default value of 0 for the threshold to select a null answer: https://github.com/huggingface/transformers/blob/fd9aa82b07d9b844a21f18f1622de5ca104f25bd/examples/pytorch/question-answering/run_qa.py#L169
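To make the threshold's effect concrete, here is a hedged sketch of the no-answer decision rule used by the transformers QA post-processing when `version_2_with_negative` is set. The function name and scores below are illustrative, not the actual `utils_qa.py` code:

```python
# Illustrative sketch: the empty answer wins when the null (CLS) score beats the best
# span score by more than null_score_diff_threshold. With the default threshold of 0.0,
# any advantage for the null score makes the model abstain.

def select_answer(best_span_text: str,
                  best_span_score: float,  # start_logit + end_logit of the best non-null span
                  null_score: float,       # start_logit + end_logit of the CLS position
                  null_score_diff_threshold: float = 0.0) -> str:
    score_diff = null_score - best_span_score
    return "" if score_diff > null_score_diff_threshold else best_span_text

# Made-up logits: a higher threshold makes the model less eager to abstain, which is one
# knob that can shift EM/F1 between evaluation setups.
print(select_answer("Normandy", best_span_score=9.1, null_score=9.4))  # -> "" (abstain)
print(select_answer("Normandy", best_span_score=9.1, null_score=9.4,
                    null_score_diff_threshold=0.5))                    # -> "Normandy"
```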

deepset org

Thanks @lewtun
@julianrisch and @sjrlee - I'll create an issue on Haystack about this anyway, just in case we want to investigate further. I'll post the links from Lewis in there too. If you conclude we don't need to investigate further, we can just close the issue there and merge this PR too πŸ‘

Ready to merge
This branch is ready to get merged automatically.
