Add evaluation results on the adversarialQA config of adversarial_qa
Beep boop, I am a bot from Hugging Face's automatic model evaluator 👋!
Your model has been evaluated on the adversarialQA config of the adversarial_qa dataset by @ceyda, using the predictions stored here.
Accept this pull request to see the results displayed on the Hub leaderboard.
Evaluate your model on more datasets here.
Closing this PR as the model was trained on SQuAD data. We think it might be confusing to users to have evaluation results on adversarialQA data for this model, which was not trained on that type of data.
Out of curiosity, is this due to something specific about the adversarial_qa dataset? I'm just wondering if you'd apply the same logic to other question answering datasets like covid_qa_deepset?
When we developed the evaluation pipeline for question answering, our view was that models trained on SQuAD could be evaluated on any dataset that follows the SQuAD format. This would e.g. allow users to know whether domain adaptation is required on the vanilla SQuAD model for a given dataset.
Note: it's perfectly fine to reject this evaluation. I'm simply curious whether my intuition about evaluating QA models is wrong :)
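For concreteness, here is a minimal sketch of the kind of cross-domain evaluation described above: a SQuAD-trained extractive QA model scored on the adversarialQA validation split with the standard SQuAD metric. The checkpoint name is only an illustration (not necessarily the model in this PR), and this is not the autoevaluator's actual pipeline.

```python
# Minimal sketch: evaluate a SQuAD-trained extractive QA model on an
# out-of-domain, SQuAD-format dataset (adversarial_qa, adversarialQA config).
from datasets import load_dataset
from transformers import pipeline
import evaluate

# Any SQuAD-trained checkpoint would slot in here; this name is just an example.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
dataset = load_dataset("adversarial_qa", "adversarialQA", split="validation")
squad_metric = evaluate.load("squad")

predictions, references = [], []
for example in dataset:
    output = qa(question=example["question"], context=example["context"])
    predictions.append({"id": example["id"], "prediction_text": output["answer"]})
    references.append({"id": example["id"], "answers": example["answers"]})

# Reports exact_match and f1 on the out-of-domain split.
print(squad_metric.compute(predictions=predictions, references=references))
```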
We would really like to accumulate these metrics for all question-answering datasets! However, as @Tuana mentioned, we think posting the eval results as they are shown right now might confuse users into thinking that this model was also trained on that dataset.
If it is possible to have "In Domain" and "Out of Domain" labels (or something similar) in the Evaluation Results section, then I think we can make it clear to users that fine-tuning this model on that dataset could boost performance, and we would like to add results like adversarial_qa to that "Out of Domain" section.
The adversarial_qa dataset contains training, validation and test splits (10,000 training, 1,000 validation, and 1,000 test samples, if I am not mistaken). Therefore, in my opinion, it makes sense to compare models on its test split that have been finetuned using its training split. I could also imagine that one might want to compare models on its test split that have not been finetuned using its training split; that matches your vanilla SQuAD model/domain adaptation example. However, I wouldn't mix these two kinds of models: it's not surprising that models finetuned on the dataset are better than models not finetuned on it.

@lewtun Maybe an idea could be to have two separate evaluation dashboards of test-split metrics per dataset? One for models not finetuned on the training split and the other for models finetuned on it? I would find mixing both kinds of models in one dashboard confusing.
covid_qa_deepset is much smaller and does not have designated splits (2000 samples in total). If it had splits, I would also prefer to have two separate evaluation dashboards for that dataset. :)
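To make the split situation concrete, here is a quick way to inspect both datasets with the datasets library. The counts in the comments reflect what the Hub reports at the time of writing and should be treated as approximate.

```python
# Inspect the available splits and their sizes for the two datasets discussed above.
from datasets import load_dataset

adversarial = load_dataset("adversarial_qa", "adversarialQA")
print({split: len(adversarial[split]) for split in adversarial})
# expected: {'train': 10000, 'validation': 1000, 'test': 1000}

covid = load_dataset("covid_qa_deepset")
print({split: len(covid[split]) for split in covid})
# covid_qa_deepset ships a single split of roughly 2,000 examples,
# so there is no designated held-out test set to evaluate against.
```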
Thanks @sjrlee and @julianrisch for this insightful discussion!
> We would really like to accumulate these metrics for all question-answering datasets! However, as @Tuana mentioned we think posting the eval results as they are shown right now might confuse users into thinking that this model was also trained on that dataset.
Good point - note that we do explicitly show on the model page which dataset the model was trained on:
Also, in case you're not aware, you can add your own self-reported evaluation results to the model card by following the model-index spec defined here. This would allow you to signal to users that the core results are from SQuAD, versus other evaluations performed on other datasets. Here's an example of what I am talking about:
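As a sketch of what such self-reported metadata could look like, the snippet below builds a model-index entry as a Python dict, with SQuAD as the core result plus an explicitly labelled out-of-domain result on adversarial_qa, and optionally writes it into the card's YAML metadata with huggingface_hub. The repo name and metric values are placeholders, and the linked spec is the authoritative reference for the exact fields.

```python
# Sketch of a model-index entry: SQuAD as the "core" result plus an
# out-of-domain evaluation on adversarial_qa. All values are placeholders.
from huggingface_hub import metadata_update

model_index = {
    "model-index": [
        {
            "name": "your-org/your-squad-model",  # hypothetical repo id
            "results": [
                {
                    "task": {"type": "question-answering", "name": "Question Answering"},
                    "dataset": {"type": "squad", "name": "SQuAD"},
                    "metrics": [
                        {"type": "exact_match", "value": 0.0},  # placeholder
                        {"type": "f1", "value": 0.0},           # placeholder
                    ],
                },
                {
                    "task": {"type": "question-answering", "name": "Question Answering"},
                    "dataset": {
                        "type": "adversarial_qa",
                        "name": "adversarial_qa (adversarialQA)",
                        "config": "adversarialQA",
                        "split": "validation",
                    },
                    "metrics": [
                        {"type": "exact_match", "value": 0.0},  # placeholder
                        {"type": "f1", "value": 0.0},           # placeholder
                    ],
                },
            ],
        }
    ]
}

# Uncomment to commit this into the model repo's README metadata
# (exact helper and arguments may vary with the huggingface_hub version):
# metadata_update("your-org/your-squad-model", model_index, overwrite=True)
```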
> Therefore, in my opinion, it makes sense to compare models on its test split that have been finetuned using its training split. I could also imagine that one might want to compare models on its test split that have not been finetuned using its training split. That matches your vanilla SQuAD model/domain adaptation example. However, I wouldn't mix these two kinds of models.
Thanks for sharing this interesting perspective. You're right that models finetuned / evaluated on the same corpus will typically fare better than those evaluated in a cross-domain fashion. I'll have a think about how complex the leaderboard UX would be for distinguishing these cases, although my impression is that most Hub users are used to looking at single leaderboards per dataset.