Model points to wrong dataset

#1
by ArthurBaia - opened

Hello there. Lately I've been using the Portuguese SQuAD dataset that this particular model uses (link to it: https://drive.google.com/file/d/1Q0IaIlv2h2BC468MwUFmUST0EyN7gNkn/view?usp=sharing ), and I noticed that the Hub is pointing to a different Portuguese SQuAD dataset (the one in the image below):
[Image: Screenshot from 2022-07-13 17-17-34.png]

There's a problem either in the Hub or in the model description. It can lead people to mistakenly use the wrong data and, in the worst case, to train a much worse model than Pierre's.
I've trained two models, one with Pierre's data and another with the Hub data, and the results are pretty far from each other:

Pierre's SQuAD-pt dataset -> F1 = 82% and EM = 70%
Hugging Face's SQuAD-pt dataset -> F1 = 62% and EM = 51%
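
For anyone not familiar with the two metrics, here is a minimal sketch of how SQuAD-style exact match (EM) and token-level F1 are usually computed for a single prediction. This is not the exact code from the notebooks: the helper names are hypothetical, and the normalization is an assumption (lowercasing, stripping ASCII punctuation, collapsing whitespace). The official English SQuAD script also removes the articles "a", "an" and "the", which I left out here because it does not carry over to Portuguese.

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

The dataset-level numbers are the averages of these per-question scores (taking the best score over the reference answers when a question has several), reported as percentages.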

You can check the Colab notebooks and the models in the links below:
Training link = https://colab.research.google.com/drive/1FaUrktnvgKBQa3sI4Tfuyve6iJQceUuE
Validation link = https://colab.research.google.com/drive/1MeFWvLWxGNusOZvCwY9P3GQSbYdBIC1X?usp=sharing
Model trained with Pierre's dataset = https://drive.google.com/drive/folders/108eX1kCYe4BmkEmQLJGoPqqBzN9ktPuA?usp=sharing
Model trained with Hugging Face's dataset = https://drive.google.com/drive/folders/11T_9_zEuiDcJsvOapZF9e8BgyjYA1lqA?usp=sharing

I guess there are two ways of solving this problem:

  1. Remove the link from this model to that dataset (SQuAD_v1_pt)
  2. Add Pierre's dataset (which is much better than the Hub one) to the Hub

I just created a repo to upload the Brazilian Portuguese version: https://huggingface.co/datasets/ArthurBaia/SQuAD_v1.1_pt-br
