Add Aquila model series, which has gsm8k test set contamination

#21
by bpHigh - opened

What are you reporting:

  • Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
  • Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)

Evaluation dataset(s): Name(s) of the evaluation dataset(s). If available in the HuggingFace Hub please write the path (e.g. uonlp/CulturaX), otherwise provide a link to a paper, GitHub or dataset-card.
gsm8k
Contaminated model(s): Name of the model(s) (if any) that have been contaminated with the evaluation dataset. If available in the HuggingFace Hub please list the corresponding paths (e.g. allenai/OLMo-7B).
Aquila2-34B, AquilaChat2-34B

Briefly describe your method to detect data contamination

  • Data-based approach
  • Model-based approach

The official release README states that the pre-training dataset of the 34B-parameter versions of the Aquila2 series is contaminated with gsm8k test set data.

[Attached screenshot: Screenshot 2024-05-05 at 7.17.12 PM.png]

Citation

Is there a paper that reports the data contamination or describes the method used to detect data contamination?

URL: https://huggingface.co/BAAI/Aquila2-34B/blob/main/README.md, https://huggingface.co/BAAI/AquilaChat2-34B/blob/main/README.md
Citation:

Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.

Workshop on Data Contamination org

Hi @bpHigh !

Thank you for your contribution! Merging to main.

Best,
Oscar

OSainz changed pull request status to merged
