CONDA-Workshop/Data-Contamination-Database · Add Aquila model series which have gsm8k test set contamination

May 5

What are you reporting:

Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)

Evaluation dataset(s): Name(s) of the evaluation dataset(s). If available in the HuggingFace Hub please write the path (e.g. uonlp/CulturaX), otherwise provide a link to a paper, GitHub or dataset-card.
gsm8k
Contaminated model(s): Name of the model(s) (if any) that have been contaminated with the evaluation dataset. If available in the HuggingFace Hub please list the corresponding paths (e.g. allenai/OLMo-7B).
Aquila2-34B, AquilaChat2-34B

Briefly describe your method to detect data contamination

Data-based approach
Model-based approach

The official release READMEs state that the pre-training dataset of the 34B-parameter models in the Aquila2 series is contaminated with gsm8k test set data.

Citation

Is there a paper that reports the data contamination or describes the method used to detect data contamination?

URL: https://huggingface.co/BAAI/Aquila2-34B/blob/main/README.md , https://huggingface.co/BAAI/AquilaChat2-34B/blob/main/README.md
Citation:

Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.

Full name: Bhavish Pahwa
Institution: Microsoft Research
Email: t-bpahwa@microsoft.com

Add Aquila model series which have gsm8k test set contamination (9cf7873a)

Merge remote-tracking branch 'origin/main' into pr/21 (446d97b8)

Add PR number + postprocessing (ecb29a1b)

OSainz

Workshop on Data Contamination org May 6

Hi @bpHigh !

Thank you for your contribution! Merging to main.

Best,
Oscar

OSainz changed pull request status to merged May 6