davidstap committed
Commit 27412a4
•
1 Parent(s): 95be02e

add flores contamination in xP3

## What are you reporting:
- [x] Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
- [ ] Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)

**Evaluation dataset(s)**: Name(s) of the evaluation dataset(s). If available in the HuggingFace Hub please write the path (e.g. `uonlp/CulturaX`), otherwise provide a link to a paper, GitHub or dataset-card.

* `facebook/flores`

**Contaminated model(s)**: Name of the model(s) (if any) that have been contaminated with the evaluation dataset. If available in the HuggingFace Hub please list the corresponding paths (e.g. `allenai/OLMo-7B`).

All models trained on `bigscience/xP3`:
* `bigscience/bloomz`
* `bigscience/bloomz-560m`
* `bigscience/bloomz-1b1`
* `bigscience/bloomz-1b7`
* `bigscience/bloomz-3b`
* `bigscience/bloomz-7b1`
* `bigscience/mt0-small`
* `bigscience/mt0-base`
* `bigscience/mt0-large`
* `bigscience/mt0-xl`
* `bigscience/mt0-xxl`

**Contaminated corpora**: Name of the corpora used to pretrain models (if any) that have been contaminated with the evaluation dataset. If available in the HuggingFace hub please write the path (e.g. `CohereForAI/aya_dataset`)

* `bigscience/xP3`

**Contaminated split(s)**: If the dataset has Train, Development and/or Test splits please report the contaminated split(s). You can report a percentage of the dataset contaminated; if the entire dataset is compromised, report 100%.

From the xP3 paper it is unclear which split is used (`dev`, `devtest`, or both). I manually checked the dataset: every data file whose name contains `flores` has 997 records, which matches the length of the `dev` split, so the `dev` split appears to be the one included.

> You may also report instances where there is no contamination. In such cases, follow the previous instructions but report a contamination level of 0%.
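
For reference, the split sizes can be confirmed directly from the Hub. A minimal sketch, assuming the `facebook/flores` loading script still exposes per-language configs such as `eng_Latn` (chosen purely as an example; any config gives the same counts) and noting that recent `datasets` versions may require `trust_remote_code=True`:

```python
# Minimal sketch: confirm the FLORES split sizes referenced above
# (dev = 997 sentences, devtest = 1012 sentences).
from datasets import load_dataset

# "eng_Latn" is an illustrative config choice; split sizes are identical
# across languages.
flores = load_dataset("facebook/flores", "eng_Latn", trust_remote_code=True)

for split_name, split in flores.items():
    print(split_name, len(split))
# Expected: dev 997, devtest 1012
```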

## Briefly describe your method to detect data contamination

- [x] Data-based approach
- [ ] Model-based approach

https://arxiv.org/pdf/2304.04675 points out that "BLOOMZ is instruction-tuned with XP3 dataset (Scao et al., 2022), which includes FLORES-200 dataset." This is also mentioned in the xP3 paper, although very little detail is provided there.

Manual inspection clearly shows that the `dev` split (997 examples) is included in xP3. Even though the `devtest` split is not included, training models on the `dev` split is still undesirable.
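
A rough sketch of this manual check, under the assumption that the relevant xP3 data files are JSON-lines and can be identified by `flores` in their file names (exact file-name patterns in `bigscience/xP3` are not guaranteed):

```python
# Rough sketch: list xP3 data files whose names contain "flores" and count
# the records in each. 997 records matches the FLORES `dev` split, while
# 1012 would indicate `devtest`.
from huggingface_hub import hf_hub_download, list_repo_files

REPO_ID = "bigscience/xP3"

flores_files = [
    f for f in list_repo_files(REPO_ID, repo_type="dataset") if "flores" in f.lower()
]
print(f"found {len(flores_files)} flores-related files")

for fname in flores_files[:5]:  # spot-check a handful of files
    path = hf_hub_download(REPO_ID, filename=fname, repo_type="dataset")
    with open(path, encoding="utf-8") as fh:
        n_records = sum(1 for line in fh if line.strip())
    print(fname, n_records)
```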

## Citation

Is there a paper that reports the data contamination or describes the method used to detect data contamination?

URL: `https://aclanthology.org/2023.acl-long.891.pdf`
Citation:
```
@inproceedings{muennighoff-etal-2023-crosslingual,
title = "Crosslingual Generalization through Multitask Finetuning",
author = "Muennighoff, Niklas and
Wang, Thomas and
Sutawika, Lintang and
Roberts, Adam and
Biderman, Stella and
Le Scao, Teven and
Bari, M Saiful and
Shen, Sheng and
Yong, Zheng Xin and
Schoelkopf, Hailey and
Tang, Xiangru and
Radev, Dragomir and
Aji, Alham Fikri and
Almubarak, Khalid and
Albanie, Samuel and
Alyafeai, Zaid and
Webson, Albert and
Raff, Edward and
Raffel, Colin",
editor = "Rogers, Anna and
Boyd-Graber, Jordan and
Okazaki, Naoaki",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.891",
doi = "10.18653/v1/2023.acl-long.891",
pages = "15991--16111",
abstract = "Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting, but so far explorations of MTF have focused on English data and models. We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0. We find finetuning large multilingual language models on English tasks with English prompts allows for task genrealization to non-English languages that appear only in the pretraining corpus. Finetuning on multilingual tasks with English prompts further improves performance on English and non-English tasks leading to various state-of-the-art zero-shot results. We also investigate finetuning on multilingual tasks with prompts that have been machine-translated from English to match the language of each dataset. We find training on these machine-translated prompts leads to better performance on human-written prompts in the respective languages. Surprisingly, we find models are capable of zero-shot generalization to tasks in languages they have never intentionally seen. We conjecture that the models are learning higher-level capabilities that are both task- and language-agnostic. In addition, we introduce xP3, a composite of supervised datasets in 46 languages with English and machine-translated prompts. Our code, datasets and models are freely available at \url{https://github.com/bigscience-workshop/xmtf}.",
}
```

Additionally, I include the paper that points out the contamination issue:
URL: `https://arxiv.org/pdf/2304.04675`
Citation:
```
@article{zhu2023multilingual,
  title={Multilingual machine translation with large language models: Empirical results and analysis},
  author={Zhu, Wenhao and Liu, Hongyi and Dong, Qingxiu and Xu, Jingjing and Huang, Shujian and Kong, Lingpeng and Chen, Jiajun and Li, Lei},
  journal={arXiv preprint arXiv:2304.04675},
  year={2023}
}
```

*Important!* If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.
- Full name: David Stap
- Institution: University of Amsterdam
- Email: dd.stap@gmail.com

Files changed (1): contamination_report.csv (+13 -0)

```diff
@@ -143,6 +143,19 @@ facebook/anli;test_r2;GPT-3;;model;;;18.0;data-based;https://arxiv.org/abs/2005.
 
 facebook/anli;test_r3;GPT-3;;model;;;16.0;data-based;https://arxiv.org/abs/2005.14165;13
 
+facebook/flores;;bigscience/xP3;;corpus;;;;data-based;https://aclanthology.org/2023.acl-long.891/;
+facebook/flores;;bigscience/bloomz;;model;;;;data-based;https://aclanthology.org/2023.acl-long.891/;
+facebook/flores;;bigscience/bloomz-1b1;;model;;;;data-based;https://aclanthology.org/2023.acl-long.891/;
+facebook/flores;;bigscience/bloomz-1b7;;model;;;;data-based;https://aclanthology.org/2023.acl-long.891/;
+facebook/flores;;bigscience/bloomz-3b;;model;;;;data-based;https://aclanthology.org/2023.acl-long.891/;
+facebook/flores;;bigscience/bloomz-560m;;model;;;;data-based;https://aclanthology.org/2023.acl-long.891/;
+facebook/flores;;bigscience/bloomz-7b1;;model;;;;data-based;https://aclanthology.org/2023.acl-long.891/;
+facebook/flores;;bigscience/mt0-base;;model;;;;data-based;https://aclanthology.org/2023.acl-long.891/;
+facebook/flores;;bigscience/mt0-large;;model;;;;data-based;https://aclanthology.org/2023.acl-long.891/;
+facebook/flores;;bigscience/mt0-small;;model;;;;data-based;https://aclanthology.org/2023.acl-long.891/;
+facebook/flores;;bigscience/mt0-xl;;model;;;;data-based;https://aclanthology.org/2023.acl-long.891/;
+facebook/flores;;bigscience/mt0-xxl;;model;;;;data-based;https://aclanthology.org/2023.acl-long.891/;
+
 gigaword;;EleutherAI/pile;;corpus;;;1.18;data-based;https://arxiv.org/abs/2310.20707;2
 gigaword;;allenai/c4;;corpus;;;0.15;data-based;https://arxiv.org/abs/2310.20707;2
 gigaword;;oscar-corpus/OSCAR-2301;;corpus;;;0.36;data-based;https://arxiv.org/abs/2310.20707;2
```
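
As a final sanity check on the CSV itself, a small sketch that assumes nothing about the column semantics beyond what is visible above (semicolon-delimited, one record per line) and only verifies that the new `facebook/flores` rows carry the same number of fields as the existing entries:

```python
# Sanity check: every non-empty row in contamination_report.csv should have
# the same number of semicolon-separated fields.
import csv

with open("contamination_report.csv", newline="", encoding="utf-8") as fh:
    rows = [row for row in csv.reader(fh, delimiter=";") if row]

field_counts = {len(row) for row in rows}
assert len(field_counts) == 1, f"inconsistent field counts: {field_counts}"

flores_rows = [row for row in rows if row[0] == "facebook/flores"]
print(f"{len(flores_rows)} facebook/flores rows, {field_counts.pop()} fields each")
```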