add flores contamination in xP3

#20

What are you reporting:

  • Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
  • Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)

Evaluation dataset(s): Name(s) of the evaluation dataset(s). If available in the HuggingFace Hub please write the path (e.g. uonlp/CulturaX), otherwise provide a link to a paper, GitHub or dataset-card.

  • facebook/flores

Contaminated model(s): Name of the model(s) (if any) that have been contaminated with the evaluation dataset. If available in the HuggingFace Hub please list the corresponding paths (e.g. allenai/OLMo-7B).

All models trained on bigscience/xP3:

  • bigscience/bloomz
  • bigscience/bloomz-560m
  • bigscience/bloomz-1b1
  • bigscience/bloomz-1b7
  • bigscience/bloomz-3b
  • bigscience/bloomz-7b1
  • bigscience/mt0-small
  • bigscience/mt0-base
  • bigscience/mt0-large
  • bigscience/mt0-xl
  • bigscience/mt0-xxl

Contaminated corpora: Name of the corpora used to pretrain models (if any) that have been contaminated with the evaluation dataset. If available in the HuggingFace hub please write the path (e.g. CohereForAI/aya_dataset)

  • bigscience/xP3

Contaminated split(s): If the dataset has Train, Development and/or Test splits please report the contaminated split(s). You can report a percentage of the dataset contaminated; if the entire dataset is compromised, report 100%.

From the xP3 paper it is unclear which split is used (dev, devtest, or both). I manually checked the dataset: the data files containing FLORES all have 997 rows, which matches the size of the dev split and indicates that dev is the split used.
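The length check can be sketched roughly as follows. This is a minimal illustration, not the exact procedure used for this report: the file names and row counts in `file_row_counts` are hypothetical placeholders, and the split sizes are the public FLORES-200 sizes (997 for dev, 1012 for devtest). A matching row count is evidence of which split was used, not proof.

```python
from typing import Optional

# Public FLORES-200 split sizes (sentences per language).
FLORES_SPLIT_SIZES = {"dev": 997, "devtest": 1012}


def infer_flores_split(num_rows: int) -> Optional[str]:
    """Return the public FLORES split whose size equals num_rows, else None."""
    for split, size in FLORES_SPLIT_SIZES.items():
        if num_rows == size:
            return split
    return None


# Hypothetical row counts for xP3 data files derived from FLORES;
# in practice these would be counted from the actual files.
file_row_counts = {
    "flores_file_a.jsonl": 997,
    "flores_file_b.jsonl": 997,
}

inferred = {name: infer_flores_split(n) for name, n in file_row_counts.items()}
print(inferred)
```

If every file maps to the same split, that split is the likely source; a `None` result means the file needs a closer look (for example, filtered or merged data).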

You may also report instances where there is no contamination. In such cases, follow the previous instructions but report a contamination level of 0%.

Briefly describe your method to detect data contamination

  • Data-based approach
  • Model-based approach

https://arxiv.org/pdf/2304.04675 points out that "BLOOMZ is instruction-tuned with XP3 dataset (Scao et al., 2022), which includes FLORES-200 dataset." This is mentioned in the xP3 paper as well, although very little detail is provided.

Manual inspection clearly shows that the dev set of size 997 is included in the dataset. Even though the devtest split is not included, it is still undesirable to train models on the dev split.
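A data-based check of this kind could be automated along the following lines. This is only a sketch under stated assumptions: the sentences below are toy placeholders rather than real FLORES or xP3 content, and `normalize` and `contamination_rate` are hypothetical helper names. Substring matching is used because xP3 wraps source sentences in prompt templates, so exact row equality would miss them.

```python
from typing import Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def contamination_rate(eval_sentences: List[str], corpus_texts: Iterable[str]) -> float:
    """Fraction of evaluation sentences that occur verbatim inside any corpus text."""
    if not eval_sentences:
        return 0.0
    corpus = "\n".join(normalize(t) for t in corpus_texts)
    hits = sum(1 for s in eval_sentences if normalize(s) in corpus)
    return hits / len(eval_sentences)


# Toy placeholder data (not taken from FLORES or xP3).
eval_sentences = ["The cat sat on the mat.", "It rained all day."]
corpus_texts = [
    "Translate to French: The cat sat on the mat.",
    "Some unrelated training example.",
]

rate = contamination_rate(eval_sentences, corpus_texts)
print(f"{rate:.0%} of evaluation sentences found in the corpus")
```

Run against the full dev split and the FLORES-derived xP3 files, a rate of 1.0 would correspond to reporting 100% contamination for that split.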

Citation

Is there a paper that reports the data contamination or describes the method used to detect data contamination?

URL: https://aclanthology.org/2023.acl-long.891.pdf
Citation:

@inproceedings{muennighoff-etal-2023-crosslingual,
    title = "Crosslingual Generalization through Multitask Finetuning",
    author = "Muennighoff, Niklas  and
      Wang, Thomas  and
      Sutawika, Lintang  and
      Roberts, Adam  and
      Biderman, Stella  and
      Le Scao, Teven  and
      Bari, M Saiful  and
      Shen, Sheng  and
      Yong, Zheng Xin  and
      Schoelkopf, Hailey  and
      Tang, Xiangru  and
      Radev, Dragomir  and
      Aji, Alham Fikri  and
      Almubarak, Khalid  and
      Albanie, Samuel  and
      Alyafeai, Zaid  and
      Webson, Albert  and
      Raff, Edward  and
      Raffel, Colin",
    editor = "Rogers, Anna  and
      Boyd-Graber, Jordan  and
      Okazaki, Naoaki",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.891",
    doi = "10.18653/v1/2023.acl-long.891",
    pages = "15991--16111",
    abstract = "Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting, but so far explorations of MTF have focused on English data and models. We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0. We find finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages that appear only in the pretraining corpus. Finetuning on multilingual tasks with English prompts further improves performance on English and non-English tasks leading to various state-of-the-art zero-shot results. We also investigate finetuning on multilingual tasks with prompts that have been machine-translated from English to match the language of each dataset. We find training on these machine-translated prompts leads to better performance on human-written prompts in the respective languages. Surprisingly, we find models are capable of zero-shot generalization to tasks in languages they have never intentionally seen. We conjecture that the models are learning higher-level capabilities that are both task- and language-agnostic. In addition, we introduce xP3, a composite of supervised datasets in 46 languages with English and machine-translated prompts. Our code, datasets and models are freely available at \url{https://github.com/bigscience-workshop/xmtf}.",
}

Additionally I include the paper that points out the contamination issue:
URL: https://arxiv.org/pdf/2304.04675
Citation:

@article{zhu2023multilingual,
  title={Multilingual machine translation with large language models: Empirical results and analysis},
  author={Zhu, Wenhao and Liu, Hongyi and Dong, Qingxiu and Xu, Jingjing and Huang, Shujian and Kong, Lingpeng and Chen, Jiajun and Li, Lei},
  journal={arXiv preprint arXiv:2304.04675},
  year={2023}
}

Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.

  • Full name: David Stap
  • Institution: University of Amsterdam
  • Email: dd.stap@gmail.com
Workshop on Data Contamination org

Hi @davidstap !

Thanks! I am not very familiar with the Flores dataset, but it looks like (based on the paper) there are 2 versions: flores101 and flores200. What seems to be contaminated is flores101, right? What I understand from "BLOOMZ is instruction-tuned with XP3 dataset (Scao et al., 2022), which includes FLORES-200 dataset" is that the training part of FLORES-200 was used for the instruction tuning, and unfortunately, it contained some (or all) examples from the development split of FLORES-101.

Am I missing something?
Oscar

Hi @OSainz , thanks for your reply!

Some clarifications:

  • FLORES-101 is a subset of FLORES-200. (FLORES-200 includes an additional 99 languages.)
  • FLORES-200 is contaminated: muennighoff-etal-2023-crosslingual mention FLORES-200 in their paper.
  • FLORES-200 is not a training dataset, but is meant as a high-quality machine translation evaluation dataset. It has two public splits (dev and devtest, 997 and 1012 sentences, respectively) and a secret non-public test set. In practice, a lot of MT papers report scores on the devtest portion, and some use dev as validation data.
  • The facebook/flores dataset is FLORES-200.

Does that clear up the confusion?

David

Hi @davidstap , thank you for your explanation.

I see, then we should report a 100% contamination of the "dev" set for both corpus and models. Can you add this information to the table?

Additionally, you should also add the PR number (20).

Oscar

Hi @OSainz , thanks for your suggestions. Apologies for the slow response (I'm travelling), I have made the changes just now.

Hi @davidstap !

Thank you again for your contribution. I made minor changes to be consistent with the rest of the entries. Now I am merging to main :)

Best,
Oscar

OSainz changed pull request status to merged
