Add data from "An Open-Source Data Contamination Report for Large Language Models"

#5
by vishaal27 - opened

What are you reporting:

  • Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
  • Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)

Evaluation dataset(s): ARC, CommonsenseQA, Winogrande, C-Eval, Hellaswag, MMLU

Contaminated model(s): NA

Contaminated corpora: Most Common Crawl variants, including C4.

Contaminated split(s): Mostly dev and test splits; this is specified in the commit.

Briefly describe your method to detect data contamination

  • Data-based approach
  • Model-based approach

Description of your method, 3-4 sentences. Evidence of data contamination (Read below):
It is an exact-match, data-driven approach based on web search and URL matching in Common Crawl. See Section 4 and Figure 1 of this paper for exact details on the contamination hits: https://arxiv.org/abs/2310.17589
Evidence is provided here: https://github.com/liyucheng09/Contamination_Detector
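
For readers who want a concrete picture of the check described above, here is a minimal sketch, not the paper's actual pipeline. It assumes the web-search results have already been fetched as (URL, page text) pairs and that a set of domains known to be indexed by Common Crawl is available; the names `search_results`, `indexed_domains`, and all function names are hypothetical.

```python
from urllib.parse import urlparse


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not block a match."""
    return " ".join(text.lower().split())


def is_exact_match(sample: str, page_text: str) -> bool:
    """True if the normalized benchmark sample appears verbatim in the normalized page text."""
    return normalize(sample) in normalize(page_text)


def domain_of(url: str) -> str:
    """Return the host of a URL without a leading 'www.' (empty string if there is no host)."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def contamination_hits(sample, search_results, indexed_domains):
    """Return the URLs whose page contains the sample and whose domain is in the indexed set.

    search_results: iterable of (url, page_text) pairs (assumed to come from a web search).
    indexed_domains: set of domains assumed to be indexed by Common Crawl.
    """
    hits = []
    for url, page_text in search_results:
        if is_exact_match(sample, page_text) and domain_of(url) in indexed_domains:
            hits.append(url)
    return hits
```

A sample would count as a contamination hit under this sketch only when both conditions hold, which mirrors the "exact match plus URL/domain verification" idea; how the paper normalizes text and verifies domains may differ.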

Citation

This is the citation:
URL: https://arxiv.org/pdf/2310.17589.pdf
Citation:



@article{Li2023AnOS,
  title={An Open Source Data Contamination Report for Large Language Models},
  author={Yucheng Li},
  journal={ArXiv},
  year={2023},
  volume={abs/2310.17589}}

Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.

  • Full name: Vishaal Udandarao
  • Institution: University of Tuebingen, University of Cambridge
  • Email: vu214@cam.ac.uk
Workshop on Data Contamination org

Thank you @vishaal27 !
The paper appears to reference CommonCrawl, but you have linked to the C4 corpus, which is a cleaned version. Can you confirm whether the figures are identical for the C4 dataset, or can you update the PR to link CommonCrawl instead of C4? There's no need to link to a Hugging Face dataset if it isn't available on Hugging Face.

Thanks for the comment @Iker !
The numbers in the paper are estimates based on a Bing search plus URL/domain verification of whether that domain is indexed by Common Crawl. The numbers added to the table come from a search of Common Crawl dumps from 2020.10 to 2023.10, whereas C4 uses the April 2019 Common Crawl dump. So should I update it to reflect the exact time period of the Common Crawl dumps considered by the paper?


Hi @vishaal27 !

You should update it to reflect that the experiments were done with Common Crawl, not C4 (although similar, they are not the same corpus, and we cannot be sure that the numbers reported in the paper for CC also hold for C4). That is, in the third column, replace allenai/c4 with CommonCrawl.

Updating a PR on Hugging Face can be tricky. If you need help with it, let me know :D

Thanks for the comments @Iker -- I managed to update the PR, could you check now please! :)


Thank you @vishaal27 ! I will merge the PR :D

Iker changed pull request status to merged

Adding the paper's author info for the report:
