Spaces: CONDA-Workshop/Data-Contamination-Database

Add data from "An Open-Source Data Contamination Report for Large Language Models"

by vishaal27 - opened Apr 18, 2024

base: refs/heads/main ← from: refs/pr/5

+113 -78

Commit 6169ce28: Add data from "An Open-Source Data Contamination Report for Large Language Models"

vishaal27 - Apr 18, 2024 (edited Apr 18, 2024)

What are you reporting:

Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)

Evaluation dataset(s): ARC, CommonsenseQA, Winogrande, C-Eval, Hellaswag, MMLU

Contaminated model(s): NA

Contaminated corpora: Most Common Crawl variants, including C4.

Contaminated split(s): Mostly dev and test splits; the exact splits are specified in the commit.

Briefly describe your method to detect data contamination

Data-based approach
Model-based approach

Description of your method, 3-4 sentences. Evidence of data contamination (Read below):
It is an exact-match, data-driven approach based on web search and URL matching in Common Crawl. For details on the contamination hits, see Sec. 4 and Fig. 1 of the paper: https://arxiv.org/abs/2310.17589
Evidence is provided here: https://github.com/liyucheng09/Contamination_Detector
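
To illustrate the idea, here is a minimal sketch of an exact-match check combined with a domain check. It is not the code from the linked repository: the function names, the `pages` input (URL and page text pairs, such as a web search might return), and the `indexed_domains` set are all hypothetical, and the normalization step is an assumption.

```python
from urllib.parse import urlparse


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not block a match."""
    return " ".join(text.lower().split())


def is_contaminated(
    sample: str,
    pages: list[tuple[str, str]],
    indexed_domains: set[str],
) -> bool:
    """Return True if `sample` appears verbatim on a page whose domain is in `indexed_domains`.

    pages: hypothetical (url, page_text) pairs, e.g. web search results.
    indexed_domains: hypothetical set of domains known to be in the crawl.
    """
    needle = normalize(sample)
    for url, page_text in pages:
        domain = urlparse(url).netloc.lower()
        if domain in indexed_domains and needle in normalize(page_text):
            return True
    return False


# Illustrative data only; not taken from the paper or the repository.
pages = [
    ("https://example.com/quiz", "Q: The  sky is BLUE because of Rayleigh scattering."),
    ("https://other.org/post", "Unrelated text."),
]
print(is_contaminated("the sky is blue", pages, {"example.com"}))  # True
print(is_contaminated("the sky is blue", pages, {"other.org"}))    # False
```

A sample counts as a hit only when both conditions hold: the text matches exactly after normalization, and the hosting domain is one the crawl is assumed to index.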

Citation

This is the citation:
URL: https://arxiv.org/pdf/2310.17589.pdf
Citation:

@article{Li2023AnOS,
  title={An Open Source Data Contamination Report for Large Language Models},
  author={Yucheng Li},
  journal={ArXiv},
  year={2023},
  volume={abs/2310.17589}
}

Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.

Full name: Vishaal Udandarao
Institution: University of Tuebingen, University of Cambridge
Email: vu214@cam.ac.uk

Iker (Workshop on Data Contamination org) - Apr 18, 2024

Thank you @vishaal27 !
The paper appears to reference CommonCrawl, but you have linked to the C4 corpus, which is a cleaned version. Can you confirm whether the figures are identical for the C4 dataset, or can you update the PR to link CommonCrawl instead of C4? There's no need to link to a Hugging Face dataset if it isn't available on Hugging Face.

vishaal27 - Apr 18, 2024

Thanks for the comment @Iker !
The numbers mentioned in the paper are an estimate based on a Bing search plus URL/domain verification of whether that domain is indexed by Common Crawl. The numbers added in the table are from a search of Common Crawl dumps from 2020.10-2023.10, whereas C4 uses the April 2019 Common Crawl dump. So should I update it to reflect the exact time period of the Common Crawl dumps considered by the paper?

Iker (Workshop on Data Contamination org) - Apr 19, 2024

Hi @vishaal27 !

You should update it to reflect that the experiments were done with Common Crawl, not C4 (although similar, they are not the same corpus, and we cannot be sure that the numbers reported in the paper for CC also hold for C4). That is, in the third column, replace allenai/c4 with CommonCrawl.
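
If it helps, the replacement could be scripted along these lines. This is only a sketch: it assumes the report table is semicolon-delimited text with the corpus in the third column (index 2), and the `replace_corpus` helper and the sample rows are hypothetical, so check the actual file layout first.

```python
import csv
import io


def replace_corpus(
    csv_text: str,
    old: str,
    new: str,
    column: int = 2,       # assumed: corpus is the third column (0-based index 2)
    delimiter: str = ";",  # assumed delimiter; adjust to the real file
) -> str:
    """Replace `old` with `new` in a single column of delimited text."""
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    for row in rows:
        # Only touch rows that have the column and hold exactly the old value.
        if len(row) > column and row[column] == old:
            row[column] = new
    out = io.StringIO()
    csv.writer(out, delimiter=delimiter, lineterminator="\n").writerows(rows)
    return out.getvalue()


# Illustrative rows only; not the real contents of the PR.
sample = "ARC;test;allenai/c4\nMMLU;dev;other\n"
print(replace_corpus(sample, "allenai/c4", "CommonCrawl"))
```

Matching on the exact cell value, rather than doing a global text substitution, avoids changing rows where allenai/c4 legitimately appears in another column.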

Updating a PR on Hugging Face can be tricky; if you need help with it, let me know :D