CONDA-Workshop/Data-Contamination-Database

suryanshs16103

May 16

What are you reporting:

CNN/DailyMail dataset checked against the allenai c4 dataset (0% contamination found)

Evaluation dataset(s): CNN/DailyMail. Path to dataset is cnn_dailymail.

Contaminated model(s): Not Applicable

Contaminated corpora: allenai c4. Path to dataset is 'allenai/c4'.

Contaminated split(s): Test split, 0% contamination found

You may also report instances where there is no contamination. In such cases, follow the previous instructions but report a contamination level of 0%.

Briefly describe your method to detect data contamination

Data-based approaches

I used a data-based approach to check whether instances of the evaluation dataset appear in the training corpus. First, I preprocessed both datasets in the same way and built an index over the training data. I then ran an exact-match search for each evaluation instance against that index and recorded any matches. After computing and reporting the percentage of contaminated instances, I optionally checked for partial matches using n-gram overlap to identify near-duplicates.
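
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code used for this report: the function names (`normalize`, `ngrams`, `contamination_report`), the normalization rules, the in-memory set index, the n-gram size `n=3`, and the `threshold=0.8` cutoff are all assumptions chosen for the example. A corpus the size of c4 would need a streamed or disk-backed index rather than Python sets.

```python
import re


def normalize(text: str) -> str:
    """Assumed preprocessing: lowercase, drop punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def ngrams(text: str, n: int) -> set:
    """Return the set of word n-grams (as tuples) in an already normalized text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_report(train_docs, eval_docs, n=3, threshold=0.8):
    """Exact-match check first, then an n-gram overlap check for near-duplicates.

    `n` and `threshold` are illustrative values, not taken from the report.
    """
    train_norm = [normalize(doc) for doc in train_docs]

    # Index over the training data: whole normalized documents for exact
    # matching, plus the union of all n-grams for the partial-match pass.
    exact_index = set(train_norm)
    ngram_index = set()
    for doc in train_norm:
        ngram_index |= ngrams(doc, n)

    exact, near = [], []
    for i, doc in enumerate(eval_docs):
        norm = normalize(doc)
        if norm in exact_index:
            exact.append(i)
            continue
        grams = ngrams(norm, n)
        # Fraction of the evaluation instance's n-grams seen in training data.
        if grams and len(grams & ngram_index) / len(grams) >= threshold:
            near.append(i)

    exact_pct = 100.0 * len(exact) / len(eval_docs) if eval_docs else 0.0
    return {"exact": exact, "near": near, "exact_pct": exact_pct}
```

Calling `contamination_report(train_docs, eval_docs)` with two lists of strings returns the indices of exact and near-duplicate evaluation instances along with the exact-match percentage; a result of 0% corresponds to an empty `exact` list.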

Citation

URL: https://arxiv.org/abs/2310.20707
Citation: @misc{elazar2024whats, title={What's In My Big Data?}, author={Yanai Elazar and Akshita Bhagia and Ian Magnusson and Abhilasha Ravichander and Dustin Schwenk and Alane Suhr and Pete Walsh and Dirk Groeneveld and Luca Soldaini and Sameer Singh and Hanna Hajishirzi and Noah A. Smith and Jesse Dodge}, year={2024}, eprint={2310.20707}, archivePrefix={arXiv}, primaryClass={cs.CL} }

Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.

Full name: Suryansh Sharma
Institution: Indian Institute of Technology Kharagpur
Email: suryansh.s@kgpian.iitkgp.ac.in

update contamination.csv fb97fdab

suryanshs16103 changed pull request status to closed May 16