suryanshs16103 committed
Commit fb97fda · verified · 1 Parent(s): 9fba4d8

update contamination.csv

## What are you reporting:
- CNN/DailyMail dataset checked against the allenai C4 corpus (no contamination found)

**Evaluation dataset(s)**: CNN/DailyMail. Path to dataset is `cnn_dailymail`.

**Contaminated model(s)**: Not Applicable

**Contaminated corpora**: allenai C4. Path to dataset is `allenai/c4`.

**Contaminated split(s)**: Test split, 0% contamination found

> You may also report instances where there is no contamination. In such cases, follow the previous instructions but report a contamination level of 0%.

## Briefly describe your method to detect data contamination

#### Data-based approaches
I used a data-based approach to check whether instances of the evaluation dataset appear in the training corpus. First, I preprocessed both datasets in the same way and built an index over the training data. I then ran an exact-match search for each evaluation instance against that index and recorded any matches. After calculating and reporting the percentage of contaminated instances, I optionally checked for partial matches using n-gram overlap to identify near-duplicates.
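The steps described above could look roughly like the following in-memory sketch. It is only an illustration: the report does not say which tooling, preprocessing rules, n-gram size, or overlap threshold were used, so `normalize`, `ngrams`, `contamination_report`, `n=8`, and `threshold=0.5` are all assumptions, and a corpus the size of C4 would need a disk-backed or distributed index rather than Python sets.

```python
import re


def normalize(text: str) -> str:
    """Assumed preprocessing: lowercase, strip punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def ngrams(text: str, n: int) -> set:
    """Return the set of word n-grams of an already normalized text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_report(eval_texts, corpus_texts, n=8, threshold=0.5):
    """Count exact and near-duplicate matches of eval instances in a corpus.

    n and threshold are illustrative defaults, not values from the report.
    """
    corpus_norm = [normalize(doc) for doc in corpus_texts]

    # Index 1: full normalized documents, for exact-match lookup.
    exact_index = set(corpus_norm)
    # Index 2: all corpus n-grams, for the partial-match check.
    ngram_index = set()
    for doc in corpus_norm:
        ngram_index |= ngrams(doc, n)

    exact = near = 0
    for text in eval_texts:
        norm = normalize(text)
        if norm in exact_index:
            exact += 1
            continue
        grams = ngrams(norm, n)
        # Near-duplicate: enough of the instance's n-grams occur in the corpus.
        if grams and len(grams & ngram_index) / len(grams) >= threshold:
            near += 1

    total = len(eval_texts)
    return {
        "exact": exact,
        "near": near,
        "exact_pct": 100.0 * exact / total if total else 0.0,
        "near_pct": 100.0 * near / total if total else 0.0,
    }
```

Under these assumptions, a reported contamination level of 0% corresponds to `exact_pct` being `0.0` for the test split; `near_pct` is a separate, looser signal.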

## Citation

URL: `https://arxiv.org/abs/2310.20707`
Citation: `@misc{elazar2024whats,
title={What's In My Big Data?},
author={Yanai Elazar and Akshita Bhagia and Ian Magnusson and Abhilasha Ravichander and Dustin Schwenk and Alane Suhr and Pete Walsh and Dirk Groeneveld and Luca Soldaini and Sameer Singh and Hanna Hajishirzi and Noah A. Smith and Jesse Dodge},
year={2024},
eprint={2310.20707},
archivePrefix={arXiv},
primaryClass={cs.CL}
}`


*Important!* If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.
- Full name: Suryansh Sharma
- Institution: Indian Institute of Technology Kharagpur
- Email: suryansh.s@kgpian.iitkgp.ac.in

Files changed (1)
  1. contamination_report.csv +2 -0
contamination_report.csv CHANGED
@@ -707,3 +707,5 @@ zest;;EleutherAI/pile;;corpus;;;0.0;data-based;https://arxiv.org/abs/2310.20707;
 zest;;allenai/c4;;corpus;;;0.0;data-based;https://arxiv.org/abs/2310.20707;2
 zest;;oscar-corpus/OSCAR-2301;;corpus;;;0.0;data-based;https://arxiv.org/abs/2310.20707;2
 zest;;togethercomputer/RedPajama-Data-V2;;corpus;;;0.0;data-based;https://arxiv.org/abs/2310.20707;2
+
+cnn_dailymail;;allenai/c4;;corpus;;;0.0;data-based;https://arxiv.org/abs/2310.20707;