CONDA-Workshop/Data-Contamination-Database

May 16

What are you reporting:

The glue-ax, glue-mnli-matched, glue-mnli-mismatched, glue-mrpc, glue-rte, glue-stsb, and glue-wnli datasets were found in the EleutherAI/pile dataset.

Evaluation dataset(s): I have used glue-ax, glue-mnli-matched, glue-mnli-mismatched, glue-mrpc, glue-rte, glue-stsb, glue-wnli. These datasets are not available at Hugging Face.

Contaminated model(s): Not Applicable

Contaminated corpora: I have used the Pile dataset. The path to the dataset is 'EleutherAI/pile'.

Contaminated split(s): The test splits were found to be 5.07%, 2.17%, 2.11%, 0.64%, 0.13%, 11.09%, and 0.0% contaminated, respectively, in the order the evaluation datasets are listed above.

You may also report instances where there is no contamination. In such cases, follow the previous instructions but report a contamination level of 0%.

Briefly describe your method to detect data contamination

Data-based approaches

Data contamination is detected using WIMBD, which has two main components: (1) a search tool utilizing an Elasticsearch index for retrieving and analyzing document occurrences, and (2) a count functionality built with map-reduce for quick iteration and extraction of relevant information like duplicates, PII, and domain counts. This allows for scalable analysis and comparison across web-scale datasets.
The contamination percentages reported above can be verified in Appendix B.3.1 "Benchmark Contamination" of the cited paper.
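To illustrate how a per-split contamination percentage of this kind can be computed, here is a minimal sketch. It is not WIMBD's implementation: WIMBD queries an Elasticsearch index over the corpus, whereas this toy version scans a small in-memory list of documents. The matching rule (a test example counts as contaminated when all of its input fields appear verbatim in the same document) and the names `is_contaminated` and `contamination_rate` are assumptions made for illustration only.

```python
from typing import Iterable, Sequence


def is_contaminated(example_fields: Sequence[str], documents: Iterable[str]) -> bool:
    """Return True if one single document contains every input field verbatim.

    Assumption: case-insensitive exact substring matching stands in for
    the index-backed search a real tool would use.
    """
    fields = [f.strip().lower() for f in example_fields if f.strip()]
    if not fields:
        return False
    return any(all(f in doc.lower() for f in fields) for doc in documents)


def contamination_rate(test_split: Sequence[Sequence[str]], documents: Iterable[str]) -> float:
    """Percentage of test examples flagged as contaminated."""
    docs = list(documents)  # materialize so the corpus can be scanned repeatedly
    if not test_split:
        return 0.0
    hits = sum(is_contaminated(example, docs) for example in test_split)
    return 100.0 * hits / len(test_split)


# Toy corpus and toy two-input (NLI-style) test split; both are made up.
corpus = [
    "A man is playing a guitar. Later, a person makes music on stage.",
    "The weather was cold.",
]
test_split = [
    ("A man is playing a guitar.", "A person makes music on stage."),  # both fields in doc 0
    ("The cat sat.", "An animal rested."),                             # no field found
    ("The weather was cold.", "It was snowing."),                      # only one field found
    ("Prices rose sharply.", "Inflation increased."),                  # no field found
]

rate = contamination_rate(test_split, corpus)
print(f"{rate:.2f}% of the test split is contaminated")
```

Requiring every field to occur in the same document makes the check stricter than matching fields independently, which keeps short, common sentences from being flagged on their own.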

Citation

URL: https://arxiv.org/abs/2310.20707
Citation: @misc{elazar2024whats, title={What's In My Big Data?}, author={Yanai Elazar and Akshita Bhagia and Ian Magnusson and Abhilasha Ravichander and Dustin Schwenk and Alane Suhr and Pete Walsh and Dirk Groeneveld and Luca Soldaini and Sameer Singh and Hanna Hajishirzi and Noah A. Smith and Jesse Dodge}, year={2024}, eprint={2310.20707}, archivePrefix={arXiv}, primaryClass={cs.CL} }

Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.

Full name: Suryansh Sharma
Institution: Indian Institute of Technology Kharagpur
Email: suryansh.s@kgpian.iitkgp.ac.in

Update contamination.csv (c3eccc20)

OSainz

Workshop on Data Contamination org May 17

Hi @suryanshs16103 !

Unfortunately, the evidence you are trying to add is already in the database. For instance, try searching for "glue" in the Evaluation dataset field. Additionally, those datasets are actually available on Hugging Face: you can find them as subsets of the nyu-mll/glue and super_glue datasets.

In any case, thank you for your contribution. I am closing this PR.

Best,
Oscar

OSainz changed pull request status to closed May 17