Add data from "Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus"

by vishaal27 - opened Apr 18, 2024

vishaal27, Apr 18, 2024 (edited Apr 18, 2024)

What are you reporting:

Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)

Evaluation dataset(s): LAMA (T-REx), LAMA (Google-RE), XSum, TIFU-short, TIFU-long, WikiBio, AMR-to-text, GLUE (BoolQ, CoLA, MNLI, MRPC, QNLI, RTE, SST-2, STS-B, WNLI)

Contaminated model(s): NA

Contaminated corpora: allenai/c4

Contaminated split(s): All test splits

Briefly describe your method to detect data contamination

Data-based approach
Model-based approach

Description of your method, 3-4 sentences. Evidence of data contamination (Read below):
The approach was simple: exact matches, normalized for capitalization and punctuation. For more details, see Section 4.2 of https://arxiv.org/abs/2104.08758
For evidence of contamination, see the original paper.
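
To make the matching idea concrete, here is a minimal sketch of exact matching after normalizing capitalization and punctuation. It is an illustration only, not the paper's code: the function names `normalize` and `find_contaminated` are hypothetical, and stripping only ASCII punctuation, collapsing whitespace, and matching on word boundaries are assumptions. See Section 4.2 of the paper for the actual procedure.

```python
import string

# Translation table that deletes ASCII punctuation. This is an assumption;
# the paper's exact normalization rules may differ.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def find_contaminated(test_examples, corpus_documents):
    """Return the test examples whose normalized text occurs in any document.

    Both sides are padded with spaces so a match must start and end on a
    word boundary (a hypothetical design choice for this sketch).
    """
    padded_docs = [f" {normalize(doc)} " for doc in corpus_documents]
    hits = []
    for example in test_examples:
        needle = normalize(example)
        # Skip examples that normalize to the empty string.
        if needle and any(f" {needle} " in doc for doc in padded_docs):
            hits.append(example)
    return hits


if __name__ == "__main__":
    docs = ["Breaking news: The cat sat on the mat!", "Unrelated text."]
    tests = ["the cat sat on the mat", "A dog barked."]
    print(find_contaminated(tests, docs))
```

A linear scan like this only works for toy inputs. At the scale of allenai/c4, some form of indexing or hashing of the normalized text would be needed.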

Citation

Yes, here is the link:
URL: https://arxiv.org/pdf/2104.08758.pdf
Citation:



@article{dodge2021documenting,
  title={Documenting large webtext corpora: A case study on the colossal clean crawled corpus},
  author={Dodge, Jesse and Sap, Maarten and Marasovi{\'c}, Ana and Agnew, William and Ilharco, Gabriel and Groeneveld, Dirk and Mitchell, Margaret and Gardner, Matt},
  journal={arXiv preprint arXiv:2104.08758},
  year={2021}
}

Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.

Full name: Vishaal Udandarao
Institution: University of Tuebingen, University of Cambridge
Email: vu214@cam.ac.uk

Add data from "Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus" (ad06fdcd)

Iker

Workshop on Data Contamination org Apr 18, 2024

@vishaal27 Thank you! Merged :D

Iker changed pull request status to merged Apr 18, 2024
