CONDA-Workshop/Data-Contamination-Database · Add model-based results for MedNLI, RadNLI for GPT-3.5 and GPT-4

Apr 19, 2024

edited Apr 19, 2024

What are you reporting:

Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)

Evaluation dataset(s): Name(s) of the evaluation dataset(s). If available in the HuggingFace Hub please write the path (e.g. uonlp/CulturaX), otherwise provide a link to a paper, GitHub or dataset-card.

Contaminated model(s):

This PR reports negative results for GPT-3.5 and GPT-4.

Contaminated corpora: None

Contaminated split(s): 0% over train/dev/test (MedNLI) and dev/test (RadNLI).

Briefly describe your method to detect data contamination

Data-based approach
Model-based approach

Description of your method, 3-4 sentences. Evidence of data contamination (Read below):

(The method is the same as in PR 3.)
The only difference between this implementation and that of the original paper (Golchin and Surdeanu 2024) is that three runs were performed on each available split, to check that the results hold across different, identically sized random data partitions. In addition, the models were accessed through Azure OpenAI (opted out of human review, HIPAA-compliant), following MIMIC's DUA. For reference, a sanitized version of the results that keeps the data index, label, outputs, and contamination evaluation results, without the original input sentences, can be found here.
Although the ROUGE-based contamination detection method identified some potential positives, the best-performing detector (GPT-4 ICL) did not consider these instances to be true contamination. This PR therefore reports negative results (0% contamination for all splits on both datasets, based on the examined method).
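
As a rough illustration of the two ingredients mentioned above (repeated, identically sized random partitions and a ROUGE-style overlap score), here is a minimal sketch. It is not the code used for this PR: the function names `sample_partitions` and `rouge_l_f1`, the seeds, the partition size, and the whitespace tokenization are all assumptions made for the example, and the strings are placeholders rather than dataset text. The paper's guided and general prompting and the GPT-4 ICL judgment are not reproduced here.

```python
import random


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """LCS-based F1 over whitespace tokens (a simplified ROUGE-L)."""
    c, r = candidate.split(), reference.split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def sample_partitions(n_instances, size, n_runs=3, base_seed=0):
    """Draw n_runs random index subsets of identical size from one split.

    Each run uses its own seed (an assumption for reproducibility).
    """
    return [
        sorted(random.Random(base_seed + run).sample(range(n_instances), size))
        for run in range(n_runs)
    ]


# Hypothetical usage: three partitions of 10 indices from a 100-instance split.
partitions = sample_partitions(n_instances=100, size=10)

# Placeholder strings standing in for a model completion and its reference.
score = rouge_l_f1("the quick brown fox", "the brown fox")
```

A high overlap score on its own only marks an instance as a potential positive; as described above, such instances were then checked by the GPT-4 ICL detector before any conclusion was drawn.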

Citation

Is there a paper that reports the data contamination or describes the method used to detect data contamination?

URL: https://openreview.net/forum?id=2Rwq6c3tvr
Citation:



@article{DBLP:journals/corr/abs-2308-08493,
author       = {Shahriar Golchin and
                Mihai Surdeanu},
title        = {Time Travel in LLMs: Tracing Data Contamination in Large Language
                Models},
journal      = {CoRR},
volume       = {abs/2308.08493},
year         = {2023},
url          = {https://doi.org/10.48550/arXiv.2308.08493},
doi          = {10.48550/ARXIV.2308.08493},
eprinttype    = {arXiv},
eprint       = {2308.08493},
timestamp    = {Thu, 24 Aug 2023 12:30:27 +0200},
biburl       = {https://dblp.org/rec/journals/corr/abs-2308-08493.bib},
bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.

Full name: Jenny Chim
Institution: Queen Mary University of London
Email: c.chim@qmul.ac.uk

Add model-based results for MedNLI, RadNLI for GPT-3.5 and GPT-4 (2a036b2f)

Add PR link (ee219909)

Replace name with HF dataset (8ff4d82c)

Iker

Workshop on Data Contamination org Apr 23, 2024

Thank you @j-chim !!! Merged :D

Iker changed pull request status to merged Apr 23, 2024