Add reports from Time Travel In LLMs paper

#3
by OSainz - opened
Workshop on Data Contamination org
•
edited Mar 25

What are you reporting:

  • Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
  • Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)

Evaluation dataset(s):

  • imdb
  • ag_news
  • yelp_review_full
  • nyu-mll/glue (rte)
  • nyu-mll/glue (wnli)
  • samsum
  • xsum

Contaminated model(s):

  • GPT-4
  • GPT-3.5

Contaminated corpora: None

Contaminated split(s): Train and test splits

Briefly describe your method to detect data contamination

  • Data-based approach
  • Model-based approach

Description of your method, 3-4 sentences. Evidence of data contamination (Read below):
The method employed manual review of GPT-3.5 and GPT-4 outputs. These models were prompted with guided instructions that name the dataset and split and ask the model to complete a partial instance from it, for instance:

Instruction: You are provided with Sentence 1 from the validation split of the WNLI dataset. Finish Sentence 2 as appeared in the dataset. Sentence 2 must exactly match the instance in the dataset.
Sentence 1: The dog chased the cat, which ran up a tree. It waited at the top.
Label: 1 (entailment)
Sentence 2:

The cat waited at the top.
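The guided-instruction check above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the prompt template mirrors the WNLI example, the model call is replaced by a placeholder completion string, and the exact-match criterion is a simplifying assumption (the paper also handles near-exact matches).

```python
# Sketch of a guided-instruction contamination probe, assuming the prompt
# template shown above. The LLM call is mocked with a fixed string; in
# practice this would query GPT-3.5 or GPT-4.

def build_guided_prompt(dataset: str, split: str, sentence1: str, label: str) -> str:
    """Assemble a guided instruction naming the dataset and split, then
    ask the model to reproduce the rest of the instance verbatim."""
    return (
        f"Instruction: You are provided with Sentence 1 from the {split} "
        f"split of the {dataset} dataset. Finish Sentence 2 as appeared in "
        f"the dataset. Sentence 2 must exactly match the instance in the dataset.\n"
        f"Sentence 1: {sentence1}\n"
        f"Label: {label}\n"
        f"Sentence 2:"
    )

def is_exact_match(model_output: str, reference: str) -> bool:
    """Flag contamination when the completion reproduces the reference
    instance verbatim (after trimming surrounding whitespace)."""
    return model_output.strip() == reference.strip()

# The WNLI instance from the report above.
prompt = build_guided_prompt(
    dataset="WNLI",
    split="validation",
    sentence1="The dog chased the cat, which ran up a tree. It waited at the top.",
    label="1 (entailment)",
)
# Placeholder standing in for the LLM's completion:
completion = "The cat waited at the top."
print(is_exact_match(completion, "The cat waited at the top."))  # True
```

An exact reproduction of a held-out instance is strong evidence the instance was seen in training; in practice a softer overlap criterion is needed to catch paraphrased memorization as well.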

Citation

Is there a paper that reports the data contamination or describes the method used to detect data contamination?

URL: https://openreview.net/forum?id=2Rwq6c3tvr
Citation:

@inproceedings{golchin2024time,
  title={Time Travel in {LLM}s: Tracing Data Contamination in Large Language Models},
  author={Shahriar Golchin and Mihai Surdeanu},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=2Rwq6c3tvr}
}
OSainz changed pull request status to merged
