Add reports from Time Travel In LLMs paper

#3
by OSainz - opened
Workshop on Data Contamination org
•
edited Mar 25

What are you reporting:

  • Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
  • Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)

Evaluation dataset(s):

  • imdb
  • ag_news
  • yelp_review_full
  • nyu-mll/glue (rte)
  • nyu-mll/glue (wnli)
  • samsum
  • xsum

Contaminated model(s):

  • GPT-4
  • GPT-3.5

Contaminated corpora: None

Contaminated split(s): Train and test splits

Briefly describe your method to detect data contamination

  • Data-based approach
  • Model-based approach

Description of your method, 3-4 sentences. Evidence of data contamination (Read below):
The method employed manual review of GPT-3.5 and GPT-4 outputs. These models were prompted with guided instructions that name the dataset and split and ask the model to complete a partial instance from it, for instance:

Instruction: You are provided with Sentence 1 from the validation split of the WNLI dataset. Finish Sentence 2 as appeared in the dataset. Sentence 2 must exactly match the instance in the dataset.
Sentence 1: The dog chased the cat, which ran up a tree. It waited at the top.
Label: 1 (entailment)
Sentence 2:

The cat waited at the top.
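The guided-instruction check above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the prompt template mirrors the WNLI example, the model call is replaced by a placeholder completion string, and the exact-match criterion is a simplifying assumption (the paper also handles near-exact matches).

```python
# Sketch of a guided-instruction contamination probe, assuming the prompt
# template shown above. The LLM call is mocked with a fixed string; in
# practice this would query GPT-3.5 or GPT-4.

def build_guided_prompt(dataset: str, split: str, sentence1: str, label: str) -> str:
    """Assemble a guided instruction naming the dataset and split, then
    ask the model to reproduce the rest of the instance verbatim."""
    return (
        f"Instruction: You are provided with Sentence 1 from the {split} "
        f"split of the {dataset} dataset. Finish Sentence 2 as appeared in "
        f"the dataset. Sentence 2 must exactly match the instance in the dataset.\n"
        f"Sentence 1: {sentence1}\n"
        f"Label: {label}\n"
        f"Sentence 2:"
    )

def is_exact_match(model_output: str, reference: str) -> bool:
    """Flag contamination when the completion reproduces the reference
    instance verbatim (after trimming surrounding whitespace)."""
    return model_output.strip() == reference.strip()

# The WNLI instance from the report above.
prompt = build_guided_prompt(
    dataset="WNLI",
    split="validation",
    sentence1="The dog chased the cat, which ran up a tree. It waited at the top.",
    label="1 (entailment)",
)
# Placeholder standing in for the LLM's completion:
completion = "The cat waited at the top."
print(is_exact_match(completion, "The cat waited at the top."))  # True
```

An exact reproduction of a held-out instance is strong evidence the instance was seen in training; in practice a softer overlap criterion is needed to catch paraphrased memorization as well.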

Citation

Is there a paper that reports the data contamination or describes the method used to detect data contamination?

URL: https://openreview.net/forum?id=2Rwq6c3tvr
Citation:

@inproceedings{golchin2024time,
  title={Time Travel in {LLM}s: Tracing Data Contamination in Large Language Models},
  author={Shahriar Golchin and Mihai Surdeanu},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=2Rwq6c3tvr}
}
OSainz changed pull request status to merged
