Add reports from Time Travel In LLMs paper
What are you reporting:
- Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
- Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)
Evaluation dataset(s):
- imdb
- ag_news
- yelp_review_full
- nyu-mll/glue (rte)
- nyu-mll/glue (wnli)
- samsum
- xsum
Contaminated model(s):
- GPT-4
- GPT-3.5
Contaminated corpora: None
Contaminated split(s): Train and test splits
Briefly describe your method to detect data contamination
- Data-based approach
- Model-based approach
Description of your method, 3-4 sentences. Evidence of data contamination (Read below):
The method relied on manual review of GPT-3.5 and GPT-4 outputs. The models were prompted with guided instructions to reproduce instances of a given dataset, for instance:
Instruction: You are provided with Sentence 1 from the validation split of the WNLI dataset. Finish Sentence 2 as appeared in the dataset. Sentence 2 must exactly match the instance in the dataset.
Sentence 1: The dog chased the cat, which ran up a tree. It waited at the top.
Label: 1 (entailment)
Sentence 2:
The cat waited at the top.
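To make the procedure concrete, here is a minimal sketch of how such a guided instruction could be assembled and how a completion could be compared with the reference instance. The helper names (`build_guided_prompt`, `is_exact_match`) are hypothetical, no model is called, and the automatic exact-match check is a simplification standing in for the manual review described above.

```python
def build_guided_prompt(dataset: str, split: str, sentence1: str, label: str) -> str:
    """Assemble a guided instruction in the format quoted in this report."""
    return (
        f"Instruction: You are provided with Sentence 1 from the {split} split "
        f"of the {dataset} dataset. Finish Sentence 2 as appeared in the dataset. "
        "Sentence 2 must exactly match the instance in the dataset.\n"
        f"Sentence 1: {sentence1}\n"
        f"Label: {label}\n"
        "Sentence 2:"
    )


def normalize(text: str) -> str:
    """Collapse whitespace and ignore case so trivial differences do not count."""
    return " ".join(text.split()).casefold()


def is_exact_match(completion: str, reference: str) -> bool:
    """Return True when the completion reproduces the reference instance."""
    return normalize(completion) == normalize(reference)


prompt = build_guided_prompt(
    dataset="WNLI",
    split="validation",
    sentence1="The dog chased the cat, which ran up a tree. It waited at the top.",
    label="1 (entailment)",
)

# Completion quoted in this report; in practice it would come from the model.
completion = "The cat waited at the top."
# Assumption: placeholder for Sentence 2 as stored in the dataset.
reference = "The cat waited at the top."

print(prompt)
print("Exact match:", is_exact_match(completion, reference))
```

An exact match on a held-out instance is only a signal; a reviewer would still inspect the outputs, since a short Sentence 2 can sometimes be guessed from Sentence 1 and the label alone.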
Citation
Is there a paper that reports the data contamination or describes the method used to detect data contamination?
URL: https://openreview.net/forum?id=2Rwq6c3tvr
Citation:
@inproceedings{
golchin2024time,
title={Time Travel in {LLM}s: Tracing Data Contamination in Large Language Models},
author={Shahriar Golchin and Mihai Surdeanu},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=2Rwq6c3tvr}
}