Space: CONDA-Workshop/Data-Contamination-Database

wlchen committed on May 11, 2024

Commit c50904f (verified) · 1 Parent(s): 100cb5e

Add Reports Based on "Llemma: An Open Language Model For Mathematics"

## What are you reporting:
- [x] Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
- [ ] Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)

**Evaluation dataset(s)**:
- `hendrycks/competition_math`
- `gsm8k`

**Contaminated model(s)**:
- `EleutherAI/llemma_7b`
- `EleutherAI/llemma_34b`

**Contaminated corpora**:
- `EleutherAI/proof-pile-2`

**Contaminated split(s)**:
- `hendrycks/competition_math`: 7.72 (%) of `test` split
- `gsm8k`: 0.15 (%) of `test` split

## Briefly describe your method to detect data contamination

- [x] Data-based approach
- [ ] Model-based approach

Description of your method, 3-4 sentences. Evidence of data contamination (Read below):

#### Data-based approaches

According to Section 3.5 of [Azerbayev et al. (2024)](https://arxiv.org/abs/2310.10631), the authors check whether any 30-gram in a test sequence (either an input problem or an output solution) occurs in any document of `Proof-Pile-2`, the pre-training corpus used to train the `LLEMMA` models. Based on the exact counts reported in the *left* part of Table 6, we estimate the worst case, in which every problem hit and every solution hit falls on a distinct test example. Under that assumption, the contaminated share of the `MATH` test split is at most (348 + 34 + 3 + 1) / 5000 = 386 / 5000 = 7.72 (%), and the contaminated share of the `GSM8k` test split is at most (2 + 0 + 0 + 0) / 1319 = 2 / 1319 ≈ 0.15 (%).
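To make the check and the arithmetic above easier to verify, here is a minimal Python sketch. It is not the authors' implementation: the whitespace tokenization and the helper names (`ngrams`, `has_ngram_overlap`, `worst_case_pct`) are assumptions made for illustration. Only the hit counts and split sizes come from the numbers quoted above.

```python
def ngrams(tokens, n):
    """Return the set of all contiguous n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def has_ngram_overlap(test_text, document, n=30):
    """Return True if any n-gram of the test sequence also occurs in the document."""
    # Assumption: plain whitespace tokenization, chosen only to keep the sketch short.
    test_grams = ngrams(test_text.split(), n)
    doc_grams = ngrams(document.split(), n)
    return not test_grams.isdisjoint(doc_grams)


def worst_case_pct(hit_counts, split_size):
    """Upper bound (%) on contaminated examples: each hit counts as a distinct example."""
    return round(100 * sum(hit_counts) / split_size, 2)


# Hit counts and test split sizes as quoted in this report.
math_pct = worst_case_pct([348, 34, 3, 1], 5000)
gsm8k_pct = worst_case_pct([2, 0, 0, 0], 1319)
print(math_pct, gsm8k_pct)
```

The two printed values are the worst-case percentages reported for the `test` splits. If a problem hit and a solution hit belong to the same test example, the true share would be lower.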

## Citation

URL:
```
https://openreview.net/pdf?id=4WnqRR915j
```
Citation:
```
@inproceedings{
azerbayev2024llemma,
title={Llemma: An Open Language Model for Mathematics},
author={Zhangir Azerbayev and Hailey Schoelkopf and Keiran Paster and Marco Dos Santos and Stephen Marcus McAleer and Albert Q. Jiang and Jia Deng and Stella Biderman and Sean Welleck},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=4WnqRR915j}
}
```

*Important!* If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.
1.
- Full name: Wei-Lin Chen
- Institution: National Taiwan University, University of Virginia
- Email: tuy8sy@virginia.edu
2.
- Full name: Yu-Min Tseng
- Institution: National Taiwan University
- Email: ymtseng@nlg.csie.ntu.edu.tw

Files changed (1)

contamination_report.csv +7 -0

@@ -163,9 +163,16 @@ gigaword;;togethercomputer/RedPajama-Data-V2;;corpus;;;2.82;data-based;https://a
 gsm8k;;BAAI/Aquila2-34B;;model;;;100.0;model-based;https://huggingface.co/BAAI/Aquila2-34B/blob/main/README.md;21
 gsm8k;;BAAI/AquilaChat2-34B;;model;;;100.0;model-based;https://huggingface.co/BAAI/AquilaChat2-34B/blob/main/README.md;21
+gsm8k;;EleutherAI/llemma_7b;;model;;;0.15;data-based;https://openreview.net/pdf?id=4WnqRR915j;
+gsm8k;;EleutherAI/llemma_34b;;model;;;0.15;data-based;https://openreview.net/pdf?id=4WnqRR915j;
+gsm8k;;EleutherAI/proof-pile-2;;corpus;;;0.15;data-based;https://openreview.net/pdf?id=4WnqRR915j;
 gsm8k;;GPT-4;;model;100.0;;1.0;data-based;https://arxiv.org/abs/2303.08774;11
 gsm8k;;GPT-4;;model;79.00;;;model-based;https://arxiv.org/abs/2311.06233;8
+hendrycks/competition_math;;EleutherAI/llemma_7b;;model;;;7.72;data-based;https://openreview.net/pdf?id=4WnqRR915j;
+hendrycks/competition_math;;EleutherAI/llemma_34b;;model;;;7.72;data-based;https://openreview.net/pdf?id=4WnqRR915j;
+hendrycks/competition_math;;EleutherAI/proof-pile-2;;corpus;;;7.72;data-based;https://openreview.net/pdf?id=4WnqRR915j;
 head_qa;en;EleutherAI/pile;;corpus;;;5.11;data-based;https://arxiv.org/abs/2310.20707;2
 head_qa;en;allenai/c4;;corpus;;;5.22;data-based;https://arxiv.org/abs/2310.20707;2
 head_qa;en;oscar-corpus/OSCAR-2301;;corpus;;;5.29;data-based;https://arxiv.org/abs/2310.20707;2
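
The six added rows follow a regular pattern, so they could be generated rather than typed by hand. The sketch below is one hedged way to do that with Python's standard `csv` module. The field order is inferred from the neighbouring rows in the diff and is an assumption, as are the names `SOURCES`, `TEST_PCT`, and `build_rows`.

```python
import csv
import io

REFERENCE = "https://openreview.net/pdf?id=4WnqRR915j"

# (source name, source type) pairs taken from the report above.
SOURCES = [
    ("EleutherAI/llemma_7b", "model"),
    ("EleutherAI/llemma_34b", "model"),
    ("EleutherAI/proof-pile-2", "corpus"),
]

# Worst-case contamination (%) of the test split per evaluation dataset.
TEST_PCT = {"gsm8k": "0.15", "hendrycks/competition_math": "7.72"}


def build_rows():
    """Build one 11-field row per (dataset, source) pair."""
    rows = []
    for dataset, pct in TEST_PCT.items():
        for source, source_type in SOURCES:
            # Assumed layout: the test percentage is the 8th field; fields with
            # no reported value (including the trailing one) stay empty.
            rows.append([dataset, "", source, "", source_type,
                         "", "", pct, "data-based", REFERENCE, ""])
    return rows


buffer = io.StringIO()
csv.writer(buffer, delimiter=";", lineterminator="\n").writerows(build_rows())
csv_text = buffer.getvalue()
print(csv_text, end="")
```

The generated lines still need to be placed at the right position in `contamination_report.csv` by hand, since the file appears to be kept sorted by dataset name.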