CONDA-Workshop/Data-Contamination-Database · Add Reports Based on "Llemma: An Open Language Model For Mathematics"

wlchen

May 11, 2024


edited May 11, 2024

What are you reporting:

Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)

Evaluation dataset(s):

hendrycks/competition_math
gsm8k

Contaminated model(s):

EleutherAI/llemma_7b
EleutherAI/llemma_34b

Contaminated corpora:

EleutherAI/proof-pile-2

Contaminated split(s):

hendrycks/competition_math: 7.72 (%) of test split
gsm8k: 0.15 (%) of test split

Briefly describe your method to detect data contamination

Data-based approach
Model-based approach

Description of your method, 3-4 sentences. Evidence of data contamination (Read below):

Data-based approaches

According to Section 3.5 of Azerbayev et al. (2024), the authors check whether any 30-gram in a test sequence (either an input problem or an output solution) occurs in any document of Proof-Pile-2, the pre-training corpus used to train the LLEMMA models. Based on the exact counts reported in the left part of Table 6, we estimate the worst case, assuming the hits on input problems and output solutions fall on non-overlapping test instances. Under that assumption, the contaminated share of the MATH test split is (348 + 34 + 3 + 1) / 5000 = 386 / 5000 = 7.72%, and the contaminated share of the GSM8k test split is (2 + 0 + 0 + 0) / 1319 = 2 / 1319 ≈ 0.15%.
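For readers who want to reproduce the arithmetic, here is a minimal Python sketch of the two steps described above: a 30-gram overlap check and the worst-case percentage. It is an illustration only, not the Llemma authors' implementation. The helper names (`ngrams`, `is_contaminated`, `worst_case_rate`) and the whitespace tokenization are assumptions made for this example; only the hit counts and test-split sizes come from the report.

```python
from typing import Iterable, Sequence, Set, Tuple


def ngrams(text: str, n: int = 30) -> Set[Tuple[str, ...]]:
    """Return the set of n-grams in `text`.

    Assumption: tokens are whitespace-separated words; the tokenization
    used in the paper may differ.
    """
    tokens = text.split()
    # The range is empty when the text has fewer than n tokens.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(test_sequence: str, corpus_docs: Iterable[str], n: int = 30) -> bool:
    """True if any n-gram of the test sequence occurs in any corpus document."""
    test_grams = ngrams(test_sequence, n)
    if not test_grams:
        return False
    return any(test_grams & ngrams(doc, n) for doc in corpus_docs)


def worst_case_rate(hit_counts: Sequence[int], test_size: int) -> float:
    """Worst-case contaminated percentage: every hit is a distinct test instance."""
    return 100.0 * sum(hit_counts) / test_size


# Hit counts as listed in this report (left part of Table 6).
math_rate = worst_case_rate([348, 34, 3, 1], 5000)
gsm8k_rate = worst_case_rate([2, 0, 0, 0], 1319)

print(f"MATH:  {math_rate:.2f}% of test split")   # 7.72
print(f"GSM8k: {gsm8k_rate:.2f}% of test split")  # 0.15
```

Because the hits are summed without deduplication, the result is an upper bound: if a problem and its solution both match, that test instance is counted twice.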

Citation

URL:

https://openreview.net/pdf?id=4WnqRR915j

Citation:

@inproceedings{
   azerbayev2024llemma,
   title={Llemma: An Open Language Model for Mathematics},
   author={Zhangir Azerbayev and Hailey Schoelkopf and Keiran Paster and Marco Dos Santos and Stephen Marcus McAleer and Albert Q. Jiang and Jia Deng and Stella Biderman and Sean Welleck},
   booktitle={The Twelfth International Conference on Learning Representations},
   year={2024},
   url={https://openreview.net/forum?id=4WnqRR915j}
}

Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.
1.

Full name: Wei-Lin Chen
Institution: National Taiwan University, University of Virginia
Email: tuy8sy@virginia.edu

2.

Full name: Yu-Min Tseng
Institution: National Taiwan University
Email: ymtseng@nlg.csie.ntu.edu.tw

Add Reports Based on "Llemma: An Open Language Model For Mathematics" (c50904f0)

Add PR number + Postprocessing (582a8ca7)

OSainz

Workshop on Data Contamination org May 13, 2024

Hi @wlchen!

Thank you for your contributions. I have added the PR number and done some post-processing to make the entry consistent with the rest of the entries.

I will merge the changes to main :)

Best,
Oscar

OSainz changed pull request status to merged May 13, 2024