Add Reports Based on "Llemma: An Open Language Model For Mathematics"

#23
by wlchen - opened

What are you reporting:

  • Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
  • Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)

Evaluation dataset(s):

  • hendrycks/competition_math
  • gsm8k

Contaminated model(s):

  • EleutherAI/llemma_7b
  • EleutherAI/llemma_34b

Contaminated corpora:

  • EleutherAI/proof-pile-2

Contaminated split(s):

  • hendrycks/competition_math: 7.72 (%) of test split
  • gsm8k: 0.15 (%) of test split

Briefly describe your method to detect data contamination

  • Data-based approach
  • Model-based approach

Description of your method, 3-4 sentences. Evidence of data contamination (Read below):

Data-based approaches

According to Section 3.5 of Azerbayev et al. (2024), the authors check whether any 30-gram in a test sequence (either an input problem or an output solution) occurs in any document of Proof-Pile-2, the pre-training corpus used to train the LLEMMA models. Based on the exact numbers reported in the left part of Table 6, we estimate the worst case, in which the hits on input problems and output solutions belong to non-overlapping test instances. Under that assumption, the contaminated share of the MATH test split is (348 + 34 + 3 + 1) / 5000 = 386 / 5000 = 7.72%, and the contaminated share of the GSM8k test split is (2 + 0 + 0 + 0) / 1319 = 2 / 1319 ≈ 0.15%.
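For readers who want to reproduce the arithmetic, here is a minimal Python sketch of the two steps described above: a 30-gram overlap check and the worst-case contamination rate. It is not the authors' implementation. The function names are hypothetical, and whitespace tokenization is an assumption made only for illustration; the hit counts are the ones quoted from Table 6.

```python
def ngrams(tokens, n):
    """Return the set of all contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def has_ngram_overlap(test_text, document, n=30):
    """True if any n-gram of test_text also occurs in document.

    Assumption: whitespace tokenization, chosen only for illustration.
    """
    test_grams = ngrams(test_text.split(), n)
    doc_grams = ngrams(document.split(), n)
    return not test_grams.isdisjoint(doc_grams)


def worst_case_rate(hit_counts, split_size):
    """Upper bound (in percent) on contaminated test examples.

    Summing the counts assumes every hit is a distinct test example,
    i.e. problem hits and solution hits never coincide.
    """
    return 100.0 * sum(hit_counts) / split_size


# Hit counts as quoted in the report from Table 6 of the paper.
math_rate = worst_case_rate([348, 34, 3, 1], 5000)
gsm8k_rate = worst_case_rate([2, 0, 0, 0], 1319)

print(f"MATH:  {math_rate:.2f}% of test split")
print(f"GSM8k: {gsm8k_rate:.2f}% of test split")
```

Because the counts are summed without deduplication, the result is an upper bound: if a problem and its solution both match the corpus, that test example is counted twice.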

Citation

URL:

https://openreview.net/pdf?id=4WnqRR915j

Citation:

@inproceedings{
   azerbayev2024llemma,
   title={Llemma: An Open Language Model for Mathematics},
   author={Zhangir Azerbayev and Hailey Schoelkopf and Keiran Paster and Marco Dos Santos and Stephen Marcus McAleer and Albert Q. Jiang and Jia Deng and Stella Biderman and Sean Welleck},
   booktitle={The Twelfth International Conference on Learning Representations},
   year={2024},
   url={https://openreview.net/forum?id=4WnqRR915j}
}

Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.
1.

  • Full name: Wei-Lin Chen
  • Institution: National Taiwan University, University of Virginia
  • Email: tuy8sy@virginia.edu

2.

Workshop on Data Contamination org

Hi @wlchen !

Thank you for your contributions. I have added the PR number and done some post-processing to make the entry consistent with the rest of the entries.

I will merge the changes to main :)

Best,
Oscar

OSainz changed pull request status to merged
