Update contamination_report.csv
**What are you reporting:**
- [x] Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
- [ ] Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)
**Contaminated Evaluation Dataset(s):**
- openai_humaneval
- mbpp
**Contaminated Corpora:**
- EleutherAI/pile
- bigcode/the-stack
**Approach:**
- [x] Data-based approach
- [ ] Model-based approach
**Description of your method, 3-4 sentences. Evidence of data contamination:**
A test example (from MBPP or HumanEval) is marked as contaminated if its aggregated similarity score is 100, i.e., a perfect match exists at the surface or semantic level. The Levenshtein similarity score measures surface-level similarity between programs, and the Dolos toolkit, a source code plagiarism detection tool for educational purposes, measures semantic similarity between programs.
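As a rough illustration of the surface-level check, here is a minimal Python sketch. It is not the code from the cited paper: the normalization by the longer string's length, the use of `max` to aggregate the two scores, and the function names `surface_similarity` and `is_contaminated` are all assumptions made for this example. The semantic score is taken as an input and is not computed here, since no Dolos API is described in this report.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def surface_similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (assumed normalization: longer string's length)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty programs are treated as identical
    return 100.0 * (1 - levenshtein_distance(a, b) / longest)


def is_contaminated(surface: float, semantic: float) -> bool:
    """Assumed aggregation: a perfect match on either level flags the example."""
    return max(surface, semantic) >= 100.0
```

Under these assumptions, a test program would be compared against each candidate program from the corpus, and it would be flagged as soon as one candidate yields a score of 100 on either level.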
**Citation**
Is there a paper that reports the data contamination or describes the method used to detect data contamination? Yes
**url**: [https://arxiv.org/abs/2403.04811](https://arxiv.org/abs/2403.04811)
```
@article{riddell2024quantifying,
  title={Quantifying contamination in evaluating code generation capabilities of language models},
  author={Riddell, Martin and Ni, Ansong and Cohan, Arman},
  journal={arXiv preprint arXiv:2403.04811},
  year={2024}
}
```
Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.
Full name: Ameya Prabhu
Institution: Tübingen AI Center, University of Tübingen
Email: ameya@prabhu.be
- contamination_report.csv +6 -1
`@@ -462,4 +462,9 @@ bigbio/mednli;;GPT-3.5;model;0.0;0.0;0.0;model-based;https://arxiv.org/pdf/2308.`

| Old line | New line | Change | Content |
| --- | --- | --- | --- |
| 462 | 462 |  |  |
| 463 | 463 |  | RadNLI;;GPT-4;model;0.0;0.0;0.0;model-based;https://arxiv.org/pdf/2308.08493;8 |
| 464 | 464 |  | RadNLI;;GPT-3.5;model;0.0;0.0;0.0;model-based;https://arxiv.org/pdf/2308.08493;8 |
| 465 |  | - |  |
|  | 465 | + |  |
|  | 466 | + |  |
|  | 467 | + | openai_humaneval;;EleutherAI/pile;corpus;;;12.2;data-based;https://arxiv.org/abs/2403.04811; |
|  | 468 | + | mbpp;;EleutherAI/pile;corpus;;;3.6;data-based;https://arxiv.org/abs/2403.04811; |
|  | 469 | + | openai_humaneval;;bigcode/the-stack;corpus;;;18.9;data-based;https://arxiv.org/abs/2403.04811; |
|  | 470 | + | mbpp;;bigcode/the-stack;corpus;;;20.8;data-based;https://arxiv.org/abs/2403.04811; |

