Eval models for data contamination?

#561
by liyucheng - opened

Well, I think it's fair to say that data contamination is undermining the reliability of the leaderboard.

I just ran an experiment testing Baichuan2, Qwen, and LLaMA-30B on clean and contaminated subsets of the test sets.

The results are kind of surprising: models seem to score up to ~10 points higher on the contaminated (dirty) subset.

| dataset | version | mode | model | accuracy: clean | accuracy: input contaminated | accuracy: input-and-label contaminated |
|---|---|---|---|---|---|---|
| mmlu | - | ppl | baichuan2-7b-base-hf | 56.76 | 44.69 | 54.93 |
| mmlu | - | ppl | qwen-7b-hf | 58.74 | 48.67 | 58.28 |
| mmlu | - | ppl | llama_30b_autogptq | 57.46 | 45.72 | 57.16 |
| hellaswag | 47bff9 | ppl | baichuan2-7b-base-hf | 66.87 | 57.14 | 70.97 |
| hellaswag | 47bff9 | ppl | qwen-7b-hf | 86.42 | 89.29 | 90.88 |
| hellaswag | 47bff9 | ppl | llama_30b_autogptq | 76.71 | 57.14 | 82.37 |

I just added this feature to OpenCompass. Is anyone interested in this proposal? I could do a more comprehensive analysis of data contamination.

Check out my implementation and reports here: https://github.com/liyucheng09/Contamination_Detector.

Hugging Face H4 org

Hi @liyucheng ,
Thanks for your comment!
Can you detail your methodology a bit?

@clefourrier Hi Clémentine, sorry for the late reply.

I discussed my approach with Edward Beeching in an interview.

Basically, I check benchmark examples' presence in Common Crawl and classify them into three categories:

  1. Clean.
  2. Input-only contamination: the input (question/passage) appears in Common Crawl, but not the label/answer.
  3. Input-and-label contamination: the matched page gives away both the input and the label.
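
To make the classification concrete, here is a minimal sketch. It is not my actual implementation (that lives in the repo linked above): the `pages` argument stands in for a Common Crawl lookup (a hypothetical `search_common_crawl(question)` helper), and the 0.8 fuzzy-overlap threshold is an illustrative assumption, not the value I used.

```python
from difflib import SequenceMatcher

def overlaps(needle: str, haystack: str, threshold: float = 0.8) -> bool:
    """Fuzzy check: does `needle` appear near-verbatim inside `haystack`?"""
    a, b = needle.lower(), haystack.lower()
    match = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return match.size / max(len(a), 1) >= threshold

def classify_example(question: str, answer: str, pages: list[str]) -> str:
    """Label one benchmark example given candidate web-page texts.

    `pages` would come from a Common Crawl lookup (hypothetical helper,
    e.g. search_common_crawl(question)); here it is just a list of strings.
    """
    input_hit = any(overlaps(question, page) for page in pages)
    label_hit = any(
        overlaps(question, page) and overlaps(answer, page) for page in pages
    )
    if label_hit:
        return "input-and-label contaminated"
    if input_hit:
        return "input contaminated"
    return "clean"
```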

We then calculate metrics separately on each subset to get an impression of the benchmark's contamination degree.
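
The per-subset numbers in the table above are just per-group accuracies. A sketch of that step, assuming each record carries a `label` from the classifier above and a boolean `correct` from the eval harness (the record layout is an assumption for illustration):

```python
from collections import defaultdict

def accuracy_by_contamination(records: list[dict]) -> dict[str, float]:
    """records: [{"label": "clean", "correct": True}, ...] (assumed layout)."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for record in records:
        totals[record["label"]] += 1
        hits[record["label"]] += int(record["correct"])
    return {label: hits[label] / totals[label] for label in totals}
```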

Following the existing practice in OpenCompass, it's convenient and requires no extra compute.
I have done checks for six popular QA benchmarks so far, see here.
