[FLAG] gaodrew/gaodrew-gorgonzola-13b. Suspected to have MMLU in training data

#215
by CoreyMorris - opened

It is a major outlier with respect to MMLU at 13B parameters. If it doesn't have MMLU data in its training data, then it is an incredibly impressive result.

[Image: MMLU score vs. parameter count, with gaodrew-gorgonzola-13b as a clear outlier]

It does not have similar increased performance for arc:challenge or hellaswag .

[Images: the corresponding plots for arc:challenge and hellaswag]
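For anyone who wants to reproduce this kind of check, here is roughly how such an outlier screen could be scripted. This is only a sketch: the file name and the `Model`, `Parameters`, and `MMLU` column names are placeholders, not the actual leaderboard export format.

```python
import pandas as pd

# Hypothetical CSV export of leaderboard results; the file name and the
# "Model", "Parameters", and "MMLU" column names are assumptions.
df = pd.read_csv("open_llm_leaderboard.csv")

# Bucket models by parameter count so a 13B model is compared to other ~13B models.
df["param_bucket"] = pd.cut(
    df["Parameters"],
    bins=[0, 1, 4, 8, 15, 40, 80, 1000],
    labels=["<1B", "1-4B", "4-8B", "8-15B", "15-40B", "40-80B", ">80B"],
)

# Flag models whose MMLU score sits far above the rest of their bucket (simple z-score rule).
for bucket, group in df.groupby("param_bucket", observed=True):
    mean, std = group["MMLU"].mean(), group["MMLU"].std()
    if pd.isna(std) or std == 0:
        continue
    outliers = group[(group["MMLU"] - mean) / std > 3]
    for _, row in outliers.iterrows():
        print(f"{row['Model']} ({bucket}): MMLU {row['MMLU']:.1f} vs bucket mean {mean:.1f}")
```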

Open LLM Leaderboard org
edited Aug 23, 2023

Hi! Great analysis, thank you!
For the sake of fairness, could you open an issue on their repo and ask the model creators what their model was trained on?

Gorgonzola is based on our Trurl model, without any mention of it on the model card...
https://huggingface.co/gaodrew/gaodrew-gorgonzola-13b/blob/main/config.json
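(For reference, one quick way to check this in bulk: the `_name_or_path` field that transformers writes into config.json often points at the checkpoint the model was fine-tuned from. A minimal sketch, assuming `huggingface_hub` is installed; the field is not always set, so it is only a hint:)

```python
import json
from huggingface_hub import hf_hub_download

# Pull config.json from the Hub and read _name_or_path, which often reveals
# the checkpoint the model was fine-tuned from (when the author left it set).
config_path = hf_hub_download("gaodrew/gaodrew-gorgonzola-13b", "config.json")
with open(config_path) as f:
    config = json.load(f)

print(config.get("_name_or_path", "not set"))  # per the link above, this shows a Trurl path
```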

Yup! Opened an issue: https://huggingface.co/gaodrew/gaodrew-gorgonzola-13b/discussions/1 . Thanks to @Wojx for spotting that it is based on the Trurl model. I still asked them in the issue to confirm that this is the case.

Ok, I am really liking these analysis charts. Quick questions though: what is the range for an outlier, and do we have some box plots too?

  • For Chart 3: could the 3 dots (representing models) on the Hellaswag X axis also be treated as outliers if we just invert the axis with model parameters, like in Chart 1?

Also, could you advise how you did this? I am thinking of running this kind of analysis on models which are trained on synthetic datasets, all derivatives of [flan v2](https://github.com/google-research/FLAN/tree/main/flan/v2) like Orca-Minis-v1, Dolphin, Open-Orca, or any other derivatives.

This one seems to be crazy high only on truthfulqa_mc, 55.42 compared to all other models in this range. The interesting part is that according to config.json the base model is RedPajama-INCITE-Chat-3B-v1, which has all other evals higher than this model besides truthfulqa_mc. So yeah, I think it is suspicious.

https://huggingface.co/Fredithefish/ReasonixPajama-3B-HF

[Screenshot: evaluation results for ReasonixPajama-3B-HF]

According to the config it is trained on https://huggingface.co/togethercomputer/RedPajama-INCITE-Chat-3B-v1 , which has all metrics higher than this model except truthfulqa_mc.

[Screenshot: evaluation results for RedPajama-INCITE-Chat-3B-v1]
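A rough sketch of the kind of side-by-side comparison being described here (the scores below are placeholders, not the real leaderboard numbers):

```python
# Compare a fine-tune against its claimed base model: a large jump on a single
# benchmark while everything else stays flat or drops is the suspicious pattern.
# The scores below are placeholders, not the actual leaderboard values.
base = {"arc_challenge": 42.0, "hellaswag": 67.0, "mmlu": 27.0, "truthfulqa_mc": 34.0}
fine_tuned = {"arc_challenge": 40.0, "hellaswag": 65.0, "mmlu": 26.0, "truthfulqa_mc": 55.4}

for benchmark, base_score in base.items():
    delta = fine_tuned[benchmark] - base_score
    flag = "  <-- suspicious jump" if delta > 10 else ""
    print(f"{benchmark:15s} base={base_score:5.1f} tuned={fine_tuned[benchmark]:5.1f} delta={delta:+6.1f}{flag}")
```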

I opened an issue on the original model repo, let's wait for the response:

https://huggingface.co/Fredithefish/ReasonixPajama-3B-HF/discussions/1

One more possible candidate in the 7B category. I opened an issue asking for details => https://huggingface.co/TigerResearch/tigerbot-7b-sft-v1/discussions/1

[Screenshot: evaluation results for tigerbot-7b-sft-v1]

Ok, the author of ReasonixPajama-3B-HF replied and confirmed that parts of the ARC and TruthfulQA datasets were used. Please see the comments on the thread.

@clefourrier: let us know what the next steps should be for this model on the leaderboard.
[Screenshot: author's reply confirming that ARC and TruthfulQA data were used]

Regarding the questions above about box plots and the outlier range: I didn't create box plots for this, but it wouldn't be difficult to do. Hugging Face released the data from the evaluations, so I used that to create the plots in a Hugging Face Space here: https://huggingface.co/spaces/CoreyMorris/MMLU-by-task-Leaderboard . You can view the code for that Space, and there is a link to download a CSV of the data there as well.
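For what it's worth, a box plot plus the usual 1.5x IQR outlier fence could be produced from that CSV along these lines (the file and column names below are assumptions about the export, not the Space's actual code):

```python
import pandas as pd
import matplotlib.pyplot as plt

# CSV downloaded from the MMLU-by-task leaderboard Space; "Model" and
# "MMLU_average" are assumed column names, not necessarily the real ones.
df = pd.read_csv("mmlu_by_task_results.csv")

# Conventional outlier range: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df["MMLU_average"].quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
print("Models above the upper fence:")
print(df.loc[df["MMLU_average"] > upper_fence, ["Model", "MMLU_average"]])

# The box plot draws those same points individually beyond the whiskers.
df.boxplot(column="MMLU_average")
plt.title("MMLU average across models")
plt.show()
```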

From just the evaluation results, you can't know for sure which model was trained on evaluation data. I tried briefly to find more potential signals in the detailed results, but I couldn't find anything that seemed reliable. There are other techniques I saw people mention, including modifying the evaluation questions somewhat, but I didn't look into them further. If you do find some easy and reliable ways to detect a model being trained on evaluation data, let me know :)
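One heuristic along the "modify the questions" line, purely as a sketch of the idea rather than a validated method (`evaluate_model` is a hypothetical stand-in for whatever eval harness you use):

```python
# Idea: a model that memorized a benchmark should lose noticeably more accuracy
# on lightly paraphrased questions than a model that genuinely generalizes.
# This is a weak signal at best, not proof of contamination.

def evaluate_model(model_name: str, questions: list[str]) -> float:
    """Return accuracy on the given questions (hypothetical stand-in for a real harness)."""
    raise NotImplementedError("plug in lm-evaluation-harness or your own eval loop here")

def memorization_signal(model_name: str, original: list[str], paraphrased: list[str]) -> float:
    """Accuracy drop from original to paraphrased questions; a large drop hints at memorization."""
    return evaluate_model(model_name, original) - evaluate_model(model_name, paraphrased)
```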

Open LLM Leaderboard org

@psmathur Thank you for your checks!
The best would be to create a dedicated issue for each model on the leaderboard, where you provide the link to the model, a description of the problem, and a capture of the model info/author response (just like you did above, but in a dedicated issue so I can link to it), and I'll flag the model.

The author confirmed it is a derivative of Trurl: https://huggingface.co/gaodrew/gaodrew-gorgonzola-13b/discussions/1 . This can be closed whenever gaodrew/gaodrew-gorgonzola-13b is added to the list of contaminated models.

Open LLM Leaderboard org

Thank you very much for keeping up with this!
Flagged, closing :)

clefourrier changed discussion status to closed
