[FLAG] gaodrew/gaodrew-gorgonzola-13b. Suspected to have MMLU in training data

#215
by CoreyMorris - opened

It is a major outlier with respect to MMLU at 13B parameters. If it doesn't have MMLU data in its training data, then it is an incredibly impressive result.

[Image: MMLU score vs. parameter count, with gaodrew-gorgonzola-13b as a clear outlier]

It does not have similar increased performance for arc:challenge or hellaswag .

[Images: the corresponding plots for arc:challenge and hellaswag]
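For anyone who wants to reproduce this kind of check, here is roughly how such an outlier screen could be scripted. This is only a sketch: the file name and the `Model`, `Parameters`, and `MMLU` column names are placeholders, not the actual leaderboard export format.

```python
import pandas as pd

# Hypothetical CSV export of leaderboard results; the file name and the
# "Model", "Parameters", and "MMLU" column names are assumptions.
df = pd.read_csv("open_llm_leaderboard.csv")

# Bucket models by parameter count so a 13B model is compared to other ~13B models.
df["param_bucket"] = pd.cut(
    df["Parameters"],
    bins=[0, 1, 4, 8, 15, 40, 80, 1000],
    labels=["<1B", "1-4B", "4-8B", "8-15B", "15-40B", "40-80B", ">80B"],
)

# Flag models whose MMLU score sits far above the rest of their bucket (simple z-score rule).
for bucket, group in df.groupby("param_bucket", observed=True):
    mean, std = group["MMLU"].mean(), group["MMLU"].std()
    if pd.isna(std) or std == 0:
        continue
    outliers = group[(group["MMLU"] - mean) / std > 3]
    for _, row in outliers.iterrows():
        print(f"{row['Model']} ({bucket}): MMLU {row['MMLU']:.1f} vs bucket mean {mean:.1f}")
```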

Open LLM Leaderboard org
edited Aug 23, 2023

Hi! Great analysis, thank you!
For the sake of fairness, could you open an issue on their repo and ask the model creators what their model was trained on?

Gorgonzola is based on our Trurl model, without any mention of it on the model card...
https://huggingface.co/gaodrew/gaodrew-gorgonzola-13b/blob/main/config.json
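(For reference, one quick way to check this in bulk: the `_name_or_path` field that transformers writes into config.json often points at the checkpoint the model was fine-tuned from. A minimal sketch, assuming `huggingface_hub` is installed; the field is not always set, so it is only a hint:)

```python
import json
from huggingface_hub import hf_hub_download

# Pull config.json from the Hub and read _name_or_path, which often reveals
# the checkpoint the model was fine-tuned from (when the author left it set).
config_path = hf_hub_download("gaodrew/gaodrew-gorgonzola-13b", "config.json")
with open(config_path) as f:
    config = json.load(f)

print(config.get("_name_or_path", "not set"))  # per the link above, this shows a Trurl path
```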

Yup! Opened an issue: https://huggingface.co/gaodrew/gaodrew-gorgonzola-13b/discussions/1 . Thanks to @Wojx for spotting that it is based on the Trurl model. I still asked them in the issue to confirm that this is the case.

Ok, I am really liking these analysis charts. Quick questions though: what is the range for an outlier, and do we have some box plots too?

  • For Chart 3: could the 3 dots (representing models) on the Hellaswag X axis also be treated as outliers if we just invert the axis with model parameters, like in Chart 1?

Also, could you advise how you did this? I am thinking of running this kind of analysis on models which are trained on synthetic datasets, all derivatives of [flan v2](https://github.com/google-research/FLAN/tree/main/flan/v2) like Orca-Minis-v1, Dolphin, Open-Orca, or any other derivatives.

This one seems to be crazy high only on truthfulqa_mc, 55.42 compared to all other models in this range. The interesting part is that according to config.json the base model is RedPajama-INCITE-Chat-3B-v1, which has all other evals higher than this model besides truthfulqa_mc. So yeah, I think it is suspicious.

https://huggingface.co/Fredithefish/ReasonixPajama-3B-HF

[Screenshot: evaluation results for ReasonixPajama-3B-HF]

According to the config it is trained on https://huggingface.co/togethercomputer/RedPajama-INCITE-Chat-3B-v1 , which has all metrics higher than this model except truthfulqa_mc.

[Screenshot: evaluation results for RedPajama-INCITE-Chat-3B-v1]
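A rough sketch of the kind of side-by-side comparison being described here (the scores below are placeholders, not the real leaderboard numbers):

```python
# Compare a fine-tune against its claimed base model: a large jump on a single
# benchmark while everything else stays flat or drops is the suspicious pattern.
# The scores below are placeholders, not the actual leaderboard values.
base = {"arc_challenge": 42.0, "hellaswag": 67.0, "mmlu": 27.0, "truthfulqa_mc": 34.0}
fine_tuned = {"arc_challenge": 40.0, "hellaswag": 65.0, "mmlu": 26.0, "truthfulqa_mc": 55.4}

for benchmark, base_score in base.items():
    delta = fine_tuned[benchmark] - base_score
    flag = "  <-- suspicious jump" if delta > 10 else ""
    print(f"{benchmark:15s} base={base_score:5.1f} tuned={fine_tuned[benchmark]:5.1f} delta={delta:+6.1f}{flag}")
```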

I opened an issue on the original model repo, let's wait for the response:

https://huggingface.co/Fredithefish/ReasonixPajama-3B-HF/discussions/1

One more possible candidate in the 7B category. I opened an issue asking for details => https://huggingface.co/TigerResearch/tigerbot-7b-sft-v1/discussions/1

[Screenshot: evaluation results for tigerbot-7b-sft-v1]

Ok, the author of ReasonixPajama-3B-HF replied and confirmed that parts of the ARC and TruthfulQA datasets were used. Please see the comments on the thread.

@clefourrier: let us know what the next steps should be for this model on the leaderboard.
[Screenshot: author's reply confirming that ARC and TruthfulQA data were used]

Regarding the questions above about box plots and the outlier range: I didn't create box plots for this, but it wouldn't be difficult to do. Hugging Face released the data from the evaluations, so I used that to create the plots in a Hugging Face Space here: https://huggingface.co/spaces/CoreyMorris/MMLU-by-task-Leaderboard . You can view the code for that Space, and there is a link to download a CSV of the data there as well.
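For what it's worth, a box plot plus the usual 1.5x IQR outlier fence could be produced from that CSV along these lines (the file and column names below are assumptions about the export, not the Space's actual code):

```python
import pandas as pd
import matplotlib.pyplot as plt

# CSV downloaded from the MMLU-by-task leaderboard Space; "Model" and
# "MMLU_average" are assumed column names, not necessarily the real ones.
df = pd.read_csv("mmlu_by_task_results.csv")

# Conventional outlier range: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df["MMLU_average"].quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
print("Models above the upper fence:")
print(df.loc[df["MMLU_average"] > upper_fence, ["Model", "MMLU_average"]])

# The box plot draws those same points individually beyond the whiskers.
df.boxplot(column="MMLU_average")
plt.title("MMLU average across models")
plt.show()
```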

From just the evaluation results, you can't know for sure which model was trained on evaluation data. I tried briefly to find more potential signals in the detailed results, but I couldn't find anything that seemed reliable. There are other techniques I saw people mention, including modifying the evaluation questions somewhat, but I didn't look into them further. If you do find some easy and reliable ways to detect a model being trained on evaluation data, let me know :)
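One heuristic along the "modify the questions" line, purely as a sketch of the idea rather than a validated method (`evaluate_model` is a hypothetical stand-in for whatever eval harness you use):

```python
# Idea: a model that memorized a benchmark should lose noticeably more accuracy
# on lightly paraphrased questions than a model that genuinely generalizes.
# This is a weak signal at best, not proof of contamination.

def evaluate_model(model_name: str, questions: list[str]) -> float:
    """Return accuracy on the given questions (hypothetical stand-in for a real harness)."""
    raise NotImplementedError("plug in lm-evaluation-harness or your own eval loop here")

def memorization_signal(model_name: str, original: list[str], paraphrased: list[str]) -> float:
    """Accuracy drop from original to paraphrased questions; a large drop hints at memorization."""
    return evaluate_model(model_name, original) - evaluate_model(model_name, paraphrased)
```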

Open LLM Leaderboard org

@psmathur Thank you for your checks!
The best would be to create a dedicated issue for each model on the leaderboard, where you provide the link to the model, a description of the problem, and a capture of the model info/author response (just like you did above, but in a dedicated issue so I can link to it), and I'll flag the model.

The author confirmed it is a derivative of Trurl: https://huggingface.co/gaodrew/gaodrew-gorgonzola-13b/discussions/1 . This can be closed whenever gaodrew/gaodrew-gorgonzola-13b is added to the list of contaminated models.

Open LLM Leaderboard org

Thank you very much for keeping up with this!
Flagged, closing :)

clefourrier changed discussion status to closed
