I have proof that shows the evals shouldn't be trusted

#116
by breadlicker45 - opened

Seen here are my models, from Bread AI. My models are around 160-200M parameters.

PM_modelv2 (a badly fine-tuned prompt-making model trained on an M40 GPU I own) beats a 60B model, which is clearly false.

musePy is basically a random number generator.

breadlicker45 changed discussion title from i have proof that show the evals are wrong and shouldn to i have proof that show the evals shouldn't be trusted

Lol. Also, ehartford/WizardLM-30B-Uncensored has about 60 points while WizardLM/WizardLM-30B-V1.0 has only 30. I know the two models are different, but it seems hardly possible for there to be such a huge performance difference.
Several weeks ago the HF team did say they were rewriting code to fix previous eval errors, or something along those lines.

Hugging Face H4 org

Hi @zmcmcc and @breadlicker45 !
We have re-run all models to use the new fixed MMLU evals from the Harness, and are currently re-running some Llama scores.

Did you know that you can actually reproduce our results by launching the commands listed in the About section? If there are models you feel unsure about, feel free to double check them by re-running the evals and giving us your results!
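
For anyone who wants to try that, here is a minimal sketch of what such a re-run could look like through the harness's Python entry point. It assumes a recent release of lm-evaluation-harness (where `lm_eval.simple_evaluate` is exposed at the top level) and uses the model id as written in this thread purely as a placeholder, so adjust the model name, task, and few-shot count to match the commands in the About section.

```python
# Hedged sketch: re-running a single leaderboard task locally with
# EleutherAI's lm-evaluation-harness (pip install lm-eval). The API below
# follows recent harness releases; older releases expose the same call as
# lm_eval.evaluator.simple_evaluate with slightly different model/task names.
import json

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face causal-LM backend
    model_args="pretrained=BreadAI/PM_model_V2",  # placeholder: model id as written in this thread
    tasks=["hellaswag"],                          # one of the leaderboard tasks
    num_fewshot=10,                               # the leaderboard runs HellaSwag 10-shot
    batch_size=8,
)

# Print only the aggregated metrics so they can be compared to the leaderboard numbers.
print(json.dumps(results["results"], indent=2, default=str))
```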

Sure. Run musePy and you will see the score is wrong just by using your own judgment. Use your brain, not the evals, and you can see it is wrong.

The backend is https://github.com/EleutherAI/lm-evaluation-harness, as confirmed by Stella.

Hugging Face H4 org
edited Jul 19, 2023

Hi!
Just checked your model's results (BreadAI/PM_model_V2) in more depth, and you actually get random scores on all evals (around 25%, the random baseline) except for TruthfulQA, which has a slightly unbalanced answer distribution; I suspect your random generator got lucky there!
A good way of pointing out that evals, especially for scores so low, do not tell the whole story :)
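
To make the "lucky random baseline" point concrete, here is a toy simulation (not the harness's actual TruthfulQA metric, which scores probability mass over answer choices) showing why a model that answers essentially at random sits near 25% on a balanced four-choice task, yet can land noticeably above 25% when the answer key is skewed and the guesser's bias happens to lean the same way. All names and numbers below are illustrative assumptions.

```python
# Illustrative only: a random guesser on a 4-choice benchmark scores ~25% when
# the answer key is balanced, but a guesser whose bias happens to match a
# skewed answer key can score well above the 25% "random" baseline.
import random

random.seed(0)
N_QUESTIONS = 1000
CHOICES = ["A", "B", "C", "D"]

def accuracy(answer_key, guess_weights):
    """Accuracy of a guesser that samples options with the given weights."""
    guesses = random.choices(CHOICES, weights=guess_weights, k=len(answer_key))
    return sum(g == a for g, a in zip(guesses, answer_key)) / len(answer_key)

# Balanced benchmark: every option is correct about equally often.
balanced_key = random.choices(CHOICES, k=N_QUESTIONS)
# Skewed benchmark: option "A" is correct 40% of the time.
skewed_key = random.choices(CHOICES, weights=[0.40, 0.20, 0.20, 0.20], k=N_QUESTIONS)

print("uniform guesser, balanced key:", accuracy(balanced_key, [1, 1, 1, 1]))        # ~0.25
print("uniform guesser, skewed key:  ", accuracy(skewed_key, [1, 1, 1, 1]))          # ~0.25
print("A-biased guesser, skewed key: ", accuracy(skewed_key, [0.7, 0.1, 0.1, 0.1]))  # ~0.34
```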

clefourrier changed discussion status to closed
