Re-evaluating failed models after fixing the `NaN` weights
Would someone re-evaluate 4 of my models?
In late June, I uploaded and submitted a few models. Unfortunately, I did some bad pointer math while writing the `safetensors` files and filled them with `NaN`s. The benchmarks gave bad results. I have now fixed the `NaN` problem in the `safetensors` files. In the table below, I list the models and the most recent commit hash. Unfortunately, only 1 of them seems to have been evaluated correctly (thanks @clefourrier for the suggestion to make an Issue and for re-evaluating the model that explicitly failed). Most likely, the other 4 models ran with either an empty repo or a cache of the old `NaN`-filled `safetensors` files.
| model | hash |
|---|---|
| ❌ awnr/Mistral-7B-v0.1-signtensors-1-over-2 | 98d8ea1dedcbd1f0406d229e45f983a0673b01f4 |
| ❌ awnr/Mistral-7B-v0.1-signtensors-7-over-16 | 084bbc5b3d021c08c00031dc2b9830d41cae068d |
| ❌ awnr/Mistral-7B-v0.1-signtensors-3-over-8 | bb888e45945f39e6eb7d23f31ebbff2e38b6c4f2 |
| ✅ awnr/Mistral-7B-v0.1-signtensors-5-over-16 | 5ea13b3d0723237889e1512bc70dae72f71884d1 |
| ❌ awnr/Mistral-7B-v0.1-signtensors-1-over-4 | 0a90af3d9032740d4c23f0ddb405f65a2f48f0d4 |
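For anyone double-checking their own uploads, here is a minimal sketch (not my actual conversion script) for scanning a local `safetensors` shard for `NaN`-filled tensors before submitting; the shard filename is illustrative:

```python
# Minimal sketch (not my actual conversion script) for scanning a local
# safetensors shard for NaN values before uploading.
import torch
from safetensors import safe_open

def nan_tensors(path: str) -> list[str]:
    """Return the names of tensors in `path` that contain at least one NaN."""
    bad = []
    with safe_open(path, framework="pt") as f:
        for name in f.keys():
            if torch.isnan(f.get_tensor(name)).any():
                bad.append(name)
    return bad

# Hypothetical shard name; real repos split the weights across several shards.
print(nan_tensors("model-00001-of-00002.safetensors"))
```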
Hi!
Thanks for opening an issue.
Please provide the request files again, as requested in the FAQ.
Btw, I think I already checked this in the other issue, and the models indeed ran with the correct commits; can you check in the request files?
Hey @clefourrier, I really appreciate your feedback. I'm adding/clarifying some details.
Request Files
> Hi!
> Thanks for opening an issue.
> Please provide the request files again, as requested in the FAQ.
Sure thing!
- awnr/Mistral-7B-v0.1-signtensors-1-over-2
- awnr/Mistral-7B-v0.1-signtensors-7-over-16
- awnr/Mistral-7B-v0.1-signtensors-3-over-8
- awnr/Mistral-7B-v0.1-signtensors-1-over-4
Clarification
> Btw, I think I already checked this in the other issue, and the models indeed ran with the correct commits; can you check in the request files?
Thanks. I think `5-over-16` ran correctly. I do see that the models' statuses are `FINISHED` (with the exception of `1-over-4`, which appears to have been manually set to `PENDING`, though it's not visible in the queues). However, for 4 of the 5 models, the results don't improve after fixing the `NaN`s in the `safetensors` files.
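For reference, this is roughly how one can check a request file's status, assuming the requests live as JSON in a Hub dataset repo; the repo id and field names below are my assumptions, not confirmed values:

```python
# Minimal sketch for inspecting a leaderboard request file; the repo id and
# JSON field names are assumptions, not confirmed values.
import json
from huggingface_hub import hf_hub_download

def request_status(filename: str) -> str:
    """Download one request JSON file and report its recorded status."""
    local = hf_hub_download(
        repo_id="open-llm-leaderboard/requests",  # assumed repo id
        filename=filename,
        repo_type="dataset",
    )
    with open(local) as f:
        req = json.load(f)
    # Assumed fields: the evaluated commit ("revision") and a status string
    # such as FINISHED or PENDING.
    return f'{req.get("model")} @ {req.get("revision")}: {req.get("status")}'
```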
My script (now fixed) previously filled them with `NaN`s due to some incorrect pointer math. The models are a byproduct of an approximation algorithm which I applied to the `Mistral` model. Roughly, they degrade in quality as a function of the compression rate: 1/2, 7/16, 3/8, 5/16, 1/4.
Validation
To validate my observation, I performed the following sanity check (a minimal sketch follows below):

- delete my local copies of my experimental repos
- clone them at the most recent commit
- verify each model loads with `AutoClasses`
- for all weight matrices, calculate the relative error against `Mistral` (Frobenius norm)
This validates my belief that the repos are in a valid state.
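A minimal sketch of that last step, assuming both checkpoints fit in memory; `torch.linalg.norm` on a 2-D tensor gives the Frobenius norm, and the model ids are the ones from this thread:

```python
# Minimal sketch of the relative-error check; loading both models requires
# enough RAM for two 7B checkpoints.
import torch
from transformers import AutoModelForCausalLM

reference = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)
candidate = AutoModelForCausalLM.from_pretrained(
    "awnr/Mistral-7B-v0.1-signtensors-1-over-2", torch_dtype=torch.bfloat16
)

ref_params = dict(reference.named_parameters())
for name, param in candidate.named_parameters():
    ref = ref_params[name].float()
    # Relative error in the Frobenius norm against the original weights.
    rel_err = torch.linalg.norm(param.float() - ref) / torch.linalg.norm(ref)
    print(f"{name}: {rel_err.item():.4f}")
```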
Hypothesis
My suspicion is that most of the models ran while the leaderboard's copies of the corresponding repos were in incomplete or invalid states. I was doing a lot of concurrent operations (fixing the NaNs, uploading models, committing, pushing, enqueueing models to the leaderboard, etc.) and did delete some of the repos while diagnosing the errors I was getting while pushing updates. That's my flub; I'm just doing this as a hobby research project.
EDIT: removed unformatted $ symbols
What I mean is that we evaluated them with the precise commits that you gave above - so the checkpoints downloaded were those of the respective commits.
I'm OK with relaunching one of the finished models above, just in case there was indeed a problem with the repo - but I don't see how that could be the case unless you rewrote your model commit history.
> I'm OK with relaunching one of the finished models above, just in case there was indeed a problem with the repo - but I don't see how that could be the case unless you rewrote your model commit history.
I would appreciate it. The `1-over-2` model is probably the best one to try.

For some added context, I conducted this experiment on the old leaderboard while prototyping, and there the algorithm produced models of similar quality to the original model. I'm not sure what's different this time.
EDIT: typo
Extra detail: I uploaded the most recent `safetensors` files through the website instead of the CLI. There were no manual edits to the commit history.
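For what it's worth, a sketch of the equivalent upload through `huggingface_hub` instead of the website (the shard name is illustrative); each call like this appends a normal commit rather than rewriting history:

```python
# Minimal sketch of uploading a fixed shard via the Hub API; the shard name
# is hypothetical. Each upload creates a new commit on top of the history.
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="model-00001-of-00002.safetensors",  # hypothetical local file
    path_in_repo="model-00001-of-00002.safetensors",
    repo_id="awnr/Mistral-7B-v0.1-signtensors-1-over-2",
    commit_message="Replace NaN-filled shard with fixed weights",
)
```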
Hi @clefourrier. Could you take a look at the `results` files? I think the leaderboard may be showing the values from older runs.
I scrolled through:
and on cursory inspection, the values are similar. The older, `NaN`-filled `1-over-2` results are clearly worse by comparison.
The models I uploaded appear on the leaderboard with:
For comparison, `Mistral-7B-v0.1`'s results:
I'm running out the door, but I can peek at the other results files this evening.
Hi @awnr, please open a new issue for this next time, as it seems unrelated to the first one - it will make it easier for us to track things. Thanks for the report, though!
cc @alozowski, maybe there's an issue in the parser?
Curious! I will check the parser.
@alozowski Would it be better if I restarted from these instructions?
Based on these 2 columns from these 6 results files, I think they all ran correctly, but `1-over-2`, `7-over-16`, `3-over-8`, and `1-over-4` don't display correctly on the leaderboard: