re-evaluating failed models after fixing the NaN weights

#869
by awnr - opened

Would someone re-evaluate 4 of my models?

In late June, I uploaded and submitted a few models. Unfortunately, I did some bad pointer math while writing the safetensors files and filled them with NaNs, so the benchmarks gave bad results. I have since fixed the NaN problem in the safetensors files. In the table below, I list the models and their most recent commit hashes. Only one of the five seems to have been evaluated correctly (thanks @clefourrier for the suggestion to open an Issue and for re-evaluating the model that explicitly failed).

Most likely, the other 4 models ran with either an empty repo or a cache of the old NaN-filled safetensors files.

| status | model | hash |
|---|---|---|
| ❌ | awnr/Mistral-7B-v0.1-signtensors-1-over-2 | 98d8ea1dedcbd1f0406d229e45f983a0673b01f4 |
| ❌ | awnr/Mistral-7B-v0.1-signtensors-7-over-16 | 084bbc5b3d021c08c00031dc2b9830d41cae068d |
| ❌ | awnr/Mistral-7B-v0.1-signtensors-3-over-8 | bb888e45945f39e6eb7d23f31ebbff2e38b6c4f2 |
| ✅ | awnr/Mistral-7B-v0.1-signtensors-5-over-16 | 5ea13b3d0723237889e1512bc70dae72f71884d1 |
| ❌ | awnr/Mistral-7B-v0.1-signtensors-1-over-4 | 0a90af3d9032740d4c23f0ddb405f65a2f48f0d4 |
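For anyone who wants to reproduce the check, here is a minimal sketch (my own illustration, not the leaderboard's code) of scanning a downloaded checkpoint for NaNs; it assumes the `safetensors` and `huggingface_hub` packages, and uses the first model from the table above.

```python
# Minimal sketch: scan every tensor in a downloaded checkpoint for NaNs.
from pathlib import Path

import torch
from huggingface_hub import snapshot_download
from safetensors import safe_open

local_dir = snapshot_download("awnr/Mistral-7B-v0.1-signtensors-1-over-2")

for shard in sorted(Path(local_dir).glob("*.safetensors")):
    with safe_open(str(shard), framework="pt") as f:
        for key in f.keys():
            if torch.isnan(f.get_tensor(key)).any():
                print(f"NaNs found in {shard.name}: {key}")
```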
Open LLM Leaderboard org

Hi!
Thanks for opening an issue.
Please provide the request files again, as requested in the FAQ.
Btw, I think I already checked this in the other issue, and the models indeed ran with the correct commits; can you check in the request files?

Hey @clefourrier I really appreciate your feedback. I'm adding/clarifying some details.

Request Files

Hi!
Thanks for opening an issue.
Please provide the request files again, as requested in the FAQ.

Sure thing!

Clarification

Btw, I think I already checked this in the other issue, and the models indeed ran with the correct commits; can you check in the request files?

Thanks. I think 5-over-16 ran correctly. I do see that the models' statuses are FINISHED (with the exception of *1-over-4, which appears to have been manually set to PENDING, though it's not visible in the queues). However, for 4 of the 5 models, the results don't improve after fixing the NaNs in the safetensors files.

image.png
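For completeness, here is a rough sketch of how the statuses in the request files can be inspected. The dataset ID (`open-llm-leaderboard/requests`) and the JSON fields (`status`, `revision`) are assumptions about the leaderboard's layout; adjust them to whatever the FAQ specifies.

```python
# Sketch only: list a model's request file(s) and print the recorded status and commit.
import json

from huggingface_hub import HfApi, hf_hub_download

model = "awnr/Mistral-7B-v0.1-signtensors-1-over-4"
requests_repo = "open-llm-leaderboard/requests"  # assumed dataset name

api = HfApi()
request_files = [
    f for f in api.list_repo_files(requests_repo, repo_type="dataset")
    if f.startswith(model)
]
for path in request_files:
    local = hf_hub_download(requests_repo, path, repo_type="dataset")
    with open(local) as fh:
        request = json.load(fh)
    print(path, request.get("status"), request.get("revision"))
```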

My script (now fixed) previously filled them with NaNs due to some incorrect pointer math. The models are a byproduct of an approximation algorithm which I applied to the Mistral model. Roughly, they degrade in quality as a function of the compression rate: 1/2, 7/16, 3/8, 5/16, 1/4.

Validation

To validate my observation, I performed the following sanity check:

  • delete my local copies of my experimental repos
  • clone them at the most recent commit
  • verify each model loads with AutoClasses
  • for all weight matrices, calculate the relative error against Mistral (Frobenius norm)

This validates my belief that the repos are in a valid state.
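For reference, a minimal sketch of the last step above: the relative Frobenius-norm error of every weight tensor against the reference Mistral checkpoint. The revision is the commit hash from the table; bfloat16 is an assumption just to keep memory manageable.

```python
# Sketch: per-tensor relative Frobenius error ||W_new - W_ref||_F / ||W_ref||_F.
import torch
from transformers import AutoModelForCausalLM

reference = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)
candidate = AutoModelForCausalLM.from_pretrained(
    "awnr/Mistral-7B-v0.1-signtensors-1-over-2",
    revision="98d8ea1dedcbd1f0406d229e45f983a0673b01f4",  # commit from the table
    torch_dtype=torch.bfloat16,
)

ref_params = dict(reference.named_parameters())
for name, param in candidate.named_parameters():
    w_new = param.detach().float()
    w_ref = ref_params[name].detach().float()
    rel_err = torch.linalg.norm(w_new - w_ref) / torch.linalg.norm(w_ref)
    # A single NaN anywhere makes rel_err NaN, so this doubles as a NaN check.
    print(f"{name}: relative error = {rel_err.item():.4f}")
```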

Hypothesis

My suspicion is that most of the models ran while the leaderboard's copies of the corresponding repos were in incomplete or invalid states. I was doing a lot of concurrent operations (fixing the NaNs, uploading models, committing, pushing, enqueuing models to the leaderboard, etc.) and did delete some of the repos while diagnosing the errors I was getting while pushing updates. That's my flub; I'm just doing this as a hobby research project.

EDIT: removed unformatted $ symbols

Open LLM Leaderboard org

What I mean is that we evaluated them with the precise commits that you gave above - so the checkpoints downloaded were those of the respective commits.
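(For reference, this is what pinning to a commit means in practice: passing the revision at download time fixes the files to exactly that commit, regardless of later pushes to main. A minimal sketch, using the repo and hash from the table above.)

```python
# Sketch: downloading with an explicit revision pins the files to that exact commit.
from huggingface_hub import snapshot_download

path = snapshot_download(
    "awnr/Mistral-7B-v0.1-signtensors-1-over-2",
    revision="98d8ea1dedcbd1f0406d229e45f983a0673b01f4",
)
print(path)  # local snapshot containing exactly the files at that commit
```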

I'm OK with relaunching one of the finished models above, just in case there was indeed a problem with the repo - but I don't see how that could be the case unless you rewrote your model commit history.

I'm OK with relaunching one of the finished models above, just in case there was indeed a problem with the repo - but I don't see how that could be the case unless you rewrote your model commit history.

I would appreciate it. The 1-over-2 model is probably the best one to try.

For some added context: I ran this experiment on the old leaderboard while prototyping, and there the algorithm produced models of similar quality to the original model. I'm not sure what's different this time.

EDIT: typo

Extra detail: I uploaded the most recent safetensors files through the website instead of the CLI. There were no manual edits to the commit history.
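(A quick way to double-check that the history was not rewritten, assuming a recent `huggingface_hub` that provides `list_repo_commits`:)

```python
# Sketch: list the repo's commit history to confirm it was not rewritten.
from huggingface_hub import HfApi

api = HfApi()
for commit in api.list_repo_commits("awnr/Mistral-7B-v0.1-signtensors-1-over-2"):
    print(commit.commit_id, commit.created_at, commit.title)
```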

Hi @clefourrier, could you take a look at the results files? I think the leaderboard may be showing the values from older runs.

I scrolled through:

and, on cursory inspection, the values are similar. The older, NaN-filled 1-over-2 results are worse by comparison.

The models I uploaded appear on the leaderboard with:
image.png

For comparison, Mistral-7B-v0.1's results:
image.png

I'm running out the door, but I can peek at the other results files this evening.

Open LLM Leaderboard org
edited Aug 14

Hi @awnr, please open a new issue for this next time, as it seems unrelated to the first one - it will make it easier for us to track things. Thanks for the report, though!

cc @alozowski, maybe there's an issue in the parser?

Open LLM Leaderboard org

Curious! I will check the parser.

Open LLM Leaderboard org

Hi @awnr ,

Thank you for helping us find the Leaderboard bug! Unfortunately, our parser didn't take new runs into account, but it should be correct now. Could you please check your models?
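(Not the leaderboard's actual parser code, just an illustration of the kind of fix described: when a model has results files from several runs, the newest one should win. The `results_*.json` naming is an assumed layout.)

```python
# Illustration only (assumed file layout): pick the most recent results file
# by its timestamped name, e.g. results_2024-08-10T12-00-00.json.
from pathlib import Path

def latest_results_file(model_dir: str) -> Path:
    candidates = sorted(Path(model_dir).glob("results_*.json"))
    if not candidates:
        raise FileNotFoundError(f"no results files under {model_dir}")
    return candidates[-1]  # ISO-style timestamps sort lexicographically
```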

Open LLM Leaderboard org

It seems the problem has been solved, so I'm closing this discussion due to inactivity. Please feel free to open a new one if you have any questions!
Screenshot 2024-08-21 at 14.31.53.png

alozowski changed discussion status to closed
