Re-evaluating failed models after fixing the `NaN` weights
Would someone re-evaluate 4 of my models?
In late June, I uploaded and submitted a few models. Unfortunately, I did some bad pointer math while writing the `safetensors` files and filled them with `NaN`s. The benchmarks gave bad results. I have now fixed the `NaN` problem in the `safetensors` files. In the table below, I list the models and the most recent commit hash. Unfortunately, only 1 of them seems to have been evaluated correctly (thanks @clefourrier for the suggestion to make an Issue and for re-evaluating the model that explicitly failed). Most likely, the other 4 models ran with either an empty repo or a cache of the old `NaN`-filled `safetensors` files.
| model | hash |
|---|---|
| ❌ awnr/Mistral-7B-v0.1-signtensors-1-over-2 | 98d8ea1dedcbd1f0406d229e45f983a0673b01f4 |
| ❌ awnr/Mistral-7B-v0.1-signtensors-7-over-16 | 084bbc5b3d021c08c00031dc2b9830d41cae068d |
| ❌ awnr/Mistral-7B-v0.1-signtensors-3-over-8 | bb888e45945f39e6eb7d23f31ebbff2e38b6c4f2 |
| ✅ awnr/Mistral-7B-v0.1-signtensors-5-over-16 | 5ea13b3d0723237889e1512bc70dae72f71884d1 |
| ❌ awnr/Mistral-7B-v0.1-signtensors-1-over-4 | 0a90af3d9032740d4c23f0ddb405f65a2f48f0d4 |
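For anyone double-checking their own uploads, here is a minimal sketch (not my actual conversion script) for scanning a local `safetensors` shard for `NaN`-filled tensors before submitting; the shard filename is illustrative:

```python
# Minimal sketch (not my actual conversion script) for scanning a local
# safetensors shard for NaN values before uploading.
import torch
from safetensors import safe_open

def nan_tensors(path: str) -> list[str]:
    """Return the names of tensors in `path` that contain at least one NaN."""
    bad = []
    with safe_open(path, framework="pt") as f:
        for name in f.keys():
            if torch.isnan(f.get_tensor(name)).any():
                bad.append(name)
    return bad

# Hypothetical shard name; real repos split the weights across several shards.
print(nan_tensors("model-00001-of-00002.safetensors"))
```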
Hi!
Thanks for opening an issue.
Please provide the request files again, as requested in the FAQ.
Btw, I think I already checked this in the other issue, and the models indeed ran with the correct commits; can you check in the request files?
Hey @clefourrier, I really appreciate your feedback. I'm adding/clarifying some details.
Request Files
> Hi!
> Thanks for opening an issue.
> Please provide the request files again, as requested in the FAQ.
Sure thing!
- awnr/Mistral-7B-v0.1-signtensors-1-over-2
- awnr/Mistral-7B-v0.1-signtensors-7-over-16
- awnr/Mistral-7B-v0.1-signtensors-3-over-8
- awnr/Mistral-7B-v0.1-signtensors-1-over-4
Clarification
> Btw, I think I already checked this in the other issue, and the models indeed ran with the correct commits; can you check in the request files?
Thanks. I think `5-over-16` ran correctly. I do see that the models' statuses are `FINISHED` (with the exception of `1-over-4`, which appears to have been manually set to `PENDING`, though it's not visible in the queues). However, for 4 of the 5 models, the results don't improve after fixing the `NaN`s in the `safetensors` files.
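For reference, this is roughly how one can check a request file's status, assuming the requests live as JSON in a Hub dataset repo; the repo id and field names below are my assumptions, not confirmed values:

```python
# Minimal sketch for inspecting a leaderboard request file; the repo id and
# JSON field names are assumptions, not confirmed values.
import json
from huggingface_hub import hf_hub_download

def request_status(filename: str) -> str:
    """Download one request JSON file and report its recorded status."""
    local = hf_hub_download(
        repo_id="open-llm-leaderboard/requests",  # assumed repo id
        filename=filename,
        repo_type="dataset",
    )
    with open(local) as f:
        req = json.load(f)
    # Assumed fields: the evaluated commit ("revision") and a status string
    # such as FINISHED or PENDING.
    return f'{req.get("model")} @ {req.get("revision")}: {req.get("status")}'
```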
My script (now fixed) previously filled them with `NaN`s due to some incorrect pointer math. The models are a byproduct of an approximation algorithm which I applied to the `Mistral` model. Roughly, they degrade in quality as a function of the compression rate: 1/2, 7/16, 3/8, 5/16, 1/4.
Validation
To validate my observation, I performed the following sanity check (a minimal sketch follows below):

- delete my local copies of my experimental repos
- clone them at the most recent commit
- verify each model loads with `AutoClasses`
- for all weight matrices, calculate the relative error against `Mistral` (Frobenius norm)
This validates my belief that the repos are in a valid state.
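A minimal sketch of that last step, assuming both checkpoints fit in memory; `torch.linalg.norm` on a 2-D tensor gives the Frobenius norm, and the model ids are the ones from this thread:

```python
# Minimal sketch of the relative-error check; loading both models requires
# enough RAM for two 7B checkpoints.
import torch
from transformers import AutoModelForCausalLM

reference = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)
candidate = AutoModelForCausalLM.from_pretrained(
    "awnr/Mistral-7B-v0.1-signtensors-1-over-2", torch_dtype=torch.bfloat16
)

ref_params = dict(reference.named_parameters())
for name, param in candidate.named_parameters():
    ref = ref_params[name].float()
    # Relative error in the Frobenius norm against the original weights.
    rel_err = torch.linalg.norm(param.float() - ref) / torch.linalg.norm(ref)
    print(f"{name}: {rel_err.item():.4f}")
```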
Hypothesis
My suspicion is that most of the models ran while the leaderboard's copies of the corresponding repos were in incomplete or invalid states. I was doing a lot of concurrent operations (fixing the NaNs, uploading models, committing, pushing, enqueueing models to the leaderboard, etc.) and did delete some of the repos while diagnosing the errors I was getting while pushing updates. That's my flub; I'm just doing this as a hobby research project.
EDIT: removed unformatted $ symbols
What I mean is that we evaluated them with the precise commits that you gave above - so the checkpoints downloaded were those of the respective commits.
I'm OK with relaunching one of the finished models above, just in case there was indeed a problem with the repo - but I don't see how that could be the case unless you rewrote your model commit history.
> I'm OK with relaunching one of the finished models above, just in case there was indeed a problem with the repo - but I don't see how that could be the case unless you rewrote your model commit history.
I would appreciate it. The `1-over-2` model is probably the best one to try.

For some added context, I conducted this experiment on the old leaderboard while prototyping, and there the algorithm produced models of similar quality to the original model. I'm not sure what's different this time.
EDIT: typo
Extra detail: I uploaded the most recent `safetensors` files through the website instead of the CLI. There were no manual edits to the commit history.
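For what it's worth, a sketch of the equivalent upload through `huggingface_hub` instead of the website (the shard name is illustrative); each call like this appends a normal commit rather than rewriting history:

```python
# Minimal sketch of uploading a fixed shard via the Hub API; the shard name
# is hypothetical. Each upload creates a new commit on top of the history.
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="model-00001-of-00002.safetensors",  # hypothetical local file
    path_in_repo="model-00001-of-00002.safetensors",
    repo_id="awnr/Mistral-7B-v0.1-signtensors-1-over-2",
    commit_message="Replace NaN-filled shard with fixed weights",
)
```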
Hi @clefourrier. Could you take a look at the `results` files? I think the leaderboard may be showing the values from older runs.
I scrolled through:
and on cursory inspection, the values are similar. The older, `NaN`-filled `1-over-2` results are clearly worse by comparison.
The models I uploaded appear on the leaderboard with:
For comparison, `Mistral-7B-v0.1`'s results:
I'm running out the door, but I can peek at the other results files this evening.
Hi @awnr, please open a new issue for this next time, as it seems unrelated to the first one - it will make it easier for us to track things. Thanks for the report, though!
cc @alozowski, maybe there's an issue in the parser?
Curious! I will check the parser.
@alozowski Would it be better if I restarted from these instructions?
Based on these 2 columns from these 6 results files, I think they all ran correctly, but `1-over-2`, `7-over-16`, `3-over-8`, and `1-over-4` don't display correctly on the leaderboard: