Concerned about the use of PPL scores

#1 - opened by FallenMerick
But... what about Q8?

The mountain moved:

150 points better: PPL = 8.5850 +/- 0.05881 VS: BASE/ORIGINAL: PPL = 8.6012 +/- 0.05900

Looking at these numbers, the 150-point improvement corresponds to a PPL difference of about 0.016 (8.6012 - 8.5850). At that scale, the reported +/- 0.0588 uncertainty corresponds to roughly +/- 550 points, so the 150 points of improvement sits well within that margin of error, essentially making it meaningless as a marker for the improvement itself.

It's possible that this 32-bit remaster is indeed much more intelligent and capable than its 16-bit counterpart, but that improvement does not appear to be reflected in the PPL scores themselves. I would love to see some other benchmark differences between these two versions of PsyCet, and potentially some A/B blind testing as well if possible.
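To make the overlap concrete, here is a minimal sketch (plain Python, with the two reported scores hard-coded) that checks whether the PPL improvement is larger than the combined reported uncertainties. Adding the errors in quadrature assumes the two runs are independent, which they are not since they use the same test file, but even a single +/- 0.059 band dwarfs the 0.016 difference:

```python
# Reported scores from the model card (mean +/- reported uncertainty).
remaster_ppl, remaster_err = 8.5850, 0.05881
original_ppl, original_err = 8.6012, 0.05900

diff = original_ppl - remaster_ppl                          # improvement in PPL
combined_err = (remaster_err**2 + original_err**2) ** 0.5   # naive quadrature

print(f"PPL improvement:      {diff:.4f}")
print(f"combined uncertainty: {combined_err:.4f}")
print("improvement exceeds the uncertainty" if diff > combined_err
      else "improvement is inside the uncertainty band")
```

On the reported numbers this prints that the improvement is inside the uncertainty band.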

Completely agree - that is why you need to look at the scores of all the quants together.
What matters is the margin of movement across them.
If only a few quants show a change, it is error - if they all show a change, it is movement.
Likewise, the +/- is a RANGE, not just a margin of error.
And like perplexity itself, that range is an average - a terrible indicator of true mathematical movement, especially in something as large and complex as an LLM.
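As an illustration of that "all quants" check, here is a rough sketch (the Q8_0 numbers are the ones reported above; the other per-quant scores are hypothetical placeholders, not real measurements) that compares per-quant PPL and reports whether every quant moved in the same direction:

```python
# Hypothetical per-quant PPL scores - only Q8_0 reflects the reported values.
original = {"Q4_K_M": 8.93, "Q5_K_M": 8.71, "Q6_K": 8.64, "Q8_0": 8.6012}
remaster = {"Q4_K_M": 8.88, "Q5_K_M": 8.68, "Q6_K": 8.62, "Q8_0": 8.5850}

deltas = {q: original[q] - remaster[q] for q in original}
for quant, delta in deltas.items():
    print(f"{quant}: {delta:+.4f} PPL")

if all(d > 0 for d in deltas.values()):
    print("every quant improved -> consistent movement, not noise")
elif all(d < 0 for d in deltas.values()):
    print("every quant regressed -> consistent movement, not noise")
else:
    print("mixed signs across quants -> more likely measurement error")
```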

The actual size of the range is also a factor, and it is relative to the base-level perplexity - in this case 8-9.
I.e.: if the range were, say, 1.000 or higher, you may have an unstable model.
However, if the base perplexity were 15, an "error" range of 1 would be no issue.

This is because perplexity is not linear; it is closer to orders of magnitude.
It is only a relative, rough "30,000-foot" view of the model.
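For context on that non-linearity: perplexity is, by the usual definition, the exponential of the average negative log-likelihood per token, so the same absolute PPL change corresponds to a different change in per-token loss depending on the base level. A small sketch of that relationship (the 15.0 base is just an illustrative value, echoing the example above):

```python
import math

def mean_nll(ppl):
    # Perplexity is exp(average negative log-likelihood per token),
    # so the underlying per-token loss is just log(ppl).
    return math.log(ppl)

drop = 0.016  # the absolute PPL improvement reported above
for base in (8.6, 15.0):
    delta = mean_nll(base) - mean_nll(base - drop)
    print(f"base PPL {base:>4}: a {drop} drop = {delta:.5f} nats/token")
```

The same 0.016 drop represents a smaller change in per-token loss at a base of 15 than at 8.6, which is why the size of a range only means something relative to the base perplexity.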

And then there is the file used to calculate perplexity itself.
A wild card.
Change this, you change everything.

The original creator of this model tested it himself.
To put it mildly, he was blown away. His comments are all over his Discord - KoboldAI.
Members of his group familiar with the model tested it as well.
Without exception, all of them were impressed.
Likewise, a lot of real-world testing - of both the original and the new, improved version - was done prior to release to further confirm that the change was a positive one, so to speak.
These methods confirm or deny perplexity changes, and likewise reveal positive and/or negative changes as well.

As per Jeb Carter, creator of the model:

  • instruction following has improved dramatically.
  • new abilities have emerged.
  • he had to REDUCE the instruction sets used, because the model no longer needed instructions that were as specific.
  • prose, nuance and depth have all improved.
  • issues with the original model have disappeared.

This is not "something for nothing"; it is a method of ensuring maximum precision at every step right up to "GGUFing" the model.
The methods employed simply ensure that precision loss is minimized or eliminated.
It is mathematically and theoretically sound.
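To illustrate why keeping the intermediate steps at higher precision can matter, here is a rough, generic sketch (NumPy, toy tensors and a simple average - a hypothetical illustration, not the actual remaster pipeline) comparing a merge carried out entirely in fp16 against one accumulated in fp32 and cast down only at the end:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for several fp16 weight tensors being averaged together.
tensors = [rng.normal(0, 0.02, 200_000).astype(np.float16) for _ in range(8)]

# Reference merge computed in float64.
ref = np.mean([t.astype(np.float64) for t in tensors], axis=0)

# Pipeline A: stay in fp16, so every intermediate addition rounds to fp16.
acc16 = np.zeros(200_000, dtype=np.float16)
for t in tensors:
    acc16 = acc16 + t                      # result stays float16
merged_a = acc16 / np.float16(len(tensors))

# Pipeline B: accumulate in fp32, cast to fp16 only once at the very end.
acc32 = np.zeros(200_000, dtype=np.float32)
for t in tensors:
    acc32 += t.astype(np.float32)
merged_b = (acc32 / len(tensors)).astype(np.float16)

print("mean abs error, fp16 pipeline:", np.abs(merged_a.astype(np.float64) - ref).mean())
print("mean abs error, fp32 pipeline:", np.abs(merged_b.astype(np.float64) - ref).mean())
```

The fp16 pipeline rounds at every intermediate addition, so it typically lands measurably farther from the float64 reference than the pipeline that defers the cast to the last step - the same general reason for keeping higher precision until the final GGUF conversion.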

I believe what is being said, and it absolutely makes sense that maintaining 32-bit precision from the beginning would lead to better overall precision in the final product. I just wanted to point out why I don't believe PPL is very useful as a measurement of the improvement being seen.

I appreciate the insight!

The model stheno by sao10k is really good. It scores almost 70 on the leaderboard while having just 8B parameters. Is it possible to make an fp32 version of that model too?

Owner

I am aware of this model. If I recall correctly, there are a number of versions?
Do you mean one in particular?

From his post, it seems like 3.2 is not his final model. I don't know, maybe he will update the model. But it already outperforms even 70B models in my experience. An fp32 version would pack quite a punch, since the 3.2 version is kinda popular right now.
