Request: 2 models - performance increase (issue with imatrix? - may be affecting older quants - many models)
Hey;
This is a strange request, but it is based on some investigation and on the quant request you did for me - thanks again!
It seems "older" quants may need requanting - especially imatrix ones.
I have tested models quanted some time ago against ones "quanted today", so to speak.
There is a stark difference in operation; in some cases quants that were broken or close to broken (especially at low BPW) now operate perfectly, or close to it, after being re-quanted (same imatrix dataset / .dat file) at low BPW.
(I suspect "higher" BPW operation has improved too.)
This seems to hit Llama 2 and/or MoE models especially hard - and likewise the improvement is huge.
(still investigating this, maybe other ARCH(s) too)
The quant you did for me in the past day:
https://huggingface.co/mradermacher/bagel-dpo-8x7b-v0.2-i1-GGUF
Operates well at IQ1_S, and perfectly fine at IQ1_M (with my enhancements it is +++ on top of this)... at 42 t/s.
I also downloaded a "LLaMA2-13B-Tiefighter" quant (9 months old) - it too was barely usable/broken at low BPW; then I requanted it (with imatrix - same old ".dat" file) and it works even at minimal BPW, whereas before it was broken.
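For anyone who wants to reproduce this kind of before/after check, here is a minimal sketch of that re-quant step - reusing the old imatrix .dat with a current llama.cpp build. This is an assumed outline rather than mradermacher's actual pipeline; the file names and paths are hypothetical, and llama-quantize is assumed to be on the PATH.

```python
# Minimal sketch: re-quant an existing FP16 GGUF with a previously generated
# imatrix .dat file, using a current llama.cpp build.
# Hypothetical file names - adjust to your setup.
import subprocess

MODEL_F16 = "LLaMA2-13B-Tiefighter.f16.gguf"   # existing full-precision GGUF
OLD_IMATRIX = "tiefighter.imatrix.dat"         # the "old" .dat file, reused as-is

for qtype in ["IQ1_S", "IQ1_M", "IQ2_XXS"]:
    out = f"LLaMA2-13B-Tiefighter.{qtype}.gguf"
    # llama-quantize usage: [--imatrix FILE] <input.gguf> <output.gguf> <type>
    subprocess.run(
        ["llama-quantize", "--imatrix", OLD_IMATRIX, MODEL_F16, out, qtype],
        check=True,
    )
```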
I do not know how widespread this issue is.
I am testing / re-quanting some "older" quanted models this week and will see how that goes.
This could be all "older" MoEs (3+ months old? older?) and/or L2s... or it could be more widespread.
It looks like improvements in core llama.cpp are helping - both in quanting (and imatrix generation) and in running the quants.
And/or some improvement(s) may be negatively affecting "older quants" of a model (???)
If I could please ask you to re-quant these two models:
https://huggingface.co/OmnicromsBrain/NeuralStar_AlphaWriter_4x7b
https://huggingface.co/mlabonne/Beyonder-4x7B-v2
As they also have performance issues / are not working (or working poorly) at LOW BPW - in their current form.
(quanted 6 and 9 months ago respectively)
NOTE:
Downloaded a very recent quant of another "MoE" (8x7b) -> tested; it works fine / great at IQ1_S / IQ1_M
(prometheus-8x7b-v2.0.i1) and... with enhancements even better.
This is what led me to request "bagel-dpo-8x7b-v0.2" - which now runs perfectly / off the scale even at LOW BPW.
RE: Enhancements:
I will be releasing a paper/doc about the enhancements / how to apply them shortly.
(public - enhances all models)
Thank you in advance.
All Mixtral 8x7B based models older than a few months are known to have issues due to some breaking llama.cpp changes. We are aware of it and many of them were already requantized. Please report any such models we missed so we can requantize them.
Do you get any warning in the llama.cpp console when opening degraded non-Mixtral-8x7B-based models? Can you try to open those models with an old version of llama.cpp released at the time they were quantized? llama.cpp sometimes implements changes that break old models. These compatibility issues usually mainly affect MoE models and models quantized shortly after a new base model was released.
Just tested a "Command-R 35B" (I quanted it 5 months ago) (imatrix) against a new one quanted today (LOW BPW, to highlight changes).
Definite performance increase. In some cases startling.
NOTE: I was using the "old" .dat file from the imatrix run 5 months ago.
Running more tests;
RE: llama.cpp - I was just running the models and noticed the output / generation issues.
Likewise there is an uptick in instruction following from MoEs, comparing ones quanted 6/9 months ago to ones made very recently.
I am aware of the MoE changes/issues; but this is also showing up with L2s (broken/barely working => good to excellent), and now Command-R (medium to strong performance increase).
Older Mistrals (7B) also seem to be affected; unclear how much - more testing needed.
Did the same test -> downloaded the old quant, tested it, then newly quanted and tested (strong to drastically better) => still have more to do here to confirm.
Not sure at this point how strong the changes/improvements are at mid-level / top quants yet.
Still mapping things out / testing / quanting...
NOTE: The two models I noted above (original discussion post) have performance issues / are broken.
https://huggingface.co/OmnicromsBrain/NeuralStar_AlphaWriter_4x7b
https://huggingface.co/mlabonne/Beyonder-4x7B-v2
Update:
Spot-checked MythoMax 13B; IQ1_S; the old one is completely broken, a new one made today -> works perfectly.
FILE SIZE (old): 3,065,049
FILE SIZE (new): 2,830,749
Could not compare IQ1_Ms; one was not available at the repo. The one I built today also works perfectly.
Spot-checked the Q6 (non-imatrix):
OLD:   SIZE: 10,617,598 || PPL = 6.2452 +/- 0.10317 || 21.46 t/s
Today: SIZE: 10,428,849 || PPL = 6.2467 +/- 0.10319 || 21.86 t/s
Note file size difference.
Also did a "temp=0" core test to see generation differences - yep, they are different there too.
The Q6 shows some differences; generation-wise the new one is a hair better (?) - both versions are working correctly. (A rough sketch of this kind of comparison follows below.)
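For reference, a minimal sketch of this kind of old-vs-new comparison using llama.cpp's llama-perplexity and llama-cli tools. It is an assumed outline, not the exact commands used above; the file names, eval text, prompt, and -ngl value are placeholders.

```python
# Minimal sketch: run the same PPL measurement and the same temp=0 generation
# against an old and a new quant of the same model, so the results are
# directly comparable. Hypothetical file names.
import subprocess

QUANTS = {"old": "mythomax-13b.OLD.Q6_K.gguf", "new": "mythomax-13b.NEW.Q6_K.gguf"}
EVAL_TEXT = "wiki.test.raw"   # any fixed text works for a relative PPL comparison
PROMPT = "Write the opening scene of a mystery novel set in 1920s Cairo."

for tag, path in QUANTS.items():
    print(f"=== {tag}: {path} ===")
    # perplexity over the same text -> comparable PPL numbers
    subprocess.run(["llama-perplexity", "-m", path, "-f", EVAL_TEXT, "-ngl", "99"], check=True)
    # temp=0 generation: deterministic sampling, so the two outputs can be diffed
    subprocess.run(["llama-cli", "-m", path, "-p", PROMPT, "--temp", "0",
                    "-n", "256", "-ngl", "99"], check=True)
```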
I am just giving you a heads up; the broken ones are... well... they make everyone look bad.
The poor souls new to LLMs have no idea; they download them, they're broken... and then they blame the model / repo / open source?
I want to be clear, I am not blaming or pointing fingers - just food for thought.
Anyhow, I would not even have noticed this if I were not researching/testing low BPW models right now.
Based on what I found, I will likely be re-quanting a lot of my top models at my repo - especially the imatrix ones...
Thank you again for all you do;
As nico has pointed out, yes, there are known issues specifically with moe models, with mixtral models, and general bugs. Lots :)
It is a moving target, and low-bpw quants have had a number of regressions and improvements over the last few months. We have requanted the more important models, but it's very hard to know which ones have been affected and which are not - in general, the quants were the state of the art at that particular time, i.e. it was what you got back when the model came out. What's worse, there are regressions that make older models worse, or break them completely, too.
If you find models that are affected, I'll be happy to requant, so keep them coming. I'll redo the ones you mentioned.
Thanks so much. I will let you know.
The paper/doc I am working on, in part, looks at using low BPW quants with some simple augmentation (no model modification or quanting changes).
It also looks at / reveals methods for regular / top quant augmentation too.
Just FYI:
For "non broken" models (Imatrix) it seems that generation is about the same, but instruction following has significantly improved.
Tested more Mythomax "new vs old" quants - change showed up during testing.
BTW; this quant you did:
https://huggingface.co/mradermacher/bagel-dpo-8x7b-v0.2-i1-GGUF
IQ1_M operates at incredible power levels; just shocking how well it works at this tiny BPW quant level.
Yeah, although with many models, you start to get typos and worse with iq2_x* and lower. One could venture the theory that older models lose less because they have less to start with, similarly to larger models being able to withstand higher quantisation errors, seemingly because they are less sensitive/dense. All street knowledge only, of course, which is usually wrong anyway :)
But imatrix quantisation has improved in multiple steps since February or so - the current quant quality for these is substantially better than what it was in the beginning, even if you ignore the regressions for some model types.
According to bartowski the latest breaking changes seem to only affect split-tensor MoE models, which mainly applies to old Mixtral models: https://huggingface.co/posts/bartowski/894091265291588
Old mixtral model quants may be broken!
Recently Slaren over on llama.cpp refactored the model loader - in a way that's super awesome and very powerful - but with it came breaking of support for "split tensor MoE models", which applies to older mixtral models
You may have seen my upload of one such older mixtral model, jondurbin/bagel-dpo-8x7b-v0.2, and with the newest changes it seems to be able to run without issue
If you happen to run into issues with any other old mixtral models, drop a link here and I'll try to remake them with the new changes so that we can continue enjoying them :)
The doc/paper will address methods to actively (and automatically) correct the issues with IQ1s and IQ2s - in other words turn them into coherent models at this quant level.
Roughly, it will force the models to make better decisions per token (cumulatively too) to correct, and in many cases completely eliminate, the "gibberish, repeats, and generally poor generations" at IQ1s and IQ2s. This radically addresses the root issues associated with low BPW in real time.
Last week this was "theory"; now it is real, and I am mapping the various ARCHs / quants / model parameters for maximum effect, as well as fine-tuning.
Roughly: 7B+ operates at IQ1_S, 3B+ operates at IQ1_M, and all model sizes operate from IQ2_XXS up, with the methods applied.
There are also some outliers; e.g. Gemma 2 (9B) operates at IQ1_S without additional "help", so to speak.
The key was the calibration methods and steps... and "hacking" llama.cpp, koboldcpp, and text-generation-webui to see programmatically how token generation was actually being processed (I have a background in programming), then applying this, along with, ah... a fair bit of prompt/generation experience.
The other side of the equation: IQ3+, IQ4/Q4 and up - the optimizations improve generational performance there too, right up to and including Q8.
Great to hear. I'm for sure really interested in your paper. I spent the past few months and thousands of GPU hours measuring the quality of all the quants of many models. In case you are interested, I posted the measurements of the entire Qwen 2.5 series of models 3 days ago to https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/2#674a7958ce9bc37b8e33cf55 with raw data available under https://www.nicobosshard.ch/LLM-Eval_Quality_v1.tar.zst. Measurements and scripts to generate nice looking plots of previous models can be found under https://www.nicobosshard.ch/LLM-Eval_v2.tar.zst. For example, for Llama 3.1 405B you can find the plots under https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/2#6732972fc7b41d86099eb5d9. Regarding your measurement methodology, I recommend using Mean KLD (mean KL-divergence) and Mean Ξp (correct token probability) instead of perplexity, as they give a much better indication of quality and can be compared across multiple models and versions of llama.cpp. If you need any scripts to automate quality measurements just let me know and I can share mine.
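(For reference, a minimal sketch of how such KLD measurements can be taken with llama.cpp's llama-perplexity tool - an assumed outline, not nicoboss's actual scripts; file names and -ngl are placeholders. The tool prints Mean KLD and correct-token-probability statistics at the end of each run.)

```python
# Minimal sketch: dump reference logits from a full-precision GGUF once, then
# score each quant against them to get Mean KLD and token-probability stats.
# Hypothetical file names.
import subprocess

REFERENCE = "model.f16.gguf"        # full-precision (or best available) reference
EVAL_TEXT = "wiki.test.raw"         # fixed evaluation text, same for every run
LOGITS = "reference.logits.bin"     # reference logits, written once and reused

# 1) generate the reference logits
subprocess.run(["llama-perplexity", "-m", REFERENCE, "-f", EVAL_TEXT,
                "--kl-divergence-base", LOGITS, "-ngl", "99"], check=True)

# 2) evaluate each quant against the reference logits
for quant in ["model.IQ1_S.gguf", "model.IQ2_XXS.gguf", "model.Q4_K_M.gguf"]:
    subprocess.run(["llama-perplexity", "-m", quant,
                    "--kl-divergence-base", LOGITS, "--kl-divergence",
                    "-ngl", "99"], check=True)
```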
@nicoboss
Excellent! That is fantastic.
Quick look - yes, it mirrors my "raw testing" (IQ1s, IQ2s) results before "optimization".
Your charts will help with the fine-tuning too,
i.e. preventing overcooking the optimization and tracking the operational/performance "line" -
thanks, this is a great help!
HMM; IQ1_S vs IQ1_M:
IQ1_M is twice as strong (good?) relative to IQ1_S.
This is true for both instruction following and generational output when comparing these two.
Heads up:
Models below 1B (gguf) are handicapped.
This is in part due to how llama.cpp handles models under 1B.
If you test source vs Q8 - this may show up.
I don't know if this is still the case. This was a few months ago.
RE: PPL
- Yes, it sucks. A single number to determine the "quality" (or not) of a multi-billion-parameter "dynamic equation"...; I fought against that for a while.
Only use PPL to measure relative change; then back this up with real-world testing (positive/negative results and/or changes).
"Correct token probability" IS the critical issue with IQ1s (S and M) and IQ2s (to a limited extent).
Literally the core problem.
The "comprehension" curve occurs at/about IQ2s quants - XXS, XS, S .
At IQ2_M full stability (instruction/generation) occurs for all parameter sized models from 1B and up.
To be clear, 1B really needs at least IQ3_M; however everything 3B or high all operate well at IQ2XXS and up.
With MOES: 8X7B -> IQ1M ; 4X7B -> IQ1M, but IQ2XXS is better.
As noted, GEMMA 9B is unusually strong, even at IQ1S , without optimization.
Other ARCHs run the gambit slightly.
The doc/paper will address methods to actively (and automatically) correct the issues with IQ1s and IQ2s
Hmm, from what I saw, the problem would affect imatrix generation, too, or am I reading it wrong? If the imatrix is affected, you might be able to improve those quants, but the damage is done, and a new imatrix should perform better. And whatever correction you do should then help even more.
It seems (at the moment) that if I go back and use the "old" imatrix.dat file with the "newly quanted" model, the new model works better, period.
I have not tested creating a brand-new imatrix .dat file from the source/fp16 GGUF and then applying it to a "new quant".
If I recall, there have been some minor changes to imatrix.exe/llama-imatrix.exe - so maybe a boost there too?
(On the list to investigate).
SIDE NOTE:
Trying "hacking" imat files -> creating a "poor model" (weight/gating damage), creating a imat.dat file from this THEN applying the "imat.dat" it to a "good model" (same model name, size etc) - this works. This is a separate line of study.
Currently it looks like quants older than 60 days (and definitely 90+ days) could use an immediate "REFRESH/RE-QUANT".
120+ days => big jump in performance (for low BPW -> the difference between broken and functional).
You are correct - the "correction" methods rely on functioning quants, and do even better with "freshly minted" quants (especially on the instruction-following side).
However, everyone - right now - would benefit from "fresh quants".
I am looking at which models (at my repo) I will re-quant and re-upload at the moment.
Need to do a bit more testing too - especially on the "re-imatrix" performance boost (or not).
UPDATE:
Tested generation of a new imatrix .dat and a new quant, with today's llama.cpp release.
Model: MythoMax 13B L2.
Confirmed:
-> The new imatrix .dat is different from the one made in July 2024 (same root text file for both).
-> New quant PPL is LOWER vs using the "old" imatrix .dat from July 2024. (230 points lower, IQ1_S)
-> Operation of the model is different (root test and at "temp").
-> Seems better at instruction following and output generation (relative to the IQ1_S version previously generated using the "old" imatrix .dat from July 2024).
-> Instruction following is correct and output generation is coherent.
(I use a strong prompt with "errors" in it to test instruction-following nuance; the prompt itself also pushes the model hard.)
-> 59 t/s on a low-end Nvidia 16GB card (4060 Ti).
I also ran a test generating the quant using the "old" imatrix .dat file, but with convert / quantize.exe from today's llama.cpp release too. (A rough sketch of the full pipeline follows below.)
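For completeness, a rough sketch of the full "fresh quant" pipeline described above (new conversion, new imatrix, new quant, PPL check) using current llama.cpp tools. This is an assumed outline, not the exact commands used; paths, file names, and the calibration/eval texts are placeholders.

```python
# Minimal sketch: end-to-end fresh quant with today's llama.cpp tools.
# Hypothetical paths and file names.
import subprocess

HF_DIR = "./MythoMax-L2-13b"              # local directory with the downloaded HF weights
F16 = "mythomax-13b.f16.gguf"
IMATRIX = "mythomax-13b.new.imatrix.dat"
CALIB_TEXT = "calibration.txt"            # same root text file as the older imatrix run
EVAL_TEXT = "wiki.test.raw"

# 1) convert the source model to an FP16 GGUF with the current converter
subprocess.run(["python", "convert_hf_to_gguf.py", HF_DIR,
                "--outfile", F16, "--outtype", "f16"], check=True)

# 2) build a fresh imatrix with today's llama-imatrix
subprocess.run(["llama-imatrix", "-m", F16, "-f", CALIB_TEXT,
                "-o", IMATRIX, "-ngl", "99"], check=True)

# 3) quantize with the fresh imatrix
subprocess.run(["llama-quantize", "--imatrix", IMATRIX,
                F16, "mythomax-13b.IQ1_S.gguf", "IQ1_S"], check=True)

# 4) check perplexity on a fixed eval text (relative comparison vs the old quant)
subprocess.run(["llama-perplexity", "-m", "mythomax-13b.IQ1_S.gguf",
                "-f", EVAL_TEXT, "-ngl", "99"], check=True)
```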
Wow is all I can say :) Refreshing all quants for all models will, of course, be out of the question. I will, however, happily requant anything that's requested or otherwise comes to my attention that would benefit.
Yeah...; 30-60 days of uploads over here - ahh... gonna be real picky about my models/quants.
But ... 13000+ models. BOOM.
There is the other side of the coin - a lot of people (based on forum/discord chats/comments) usually go for the newest models.
They don't know what they are missing... well sometimes.
Might send you a few more for quanting/re-quanting; got some hardware limits here;
Oddly... 70B models at IQ1_S - a number work well; with enhancements, very well.
I think I know how to address the "older" quants via slightly modified enhancements. Hmm.
Added:
The three quants you did - top shelf. Excellent.
The two re-quants - incredible difference in quality vs the "older quants".
Maybe related, but we recently kicked out static IQ3* quants, because measurements indicated that they were absolutely abysmal and pointless (much worse than Q2_K), while the imatrix ones performed as expected. Recently nicoboss made more measurements of Qwen2.5, and there the effect does not seem to exist.
It seems performance can be very model specific.
Also from personal experience, there are some models where the IQ1 quants work almost perfectly well, at least for creative work, where truth doesn't matter.
As for MoE models, the MoE bug was pretty disastrous for the resulting model quality. I would expect that this is not the case for non-MoE models.
Yes... the MoE "bug" => model was DOA.
There is "barely functional" (can work with that), then "broken"... the MoEs were worse than broken.
Considering how powerful MoEs are... this is a sad stain on them.
Frankly I had forgotten how powerful these were / are - with a new imatrix + new quant => these perform at or above the newest "arch" levels.
So far, across the board (doesn't matter what arch):
New quant with a new imatrix (that is, a new .dat file) (current llama.cpp release)
=> IQ1s work, in some cases very well (relative to other quants); IQ2s -> could be used for normal everyday usage.
Prior to this, IQ1s would not function at all, or barely; IQ2s would work "okayish".
In fact (at the time) I would not even quant them, nor recommend them.
Based on what I am seeing - that is going to change.
Likewise for Q2_K (non-imatrix) - a new Q2_K (non-imatrix) => jump in performance.
The increase in instruction following is particularly powerful, especially at low BPW.
That change / increase in performance is a game changer all by itself.
In terms of generation -> increased instruction following results in better generation.
Likewise there is an uptick in detail, nuance, and "understanding" in the output too.
Also noticed an increase in general model stability, which mirrors the performance gains at IQ1/IQ2.
With that being said, there are performance gains across all quants.
Based on a quick look at llama.cpp, it seems there were significant changes to both imatrix and regular quants in the past 2 months.
This is based on the "history of changes" for IMATRIX.EXE and QUANTIZE.EXE on the llama.cpp GitHub.
NOTE: Currently have questions about older "Mistral" (7Bs) - they are up next for analysis (need new quants and NEW imatrix .dat files).
However, at 95-98 t/s (IQ1s) / 90 t/s (IQ2s) - they are freakishly fast, with 32k context.
(They work with my new optimizations at this BPW level; without these opts (IQ1s) => they implode // at IQ2s they barely work, but with my opts they operate very well.)
Draft speed on crack.
I have a feeling that with a new imatrix .dat + new quant => performance will only get better here.
RE: non-imatrix IQ3s - got a few to "work"... barely. Yes, awful would be the word here.
ADDED:
RE: Oddly... 70B models at IQ1_S - a number work well; with enhancements, very well.
My initial survey (of 29 70B models at IQ1_S) was as follows:
25% worked well, 50% worked okay to barely passable, and 25% were broken (but MAY work).
I have these in a spreadsheet with notes.
These were NOT tested with my optimizations.
Going to retest these models (and other 70Bs) with optimizations.
I have tested two 70B models so far with the optimizations
-> excellent results so far; but more tuning still to go (there is a parameter component).