[solved] The Llama3 and GGUF saga. [transferred-discussion]
Transferring FantasiaFoundry/GGUF-Quantization-Script/discussions/29 to this repo.
SolidSnacke
3 days ago
Found this post on Reddit. Looks like we have some other problems.
https://www.reddit.com/r/LocalLLaMA/comments/1cltac3/part3_cause_to_issue_found_possible_bug_llama3/
SolidSnacke
3 days ago
Judging by what I read, there are some problems with the regular expression; it needs to be replaced in this line: https://github.com/ggerganov/llama.cpp/issues/7062#issuecomment-2097033871
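For context on why that regex matters: the pre-tokenizer pattern decides where text gets split before the BPE merges run, so a wrong pattern silently changes token boundaries. A simplified stand-in pattern (not the actual one from llama.cpp, just for illustration, and it assumes the third-party `regex` module is installed):

```python
# Simplified stand-in for a BPE pre-tokenizer split pattern. This is NOT the
# actual regex from llama.cpp - it only shows how the pattern decides where
# text gets split before the BPE merges are applied.
import regex  # third-party 'regex' module, needed for \p{...} classes

pretok = regex.compile(r"\p{L}+|\p{N}+|[^\s\p{L}\p{N}]+|\s+")
text = "Let's test: 123 tokens!"
print(pretok.findall(text))
# ['Let', "'", 's', ' ', 'test', ':', ' ', '123', ' ', 'tokens', '!']
```

Swap in a different pattern and the same text splits differently, which is why a mismatched regex produces subtly different tokenization.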
FantasiaFoundry
Owner
2 days ago
"To live is to suffer..."
Lewdiculous
2 days ago
We can't catch a break. Eventually I'll redo and re-upload the quants, but I'll let the dust settle first.
Virt-io
2 days ago
I just wish they found a way to fix this and port it to convert.py
Lewdiculous
2 days ago
I know right... Honestly I'm just happy quants will be even better then. At least I'll try to look at the positive side.
SolidSnacke
2 days ago
Personally, I'll probably take a short break from all this and see what changes in a couple of days. It would certainly be nice if they finally solved these problems, because it's unpleasant to realize that at some point I started unknowingly creating slightly broken models.
SolidSnacke
about 20 hours ago
@Lewdiculous
The issue was recently closed. One of the users in that thread suggested this solution to the data-loss problem.
https://github.com/ggerganov/llama.cpp/issues/7062#issuecomment-2102007944
Although I could certainly have mixed something up; if so, I apologize in advance.
SolidSnacke
about 16 hours ago
@Lewdiculous
I've probably already pissed you off, but I can't stop following the changes. Here's what they did.
https://github.com/ggerganov/llama.cpp/releases/tag/b2831
Lewdiculous
about 1 hour ago
Not at all, thanks for keeping me posted on this. Will look into it later and see how it goes.
Things should be good to go as is.
So now we just have to wait for koboldcpp to update?
I believe the issue was only related to the conversion process; inference should work as expected.
I wouldn't be surprised if we have to slightly redo the method of converting the model, as was suggested in one of the comments in the issue I linked above.
So, if I understand correctly, all the old Llama3 GGUFs should now be reconverted to work as intended?
Most likely, if the source format was bf16, because an incorrect conversion to f16 causes data loss.
I'll say right away that I'm only repeating what I read on GitHub, but it looks like you have to first make an f32 model and then convert it to f16 in order to avoid, or at least reduce, the data loss. But I'm not entirely sure about this.
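Just to show the arithmetic behind the worry, here's my own little numpy sketch (not the actual conversion code from llama.cpp, and it assumes the `ml_dtypes` package for a bfloat16 dtype):

```python
# Rough illustration of why a careless bf16 -> fp16 conversion can lose data:
# bf16 keeps the full fp32 exponent range, while fp16 tops out around 65504,
# so larger values overflow to inf. Not the actual llama.cpp conversion code.
import numpy as np
from ml_dtypes import bfloat16  # numpy-compatible bfloat16 dtype

weights = np.array([1.0e-3, 3.14159, 7.0e4, 1.0e38], dtype=np.float32)

print("bf16:", weights.astype(bfloat16))    # all values still finite
print("fp16:", weights.astype(np.float16))  # 7.0e4 and 1.0e38 become inf
```

Whether real Llama3 weights actually reach that range is another question; I'm only illustrating why people are being careful with the conversion path.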
@Lewdiculous So, uh, as a guy who doesn't even know how to quantize his own models anymore (Thanks Obama), is it easier to work with bf16 or fp16? I can produce both as long as I start out knowing which one to use.
@jeiku
Keeping things as they are now is fine. Others don't have to and probably won't change their formats just for that, so let's leave the one semi-universal quantization script as it is now; for us hobbyists I think it's alright.
Quanters will just need to be more careful. We have it too easy anyways, amirite? KEKW
At least things are always exciting. Surely...
As of right now the process generates one FP16 GGUF and one BF16 GGUF from the HF model; the F16 is used for the imatrix data generation, and the BF16 is used for the rest of the quantization process itself.
For imatrix quants we use both anyways and it always needs to be converted, so it's not a big deal in the grand scheme of things, and eventually llama.cpp should solve this problem by supporting BF16 for GPU inference.
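Roughly, that flow looks something like this (script names and flags are based on recent llama.cpp builds plus my own placeholder paths, so treat it as a sketch rather than the exact commands):

```python
# Sketch of the two-conversion flow described above, assuming a recent llama.cpp
# checkout whose convert-hf-to-gguf.py supports --outtype bf16. The paths and
# calibration file are placeholders - adjust for your own setup.
import subprocess

HF_MODEL = "path/to/hf-model"   # placeholder: the HF model folder
LCPP = "path/to/llama.cpp"      # placeholder: llama.cpp checkout / build dir

def run(cmd):
    print(">", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Convert the HF model twice: BF16 for quantizing, F16 for imatrix generation.
run(["python", f"{LCPP}/convert-hf-to-gguf.py", HF_MODEL,
     "--outtype", "bf16", "--outfile", "model-bf16.gguf"])
run(["python", f"{LCPP}/convert-hf-to-gguf.py", HF_MODEL,
     "--outtype", "f16", "--outfile", "model-f16.gguf"])

# 2. Generate the importance matrix from the F16 GGUF.
run([f"{LCPP}/imatrix", "-m", "model-f16.gguf",
     "-f", "calibration.txt", "-o", "imatrix.dat"])

# 3. Quantize the BF16 GGUF using the imatrix data.
run([f"{LCPP}/quantize", "--imatrix", "imatrix.dat",
     "model-bf16.gguf", "model-Q4_K_M.gguf", "Q4_K_M"])
```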
Other than the additional wear, it's not much more painful to do, not enough to warrant changes to the process. Besides, I'd prefer that you focus on making the best datasets and unhinged models you can - ones that format perfectly and never misplace asterisks or quotations (aye, aye, jk!) - and not worry about this. At least for now. If things change we'll complain later.
With that said, I'd prefer the final finished models in BF16, so there's the smallest loss possible going into the imatrix generation. If anyone is seeing this...
Aw damn, I just did this one in FP16: https://huggingface.co/ChaoticNeutrals/Puppy_Purpose_0.69 and was hoping you could do the quant for me. I quanted using the convert.py method just to make sure it could form sentences (and because, even though I know it's flawed, I've never had a problem with the outputs), and this one has passed my testing. I checked briefly for quote/asterisk formatting, but it seems to fail that, as it puts asterisks inside quotes. With some time I could breed that out, but it wasn't my focus for this model. Next time though!
Oh, it's okay, it will be done too. It can just be more lossy for the imatrix generation, but that's not a big deal; like Virt said, you can even use a Q8 for the imatrix. I just prefer using the full model for less guilty conscience - self-induced.
I'll ping you when it's uploading.
thanks bruv
The name kills me, my man. Submit it to chai for the memes, I wanna see how it performs, please. @jeiku
This will be marked as closed now since the issue seems to be dealt with for good.
🤞