AetherArchitectural/Community-Discussions · [solved] The Llama3 and GGUF saga. [transfered-discussion]

AetherArchitectural org May 10, 2024

Transferring FantasiaFoundry/GGUF-Quantization-Script/discussions/29 to this repo.

SolidSnacke
3 days ago

Found this post on Reddit. Looks like we have some other problems.
https://www.reddit.com/r/LocalLLaMA/comments/1cltac3/part3_cause_to_issue_found_possible_bug_llama3/

SolidSnacke
3 days ago

Judging by what I read, there are some problems with the regular expression, it is necessary to replace the regular expression in this line: https://github.com/ggerganov/llama.cpp/issues/7062#issuecomment-2097033871

FantasiaFoundry
Owner
2 days ago

@Lewdiculous

"To live is to suffer..."

Lewdiculous
2 days ago

@ABX-AI @Virt-io @Nitral-AI

We can't catch a break. Eventually will redo and re-upload quants but I'll let the dust settle.

Virt-io
2 days ago

I just wish they found a way to fix this and port it to convert.py

Lewdiculous
2 days ago

I know right... Honestly I'm just happy quants will be even better then. At least will try to look at the positive side.

SolidSnacke
2 days ago

Personally, I’ll probably take a short break from all this and see in a couple of days what changes. It would certainly be nice if they finally solved the problems with this, because it’s unpleasant to realize that at some point I began to involuntarily create slightly broken models.

SolidSnacke
about 20 hours ago

@Lewdiculous The topic was recently closed. One of the users in that thread suggested this solution to the problem of data loss.
https://github.com/ggerganov/llama.cpp/issues/7062#issuecomment-2102007944
Although I certainly could have mixed something up, if so I apologize in advance.

SolidSnacke
about 16 hours ago

@Lewdiculous I've probably already pissed you off, but I can't stop following the changes. That's what they did.
https://github.com/ggerganov/llama.cpp/releases/tag/b2831

Lewdiculous
about 1 hour ago

Not at all, thanks for keeping my posted on this. Will look into it later and see how it goes.

Lewdiculous changed discussion title from The Llama3 and GGUF saga continues. to The Llama3 and GGUF saga continues. [transfered-discussion] May 10, 2024

Lewdiculous

AetherArchitectural org May 10, 2024

Things should be good to go as is.

Hardeh

May 10, 2024

So now only to wait for koboldcpp to update?

Lewdiculous

AetherArchitectural org May 10, 2024

I believe the issue was only related to the conversion process, inference should be as expected.

SolidSnacke

May 10, 2024

I wouldn’t be surprised if we have to slightly redo the method of transforming the model, as it was suggested in one of the comments on this topic that I sent above.

Hardeh

May 10, 2024

So, if i understand correctly, all old GGUFs of llama3 should be reconverted now to work as intended?

SolidSnacke

May 10, 2024

Most likely, if the format was bf16, because if it is incorrectly converted to f16, data loss will occur.

SolidSnacke

May 10, 2024

I’ll say right away that I’m only writing what I read on github, but it looks like I’ll have to first make an f32 model, and then convert it to f16 in order to either avoid data loss or reduce losses. But I'm not entirely sure about this.

65 hidden messages

Expand all

jeiku

May 12, 2024

@Lewdiculous So uh as a guy who doesn't even know how to quantize his own models anymore (Thanks Obama,) is it easier to work with bf16 or fp16? I can produce both as long as I start out knowing which one to use.

Lewdiculous

AetherArchitectural org May 13, 2024

•

edited May 13, 2024

@jeiku
Keeping things as they are now is fine. Others also don't have to and probably won't change their formats just for that, so let's leave just one semi-universal quantization script as it is now, for us hobbyists I think it's alright.

Quanters will just need to be more careful. We have it too easy anyways, amirite? KEKW

At least things are always exciting. Surely...

As of right now the process generates one FP16-GGUF and one BF16-GGUF from the HF-Model, the F16 is used for the imatrix data generation, and the BF16 is used for the rest of the quantization process itself.

For Imatrix quants we use both anyways and it always needs to be converted, it's not a big deal in the grand scheme of things, and eventually llama.cpp should solve this problem by supporting BF16 for GPU inference.

Other than the additional wear, it's not much more painful to do, not enough to warrant changes to the process, and besides, I'd prefer that you focus on making the best datasets and unhinged models - that format perfectly and never misplace asterisks or quotations (aye, aye, jk!) - you can, and not worrying about this. At least for now. If things change we'll complain later.

With that said, I'd prefer BF16 (for the final finished models) to have the smallest loss possible when going into the imatrix generation. If anyone is seeing this...

jeiku

May 13, 2024

@jeiku
Keeping things as they are now is fine. Others also don't have to and probably won't change their formats just for that, so let's leave just one semi-universal quantization script as it is now, for us hobbyists I think it's alright.

Quanters will just need to be more careful. We have it too easy anyways, amirite? KEKW

At least things are always exciting. Surely...

As of right now the process generates one FP16-GGUF and one BF16-GGUF from the HF-Model, the F16 is used for the imatrix data generation, and the BF16 is used for the rest of the quantization process itself.

For Imatrix quants we use both anyways and it always needs to be converted, it's not a big deal in the grand scheme of things, and eventually llama.cpp should solve this problem by supporting BF16 for GPU inference.

Other than the additional wear, it's not much more painful to do, not enough to warrant changes to the process, and besides, I'd prefer that you focus on making the best datasets and unhinged models - that format perfectly and never misplace asterisks or quotations (aye, aye, jk!) - you can, and not worrying about this. At least for now. If things change we'll complain later.

With that said, I'd prefer BF16 (for the final finished models) to have the smallest loss possible when going into the imatrix generation. If anyone is seeing this...

Aw damn, I just did this one in FP16: https://huggingface.co/ChaoticNeutrals/Puppy_Purpose_0.69 and was hoping you could do the quant for me. I quanted using the convert.py method just to make sure it could form sentences (And because even though I know it is flawed, I've never had a problem with the outputs,) and this one has passed my testing. I checked briefly for quote/asterisk but it seems to fail that as it puts asterisks inside quotes. With some time I could breed that out, but it wasn't my focus for this model. Next time though!

Lewdiculous

AetherArchitectural org May 13, 2024

•

edited May 13, 2024

Oh, it's okay, it will be done too, can just be more lossy for the imatrix generation but not a big deal, like Virt said you can even use a Q8 for Imatrix, I just prefer using the full model for less guilty conscience - self induced.

I'll ping you when it's uploading.

jeiku

May 13, 2024

thanks bruv

Nitral-AI

May 13, 2024

•

edited May 13, 2024

The name kills me my man. Submit it to chai for the memes, i wanna see how it performs please. @jeiku

jeiku

May 13, 2024

The name kills me my man. Submit it to chai for the memes, i wanna see how it performs please. @jeiku

just submitted

Lewdiculous changed discussion title from The Llama3 and GGUF saga continues. [transfered-discussion] to [solved] The Llama3 and GGUF saga. [transfered-discussion] May 23, 2024

Lewdiculous

AetherArchitectural org Jun 2, 2024

This will be marked as closed now since the issue seems to be delt with for good.

🤞

Lewdiculous changed discussion status to closed Jun 2, 2024