[solved] The Llama3 and GGUF saga. [transfered-discussion]

#10
by Lewdiculous - opened
AetherArchitectural org

Transferring FantasiaFoundry/GGUF-Quantization-Script/discussions/29 to this repo.

SolidSnacke
3 days ago

Found this post on Reddit. Looks like we have some other problems.
https://www.reddit.com/r/LocalLLaMA/comments/1cltac3/part3_cause_to_issue_found_possible_bug_llama3/


SolidSnacke
3 days ago

Judging by what I read, there are some problems with the regular expression; it needs to be replaced as described in this comment: https://github.com/ggerganov/llama.cpp/issues/7062#issuecomment-2097033871


FantasiaFoundry
Owner
2 days ago

@Lewdiculous

"To live is to suffer..."


Lewdiculous
2 days ago

@ABX-AI @Virt-io @Nitral-AI

We can't catch a break. Eventually I'll redo and re-upload the quants, but I'll let the dust settle first.


Virt-io
2 days ago

I just wish they'd find a way to fix this and port it to convert.py.


Lewdiculous
2 days ago

I know, right... Honestly, I'm just happy quants will be even better then. At least I'll try to look at the positive side.


SolidSnacke
2 days ago

Personally, I'll probably take a short break from all this and see what changes in a couple of days. It would certainly be nice if they finally solved these problems, because it's unpleasant to realize that at some point I started unknowingly producing slightly broken models.


SolidSnacke
about 20 hours ago

@Lewdiculous The topic was recently closed. One of the users in that thread suggested this solution to the problem of data loss.
https://github.com/ggerganov/llama.cpp/issues/7062#issuecomment-2102007944
Although I could certainly have mixed something up; if so, I apologize in advance.


SolidSnacke
about 16 hours ago

@Lewdiculous I've probably already pissed you off, but I can't stop following the changes. Here's what they did:
https://github.com/ggerganov/llama.cpp/releases/tag/b2831


Lewdiculous
about 1 hour ago

Not at all, thanks for keeping me posted on this. I'll look into it later and see how it goes.

Lewdiculous changed discussion title from The Llama3 and GGUF saga continues. to The Llama3 and GGUF saga continues. [transfered-discussion]
AetherArchitectural org

Things should be good to go as is.

So now we just have to wait for koboldcpp to update?

AetherArchitectural org

I believe the issue was only related to the conversion process; inference should work as expected.

I wouldn't be surprised if we have to slightly rework the model conversion method, as was suggested in one of the comments on the issue I linked above.

So, if I understand correctly, all old Llama3 GGUFs should now be reconverted to work as intended?

Most likely, if the source format was bf16, because converting it incorrectly to f16 causes data loss.

I'll say right away that I'm only repeating what I read on GitHub, but it looks like I'll have to make an f32 model first and then convert it to f16, in order to avoid, or at least reduce, data loss. I'm not entirely sure about this, though.

I also wanted to ask about this. How important is this change?
https://github.com/ggerganov/llama.cpp/releases/tag/b2839

I don't think that affects us roleplayers.
That is for people who use grammar as a way to constrain LLM outputs.
Also, that seems like a runtime bug, not a GGUF bug.

AetherArchitectural org
edited May 10

Having to perform two conversions before the quant step will be properly annoying. Realistically they will eventually fix this, surely? They have to, right? Please? Well, even if the solution is just enforcing the two-step conversion anyway...

@Lewdiculous I'd like to try doing 2 conversions, just for experimentation. Can you explain how to rewrite your script for this?

AetherArchitectural org
edited May 11

@SolidSnacke - Ehh, I'd say it's best to wait for the situation to settle, but you'd need to use the convert-hf script to generate the F32, then feed that to quantize.exe to get the F16 as usual, and from there proceed as usual with the imatrix generation and further quants. Checks can be done for both the F32 and F16 before continuing, to skip those steps if the files are already present... But I don't look forward to covering that; you can try your hand at it though.
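A rough sketch of that two-step flow, assuming llama.cpp's convert-hf-to-gguf.py and the quantize executable are available; the model folder and file names below are placeholders, not the actual repo script:

```python
# Hedged sketch of the two-step conversion (HF -> F32 GGUF -> F16 GGUF),
# with the existence checks mentioned above so completed steps are skipped.
# The model folder and file names are placeholders.
import os
import subprocess

model_dir = "Example-Llama-3-8B"          # hypothetical HF model folder
f32_gguf = f"{model_dir}-F32.gguf"
f16_gguf = f"{model_dir}-F16.gguf"

# Step 1: HF safetensors -> F32 GGUF (skip if already present).
if not os.path.exists(f32_gguf):
    subprocess.run(["python", "convert-hf-to-gguf.py", model_dir,
                    "--outtype", "f32", "--outfile", f32_gguf], check=True)

# Step 2: F32 GGUF -> F16 GGUF via the quantize executable (skip if already present).
if not os.path.exists(f16_gguf):
    subprocess.run(["./quantize", f32_gguf, f16_gguf, "F16"], check=True)

# From here: imatrix generation on the F16, then the usual lower quants.
```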

Understood, thanks.

AetherArchitectural org
edited May 11

It's annoying to handle, but you'd be covering for the possibility that they never incorporate it officially with a flag or something. This whole situation is pretty tiresome.

AetherArchitectural org

I'm a bit lost, @SolidSnacke was this what we were waiting for?

https://github.com/ggerganov/llama.cpp/pull/7158

@Lewdiculous I'm a little confused, is it possible to convert the bf16 model directly to a bf16 GGUF?
Because the 'testing' paragraph says the checksums are the same in both cases.

AetherArchitectural org
edited May 11

That's what I was coping for, converting the HF bf16 directly to the GGUF F16. @Virt-io you're smarter - stronger, faster, taller... - can you tell if I'm just hoping for too much?

Not 100% sure, but this is what I gathered.


@SolidSnacke
Yes, this allows for the direct conversion of bf16 safetensors to a bf16 GGUF.
However, we still don't have CUDA inference for bf16.


@Lewdiculous
Wasn't direct to FP16 already possible?
This was added to stop BF16 -> FP16.
The idea is to keep the model in BF16, and quantize from BF16.


From the PR

This means bf16 tensor data made with convert-hf-to-gguf.py should match exactly what ./quantize produces from f32 models.
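For illustration, a hedged example of that direct conversion, assuming the --outtype bf16 option added by the linked PR; the folder and output names are placeholders:

```python
# Hedged sketch: HF safetensors straight to a BF16 GGUF, assuming the
# --outtype bf16 option from the PR linked above; paths are placeholders.
import subprocess

subprocess.run(["python", "convert-hf-to-gguf.py", "Example-Llama-3-8B",
                "--outtype", "bf16",
                "--outfile", "Example-Llama-3-8B-BF16.gguf"], check=True)
```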

I'm trying to test this now. I ran the script three times; twice my computer gave me a blue screen of death. The third time everything seems to be fine. I'll write later about what I ended up with.

When I try to create imatrix.dat from the bf16.gguf it prints this line, and after it the file creation is aborted:
GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:1277: to_fp32_cuda != nullptr

And how do I run this on the CPU?

AetherArchitectural org
edited May 11

@SolidSnacke Does changing -ngl to 0 work?

I'll try it now

Same error. I'll try to remove the -ngl line.

Remove the flash attention line as well

Do you mean -f?

AetherArchitectural org
edited May 11

Wasn't direct to FP16 already possible?

@Virt-io
It was possible but supposedly lossy. So BF16 → FP32 → FP16 seems to be the lossless path; I was hoping this would save us from the horrible fate of writing 32GB more to disk with each model (8B) conversion.

-f is for file

-fa is flash attention

@Lewdiculous

It was possible but supposedly lossy. So BF16 → FP32 → FP16 seems to be the lossless path; I was hoping this would save us from the horrible fate of writing 32GB more to disk with each model conversion.

It will when inference works on bf16.

I mean, technically you could convert:

safetensors -> bf16 GGUF

bf16 GGUF -> fp16 GGUF

run imatrix on the fp16

quantize from the bf16 GGUF.

That should be fine, since you could even run imatrix on a Q8_0 and be fine.

AetherArchitectural org

Soon™

Of course. Of course.

-fa is not in the script. And I tried removing the line with -ngl, same error.

AetherArchitectural org
edited May 11

@SolidSnacke For testing, I think you should try using the llama.cpp executables directly for each step, outside of the script; they are in the bin folder.

Try --no-kv-offload
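For reference, a hedged sketch of a CPU-only imatrix run using the flags suggested above; the executable path, model file, and calibration file are placeholders:

```python
# Hedged sketch of forcing imatrix generation onto the CPU: no GPU layer offload
# and no KV-cache offload. All paths and file names here are placeholders.
import subprocess

subprocess.run([
    "./imatrix",                         # llama.cpp imatrix executable (imatrix.exe on Windows)
    "-m", "Example-Llama-3-8B-F16.gguf", # hypothetical GGUF being profiled
    "-f", "imatrix-calibration.txt",     # hypothetical calibration text file
    "-o", "imatrix.dat",
    "-ngl", "0",                         # offload zero layers to the GPU
    "--no-kv-offload",                   # keep the KV cache on the CPU as well
], check=True)
```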

Why are you trying to do this exactly?

I'll try

I'm just experimenting.

AetherArchitectural org
edited May 11

Why are you trying to do this exactly?

Adjusting the conversion script to produce (finally) lossless GGUFs of Llama 3s

Wait, I thought?!

You can still use an old imatrix file.

This means bf16 tensor data made with convert-hf-to-gguf.py should match exactly what ./quantize produces from f32 models.

This was the main win: quantizations from a BF16 GGUF should be identical to quantizations from an F32 GGUF.

Unless there was another tokenizer fix? I haven't been keeping up.

AetherArchitectural org
edited May 11

But that means we're still stuck with BF16 → F32 → F16 until there's GPU inference for BF16 GGUFs.

Honestly, there has been a lot going on recently and I'm lost, but it seems we're still doomed, for now.

I don't think the losses from BF16 GGUF -> F16 are that bad,
since you will only be using the F16 for generating the imatrix .dat
and quantizing from the BF16 GGUF.

Damn, we should just wait for everything.
I just realized that this would mean pretty much the same disk usage as just quanting to F32.
:|

AetherArchitectural org

Well then. Until now the process was:
HF BF16 → F16 GGUF → imatrix from the F16 GGUF → ./quantize from the F16 GGUF into the other lower quants.

Since quantization is an entirely CPU process, then yes, we can just use the BF16 with ./quantize.

So we're working with both, for different steps...

AetherArchitectural org

It is a predicament. Considering there haven't been any glaring issues with the quants, I'm taking a wait-and-see stance. BF16 GPU inference would solve all this.

On the bright side, they did implement lazy mode on convert-hf.py

AetherArchitectural org
edited May 11


@SolidSnacke Can you test run this mess?

AetherArchitectural org

On the bright side, they did implement lazy mode on convert-hf.py

Using a bit more RAM wasn't the issue KEKW

It was for me. cries

@Lewdiculous Now I'll try, since I can't run imatrix.exe on the CPU alone; there's stupidly no option for it, or I'm just blind.

If everything works out, I’ll send the corrected version of the script, and you’ll see what horror is written there.

I'm an idiot, I downloaded the wrong model. I'm sorry, you'll have to wait some more time.

AetherArchitectural org
edited May 11

No rush, and don't worry about horrors; it's already bad enough that it can't get worse. As long as it works, it's more than fine.

🤝

Result: I was at least able to create imatrix.dat from f16.gguf and quantized models from bf16.gguf. But I haven't tested them yet.

@Lewdiculous I sent a draft version of the script to the repository. Take a look when you have time.

I checked... at least the q5_k_m model produces good first messages, in my opinion. I’ll probably now upload the models, imatrix.dat and imatrix.txt, to my repository, so that if anyone has time, they can look at everything for themselves.

AetherArchitectural org
edited May 12

Thanks a lot for the testing and improvements so far, I know these experiments take time.

Instead of replacing the existing script I think I'll have it as a new one with lossless in the name to differentiate it, just in case.

Honestly, I think the differences from what we had to the "proper" way are already so small that it might just be hard to tell at this point, at least for my use cases.

Yeah, it's better to keep them separate so you don't have to do too many steps to roll back changes.

AetherArchitectural org
edited May 12

@Virt-io - It's not so bad. Well, it does write a whole bunch extra to disk, but my RAM utilization with some layers offloaded is about 21GB/32GB, and you know what they (Microsoft) say: unused RAM is wasted RAM!

The poor disks. They cry. Geez, that's CPU intensive now!

@Lewdiculous So, uh, as a guy who doesn't even know how to quantize his own models anymore (thanks, Obama), is it easier to work with bf16 or fp16? I can produce both, as long as I start out knowing which one to use.

AetherArchitectural org
edited May 13

@jeiku
Keeping things as they are now is fine. Others also don't have to and probably won't change their formats just for that, so let's leave just one semi-universal quantization script as it is now; for us hobbyists I think it's alright.

Quanters will just need to be more careful. We have it too easy anyways, amirite? KEKW

At least things are always exciting. Surely...


As of right now the process generates one FP16 GGUF and one BF16 GGUF from the HF model; the F16 is used for the imatrix data generation, and the BF16 is used for the rest of the quantization process itself.

For imatrix quants we use both anyway, and the model always needs to be converted; it's not a big deal in the grand scheme of things, and eventually llama.cpp should solve this by supporting BF16 for GPU inference.

Other than the additional wear, it's not much more painful to do - not enough to warrant changes to the process - and besides, I'd rather you focus on making the best datasets and most unhinged models you can - ones that format perfectly and never misplace asterisks or quotations (aye, aye, jk!) - and not worry about this. At least for now. If things change we'll complain later.


With that said, I'd prefer BF16 (for the final finished models) to have the smallest loss possible when going into the imatrix generation. If anyone is seeing this...
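A minimal sketch of the flow described above, assuming llama.cpp's convert-hf-to-gguf.py plus the imatrix and quantize executables; the model folder, calibration file, and quant list are placeholders rather than the actual repo script:

```python
# Hedged sketch of the current flow: one F16 GGUF for imatrix generation and one
# BF16 GGUF for the actual quants. Tool paths, file names, the calibration text,
# and the quant list are all placeholders.
import subprocess

model_dir = "Example-Llama-3-8B"                 # hypothetical HF model folder
f16_gguf = f"{model_dir}-F16.gguf"
bf16_gguf = f"{model_dir}-BF16.gguf"

# 1) Two conversions from the HF model.
for outtype, outfile in (("f16", f16_gguf), ("bf16", bf16_gguf)):
    subprocess.run(["python", "convert-hf-to-gguf.py", model_dir,
                    "--outtype", outtype, "--outfile", outfile], check=True)

# 2) Imatrix data from the F16 (GPU inference for BF16 GGUFs wasn't available at the time).
subprocess.run(["./imatrix", "-m", f16_gguf,
                "-f", "imatrix-calibration.txt", "-o", "imatrix.dat"], check=True)

# 3) Lower quants from the BF16, using the imatrix data.
for quant in ("Q4_K_M", "Q5_K_M", "Q8_0"):
    subprocess.run(["./quantize", "--imatrix", "imatrix.dat",
                    bf16_gguf, f"{model_dir}-{quant}-imat.gguf", quant], check=True)
```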

Aw damn, I just did this one in FP16: https://huggingface.co/ChaoticNeutrals/Puppy_Purpose_0.69 and was hoping you could do the quant for me. I quanted using the convert.py method just to make sure it could form sentences (and because, even though I know it's flawed, I've never had a problem with the outputs), and this one has passed my testing. I checked briefly for quote/asterisk formatting, but it seems to fail that, as it puts asterisks inside quotes. With some time I could breed that out, but it wasn't my focus for this model. Next time though!

AetherArchitectural org
edited May 13

Oh, it's okay, it will be done too; it can just be a bit more lossy for the imatrix generation, but that's not a big deal. Like Virt said, you can even use a Q8 for the imatrix; I just prefer using the full model for a less guilty conscience - self-induced.

I'll ping you when it's uploading.

thanks bruv

The name kills me, my man. Submit it to Chai for the memes, I wanna see how it performs, please. @jeiku


just submitted

Lewdiculous changed discussion title from The Llama3 and GGUF saga continues. [transfered-discussion] to [solved] The Llama3 and GGUF saga. [transfered-discussion]
AetherArchitectural org

This will be marked as closed now, since the issue seems to be dealt with for good.

🤞

Lewdiculous changed discussion status to closed
