coder3101 heretic variants of QAT Gemma4: 31B, 26B-A4B, 12B, E4B, E2B

#2509
by thirteenbit - opened

For proper operation, these models require Q4_0 quantization only. Any other option will degrade quality. Please pay attention to this: people do not understand that a QAT model is trained to work only in 4-bit, and 6-bit layers will not make the quality better; they will make it worse.

@Theory-of-mind : Are you sure?

The linked models are BF16 versions of 4-bit QAT-trained Gemma 4 models.
Heretic abliteration does not change this significantly AFAIK.

Based on this explanation of QAT, and as we are quantizing the BF16 weights rather than re-quantizing an already 4-bit model, higher precision should still work.

I don't see how moving to 6-bit would actively degrade quality compared to 4-bit.

In the worst-case scenario, 6-bit quality would be similar to 4-bit, meaning we'd just be wasting storage and VRAM.

@thirteenbit I am not an ML engineer and do not have 100% accurate information, but I do know the following.

  1. Unsloth (who loves to improve quants) quantized the QAT model using Q4_0.
  2. Just yesterday, ReadyArt tried using a regular LoRA with a QAT model and quantized it to Q4_K_M (a mixture of 4-bit and 6-bit layers). The model was completely broken. He requantized it to Q4_0, which improved things (but it still didn't work, since a LoRA needs to be trained with a QAT model from the start).
    Even if this was a measurement error, 6-bit layers won't give you anything at best, but there's reason to believe they could even worsen the overall quality of the model.

QAT4 means that the model was trained to work in a 4-bit noise environment.
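To make the "4-bit noise environment" idea concrete, here is a minimal NumPy sketch of fake quantization, the kind of step QAT applies to weights in the forward pass. The grid (integers in [-7, 7] with one scale per block of 32) and the name `fake_quant_int4` are assumptions for illustration; I don't know the exact scheme used for Gemma 4.

```python
import numpy as np

def fake_quant_int4(w, block=32):
    """Snap weights to a symmetric 4-bit grid per block and return floats."""
    w = np.asarray(w, dtype=np.float32).reshape(-1, block)
    amax = np.abs(w).max(axis=1, keepdims=True)
    # Assumed grid: integers in [-7, 7] times one scale per block.
    scale = np.where(amax > 0, amax / 7.0, 1.0)
    q = np.clip(np.round(w / scale), -7, 7)
    return (q * scale).astype(np.float32).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=64).astype(np.float32)
w_q = fake_quant_int4(w)

# The model trains against w_q, so it learns to tolerate this rounding noise.
noise = w_q - w
print("max rounding noise:", float(np.abs(noise).max()))
```

Weights that already sit on the grid pass through unchanged, which is why the BF16 release of a QAT model can in principle be stored in 4 bits with little loss, provided the quantizer uses the same grid.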

P.S. Regarding abliteration, I am also interested in how it fits in with the QAT model.

I see, there's https://unsloth.ai/docs/models/gemma-4/qat#qat-analysis published.

According to this:

The main issue is converting from QAT BF16 to llama.cpp's Q4_0 format is not lossless.
llama.cpp uses F16 scales, whilst QAT BF16 uses BF16 scales, and the scales are not determined optimally in llama.cpp land.

So the problem is not the quantization bpw, but llama-quantize using FP16 instead of FP32 or BF16 somewhere.
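A rough sketch of why the conversion may not be lossless, assuming the QAT weights sit on a symmetric grid of integers in [-7, 7] (my assumption, not something stated in the Unsloth doc). The function `q4_0_roundtrip` follows my understanding of llama.cpp's reference Q4_0 (block of 32, scale = signed max / -8, scale stored as F16); treat it as an approximation rather than the actual implementation.

```python
import numpy as np

QK = 32  # Q4_0 block size

def q4_0_roundtrip(x):
    """Quantize then dequantize in the style of reference Q4_0 (approximation)."""
    x = np.asarray(x, dtype=np.float32).reshape(-1, QK)
    idx = np.abs(x).argmax(axis=1)
    vmax = x[np.arange(len(x)), idx]        # signed value with largest magnitude
    d = (vmax / -8.0).astype(np.float32)    # scale chosen so vmax maps to -8
    safe = np.where(d != 0, d, 1.0)
    inv = np.where(d != 0, 1.0 / safe, 0.0).astype(np.float32)
    q = np.minimum(15, np.trunc(x * inv[:, None] + 8.5)).astype(np.int32)
    d16 = d.astype(np.float16).astype(np.float32)  # scale is stored as F16
    return ((q - 8) * d16[:, None]).reshape(-1)

# Hypothetical QAT block: integers in [-7, 7] times a scale of 0.01.
k = np.arange(QK) % 15 - 7
w_qat = (k * 0.01).astype(np.float32)

w_back = q4_0_roundtrip(w_qat)
max_err = float(np.abs(w_back - w_qat).max())
print("max round-trip error:", max_err)
```

Under these assumptions the error comes mostly from the grid mismatch (step of max/8 instead of max/7), with the F16 rounding of the scale adding a smaller amount on top.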

Hopefully the Unsloth team will propose improvements to the llama.cpp team.

Re: abliteration, since it aims for the smallest possible change to the base model, it's probably not a problem for QAT: most of the weights should remain the same.
A more interesting question is whether abliteration was run with CoT on or in instruct-only mode. I've seen somebody somewhere forget to run a smaller Gemma 4 heretic with thinking on; the result was an abliterated instruct mode and refusals when thinking. That's probably only a problem with small models that default to instruct mode until told otherwise.

Edit: found it, here, v1.1 thinking mode fix: https://huggingface.co/igorls/gemma-4-12B-it-qat-q4_0-unquantized-heretic

I found that Unsloth's Q4_0 has higher accuracy than llama.cpp's Q8_0. On the other hand, llama.cpp's Q4_0 is much worse than llama.cpp's Q8_0. Gemma 4's QAT quantization seems to be really tricky.

I found that Unsloth's Q4_0 has higher accuracy than llama.cpp's Q8_0

This is specifically for Gemma 4 QAT?
And for all 5 variants (E2B, E4B, 12B, 26B A4B, 31B)?

I ask because, looking at Unsloth's data, the Gemma 4 12B and 26B results look strangely low, even after Unsloth's fixes. There are also some comments about this here:

https://www.reddit.com/r/unsloth/comments/1txqnyq/gemma4_qat_unsloth_accuracy_recovery_for_ggufs/

If I understand correctly, these are different model architectures, so it's probably harder to QAT-train the MoE (26B A4B), and the 12B is a new architecture (no encoders, and maybe something else that makes training harder).

I can only say that in my experience, the 26B-A4B QAT4 is significantly better than the Q6_K. The QAT4 solved the puzzle where the Q6_K failed. In RP, QAT also performs better - it follows the system prompt more accurately, and there are fewer errors.

This is specifically for Gemma 4 QAT?
And for all 5 variants (E2B, E4B, 12B, 26B A4B, 31B)?

Yes, this is for Gemma 4 QAT, but I only tested the E2B variant. I assumed the same would hold for the others, but looking at the graph in your link, that may not be the case for the 12B and 26B.

Looks like everything depends on the task.

Somebody tried using Gemma 4 26B A4B to draw an SVG chessboard, and it looks like QAT is worse than non-QAT Q4_K_XL:

https://www.reddit.com/r/LocalLLaMA/comments/1tzib7d/qat_variant_of_gemma4_26b_a4b_is_not_working_well/

well uh, let me know guys if I should queue it or not, because there's a lot of controversy as I see...

if I should queue it or not, because there's a lot of controversy as I see...

The only question is whether Unsloth's comment about llama-quantize losing precision somewhere will lead to changes that require re-quantization.

I found no discussions, issues or PRs on llama.cpp's GitHub related to this part of Unsloth's Gemma 4 QAT analysis:

The main issue is converting from QAT BF16 to llama.cpp's Q4_0 format is not lossless.
llama.cpp uses F16 scales, whilst QAT BF16 uses BF16 scales, and the scales are not determined optimally in llama.cpp land.

And I see that most of the models are already done from requests in other discussions, thank you!

As per aifeifei,
https://huggingface.co/mradermacher/model_requests/discussions/2522#6a25f23f4a7c38997867deaf

The latest llama.cpp should have fixed it. But I'll wait a few days and then update llama.cpp. Some models might be picked up by others and queued, but I will most probably requant them myself later anyway to make sure the quality is good.

There are no fixes in llama.cpp for this, and it isn't a bug to begin with. llama-quantize has no awareness of QAT alignment.

In my testing, Q4_0 without an imatrix significantly degrades accuracy because it ignores QAT alignment entirely. Q4_0 with an imatrix, however, ends up aligning correctly as a side effect of how it adjusts errors based on the imatrix. In fact, Q4_0 with an imatrix achieved higher accuracy than Q8_0 without an imatrix.
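One way to picture that side effect is a scale search that minimizes importance-weighted error instead of fixing the scale at max / -8. The sketch below is illustrative only: `weighted_err`, `search_scale`, the candidate divisors and the `importance` vector (a stand-in for imatrix data) are all hypothetical, and the block reuses an assumed QAT grid of integers in [-7, 7]. It is not llama.cpp's actual code.

```python
import numpy as np

QK = 32

def weighted_err(x, d, importance):
    """Importance-weighted squared error of a 4-bit round trip with scale d."""
    q = np.clip(np.round(x / d), -8, 7)
    return float(np.sum(importance * (x - q * d) ** 2))

def search_scale(x, importance, divisors=np.linspace(6.5, 8.5, 41)):
    """Try several scales (signed max / -div) and keep the lowest-error one."""
    vmax = x[np.abs(x).argmax()]  # assumes the block is not all zeros
    best = min(divisors, key=lambda div: weighted_err(x, vmax / -div, importance))
    return float(best), weighted_err(x, vmax / -best, importance)

# Hypothetical QAT block: integers in [-7, 7] times a scale of 0.01.
k = np.arange(QK) % 15 - 7
x = (k * 0.01).astype(np.float32)
importance = np.linspace(0.5, 2.0, QK)  # stand-in for per-weight imatrix values

vmax = x[np.abs(x).argmax()]
base_err = weighted_err(x, vmax / -8.0, importance)  # fixed max / -8 scale
best_div, best_err = search_scale(x, importance)
print("fixed scale error:", base_err)
print("searched divisor:", best_div, "error:", best_err)
```

In this toy case the search lands on the divisor that matches the assumed training grid, so the error nearly vanishes. Whether the same happens on the real models depends on how their QAT grid was actually defined.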

As a result, i1-Q4_0 is unaffected by the llama.cpp issue Unsloth mentioned. You're safe to proceed with quantization as-is.

By the way, the mmproj looks like it is aligned for 8-bit.
