https://huggingface.co/mradermacher/gemma-4-31B-it-qat-q4_0-unquantized-uncensored-heretic-i1-GGUF

#2579
by swishgumbo - opened

Model: gemma-4-31B-it-qat-q4_0-unquantized-uncensored-heretic.i1-IQ3_XXS.gguf
https://huggingface.co/mradermacher/gemma-4-31B-it-qat-q4_0-unquantized-uncensored-heretic-i1-GGUF

Hi @mradermacher ,

I believe there may be an issue with the IQ3_XXS quant specifically.

Environment:

llama.cpp b9653
RTX 5070 Ti 16GB
Windows 11
Tested both with and without MTP
Tested in both llama.cpp and LM Studio

Issue:
The model consistently outputs leaked reasoning/control tokens such as:

<|channel>thought
<|channel>
<|channel>thought<|channel>

and frequently falls into repetition loops instead of generating a normal response.

Example prompt:

Hello.

Output:

[Start thinking]
<|channel>thought
<|channel>thought
...
(repeats indefinitely)
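For anyone comparing quants, the two symptoms above (leaked `<|channel>` markers and the same line repeating) can be checked mechanically on captured output. The following is only a minimal sketch: the function names, the regular expression, and the `max_repeats` threshold are assumptions made for illustration and are not part of llama.cpp or LM Studio.

```python
import re

# Marker text is copied from the report above; the pattern is an assumption
# and only covers the "<|channel>" and "<|channel>thought" forms shown there.
CHANNEL_MARKER = re.compile(r"<\|channel>(?:thought)?")


def find_leaked_markers(text):
    """Return every channel marker that appears in the generated text."""
    return CHANNEL_MARKER.findall(text)


def longest_repeated_run(text):
    """Return the length of the longest run of identical consecutive non-empty lines."""
    longest = 0
    run = 0
    previous = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines neither extend nor break a run
        if line == previous:
            run += 1
        else:
            run = 1
            previous = line
        longest = max(longest, run)
    return longest


def looks_broken(text, max_repeats=3):
    """Flag output that leaks markers or repeats a line max_repeats times (hypothetical threshold)."""
    return bool(find_leaked_markers(text)) or longest_repeated_run(text) >= max_repeats
```

Running `looks_broken` over the saved output of each quant for the same prompt gives a rough pass/fail signal; it does not replace reading the output.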

I can reproduce this with:

reasoning on
reasoning off
MTP enabled
MTP disabled
8k context
24k context

Most importantly, the issue occurs across multiple backends (llama.cpp and LM Studio).
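To retest systematically, the conditions listed above can be enumerated as a small matrix. This is a sketch only: it assumes the full cross product of the three settings is wanted (the report does not say every combination was run), that "8k" and "24k" mean 8192 and 24576 tokens, and the dictionary keys are hypothetical labels rather than real llama.cpp or LM Studio options.

```python
from itertools import product

# Values taken from the list above; the key names are illustrative labels only.
REASONING = ("on", "off")
MTP = ("enabled", "disabled")
CONTEXT_TOKENS = (8 * 1024, 24 * 1024)  # assumption: "8k" = 8192, "24k" = 24576


def repro_matrix():
    """Return one dict per combination of reasoning, MTP, and context size."""
    return [
        {"reasoning": reasoning, "mtp": mtp, "context_tokens": context}
        for reasoning, mtp, context in product(REASONING, MTP, CONTEXT_TOKENS)
    ]


if __name__ == "__main__":
    for config in repro_matrix():
        print(config)
```

Each entry would still have to be mapped by hand to the corresponding setting in whichever frontend is being tested.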

Comparison:
The IQ3_XS version of the same model works normally on the same system and with the same settings. Other Gemma 4 31B models (e.g. your Sphinsikus quant) also work correctly.

Because the issue is isolated to the IQ3_XXS quant and persists across different inference engines, I suspect there may be a problem with this specific quantization or GGUF conversion.

Thank you for your time, and for all the work you've put into creating and maintaining these quants.

Can you check the IQ2_* versions too, please? If the issue persists there, then the problem is probably that the q4_0 QAT training works with ~4-bit quants but breaks on lower quants, because the model doesn't know how to behave at that precision. Someone already noticed that IQ3_XXS was broken, but we both kind of forgot about the other quants.

I tested the IQ2_XXS and IQ2_M variants of the model to compare against the previously reported IQ3_XXS issue.

Environment: llama.cpp b9653, RTX 5070 Ti 16GB, Windows 11
Settings: MTP disabled, reasoning off

IQ2_XXS Issue:

The model exhibits complete generation instability.

Issue type:

Severe repetition loops / runaway repetition
Loss of sentence structure and instruction adherence

IQ2_M Issue:

The model exhibits token-level instability and collapse.

Issue type:

Token fragmentation
Multilingual subword noise / incoherent token mixing
Loss of grammatical and semantic structure

Comparison:
IQ3_XS: stable
IQ3_XXS: formatting/control token leakage + repetition instability
IQ2_XXS: full generative collapse (looping)
IQ2_M: token-space fragmentation / incoherent generation

[Attached screenshots: 2026-06-17 060924, 2026-06-17 060406]

So because the model is QAT, all quants that are not 4 bits or above will fail. Thank you for letting me know.
