https://huggingface.co/mradermacher/gemma-4-31B-it-qat-q4_0-unquantized-uncensored-heretic-i1-GGUF
Model: gemma-4-31B-it-qat-q4_0-unquantized-uncensored-heretic.i1-IQ3_XXS.gguf
Hi @mradermacher ,
I believe there may be an issue with the IQ3_XXS quant specifically.
Environment:
llama.cpp b9653
RTX 5070 Ti 16GB
Windows 11
Tested both with and without MTP
Tested in both llama.cpp and LM Studio
Issue:
The model consistently outputs leaked reasoning/control tokens such as:
<|channel>thought
<|channel>
<|channel>thought<|channel>
and frequently falls into repetition loops instead of generating a normal response.
Example prompt:
Hello.
Output:
[Start thinking]
<|channel>thought
<|channel>thought
...
(repeats indefinitely)
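For anyone who wants to check their own output for the same symptoms, here is a minimal sketch. The marker strings are copied from the output above; the function names and the repeat threshold of 3 are my own assumptions, not anything from llama.cpp or LM Studio.

```python
import re

# Marker strings as they appear in the leaked output reported above.
LEAK_PATTERN = re.compile(r"<\|channel>(?:thought)?")


def find_leaked_markers(text):
    """Return every leaked channel marker found in the generated text."""
    return LEAK_PATTERN.findall(text)


def has_repetition_loop(text, min_repeats=3):
    """Return True if the same non-empty line occurs min_repeats times in a row.

    min_repeats is an arbitrary threshold chosen for illustration.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    run = 1
    for prev, cur in zip(lines, lines[1:]):
        run = run + 1 if cur == prev else 1
        if run >= min_repeats:
            return True
    return False


sample = "[Start thinking]\n<|channel>thought\n<|channel>thought\n<|channel>thought\n"
print(find_leaked_markers(sample))
print(has_repetition_loop(sample))
```

A healthy response should give an empty marker list and no repetition loop.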
I can reproduce this with:
reasoning on
reasoning off
MTP enabled
MTP disabled
8k context
24k context
Most importantly, the issue occurs across multiple backends (llama.cpp and LM Studio).
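If it helps with retesting, the conditions above can be enumerated as a grid. This is only a sketch: the keys and values are descriptive labels I made up for bookkeeping, not real llama.cpp or LM Studio options, and running the full cross product is an assumption (the conditions above were listed individually).

```python
from itertools import product

# Descriptive labels only; none of these are actual command-line flags.
CONDITIONS = {
    "backend": ["llama.cpp", "LM Studio"],
    "reasoning": ["on", "off"],
    "mtp": ["enabled", "disabled"],
    "context": ["8k", "24k"],
}


def build_test_matrix(conditions):
    """Return one dict per combination of the listed conditions."""
    keys = list(conditions)
    return [dict(zip(keys, combo)) for combo in product(*conditions.values())]


matrix = build_test_matrix(CONDITIONS)
for case in matrix:
    print(case)
```

Each printed dict is one configuration to run the "Hello." prompt against.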
Comparison:
The IQ3_XS version of the same model works normally on the same system and with the same settings. Other Gemma 4 31B models (e.g. your Sphinsikus quant) also work correctly.
Because the issue is isolated to the IQ3_XXS quant and persists across different inference engines, I suspect there may be a problem with this specific quantization or GGUF conversion.
Thank you for your time, and for all the work you've put into creating and maintaining these quants.
Can you check the IQ2_* versions too, please? If the issue persists there, then the problem is probably that q4_0 QAT training works with ~4-bit quants but breaks on lower quants, because the model doesn't know how to behave at lower precision. Someone already noticed that IQ3_XXS was broken, but we kind of both forgot about the other quants.
I tested the IQ2_XXS and IQ2_M variants of the model to compare against the previously reported IQ3_XXS issue.
Environment: llama.cpp b9653, RTX 5070 Ti 16GB, Windows 11
MTP disabled, reasoning off
IQ2_XXS Issue:
The model exhibits complete generation instability.
Issue type:
Severe repetition loops / runaway repetition
Loss of sentence structure and instruction adherence
IQ2_M Issue:
The model exhibits token-level instability and collapse.
Issue type:
Token fragmentation
Multilingual subword noise / incoherent token mixing
Loss of grammatical and semantic structure
Comparison:
IQ3_XS: stable
IQ3_XXS: formatting/control token leakage + repetition instability
IQ2_XXS: full generative collapse (looping)
IQ2_M: token-space fragmentation / incoherent generation
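To make the pattern easier to see, here is the comparison as a small sketch. The statuses are the ones reported in this thread. The bits-per-weight figures are approximate nominal values for these quant types and are my own assumption, not something measured on these files.

```python
# quant name -> (approximate nominal bits per weight, reported behaviour)
# The bpw numbers are assumed nominal values, not measurements of these files.
RESULTS = {
    "IQ3_XS": (3.3, "stable"),
    "IQ3_XXS": (3.06, "control token leakage + repetition"),
    "IQ2_M": (2.7, "token fragmentation / incoherent output"),
    "IQ2_XXS": (2.06, "full collapse (looping)"),
}


def broken_quants(results):
    """Return the names of non-stable quants, from largest to smallest bpw."""
    broken = [name for name, (_, status) in results.items() if status != "stable"]
    return sorted(broken, key=lambda name: results[name][0], reverse=True)


for name in broken_quants(RESULTS):
    bpw, status = RESULTS[name]
    print(f"{name} (~{bpw} bpw): {status}")
```

Under those assumed figures, every tested quant below IQ3_XS is broken, while IQ3_XS itself still works.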

So because the model is QAT, all quants that are not 4 bits or above will fail. Thank you for letting me know.