https://huggingface.co/mradermacher/gemma-4-31B-it-qat-q4_0-unquantized-uncensored-heretic-i1-GGUF

#2579

by swishgumbo - opened 9 days ago

Model: gemma-4-31B-it-qat-q4_0-unquantized-uncensored-heretic.i1-IQ3_XXS.gguf
https://huggingface.co/mradermacher/gemma-4-31B-it-qat-q4_0-unquantized-uncensored-heretic-i1-GGUF

Hi @mradermacher ,

I believe there may be an issue with the IQ3_XXS quant specifically.

Environment:

llama.cpp b9653
RTX 5070 Ti 16GB
Windows 11
Tested both with and without MTP
Tested in both llama.cpp and LM Studio

Issue:
The model consistently outputs leaked reasoning/control tokens such as:

<|channel>thought
<|channel>
<|channel>thought<|channel>

and frequently falls into repetition loops instead of generating a normal response.

Example prompt:

Hello.

Output:

[Start thinking]
<|channel>thought
<|channel>thought
...
(repeats indefinitely)

I can reproduce this with:

reasoning on
reasoning off
MTP enabled
MTP disabled
8k context
24k context

Most importantly, the issue occurs across multiple backends (llama.cpp and LM Studio).

Comparison:
The IQ3_XS version of the same model works normally on the same system and with the same settings. Other Gemma 4 31B models (e.g. your Sphinsikus quant) also work correctly.

Because the issue is isolated to the IQ3_XXS quant and persists across different inference engines, I suspect there may be a problem with this specific quantization or GGUF conversion.
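For anyone comparing quants of this model, the two symptoms described above (leaked channel markers and repeated lines) can be checked mechanically on captured output. The following is a minimal sketch, not part of llama.cpp or LM Studio: the function names `find_leaked_markers` and `has_repetition_loop` and the `min_repeats` threshold are hypothetical, and the marker strings are copied from the output shown in this report.

```python
import re

# Marker strings as they appear in the leaked output reported above.
LEAK_PATTERN = re.compile(r"<\|channel>(?:thought)?")


def find_leaked_markers(text):
    """Return every leaked channel marker found in the generated text."""
    return LEAK_PATTERN.findall(text)


def has_repetition_loop(text, min_repeats=3):
    """Return True if a non-empty line occurs min_repeats times in a row.

    min_repeats is an arbitrary threshold chosen for illustration.
    """
    run, prev = 0, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        run = run + 1 if line == prev else 1
        prev = line
        if run >= min_repeats:
            return True
    return False


sample = "[Start thinking]\n<|channel>thought\n<|channel>thought\n<|channel>thought\n"
print(find_leaked_markers(sample))
print(has_repetition_loop(sample))
```

Running the same prompt through each quant and passing the raw text to these two helpers gives a quick pass/fail signal per file.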

Thank you for your time, and for all the work you've put into creating and maintaining these quants.

RichardErkhov

9 days ago

Can you check the IQ2_* versions too, please? If the issue persists there, the likely cause is that q4_0 QAT training works with ~4-bit quants but breaks on lower-bit quants, because the model was never trained to behave at those precisions. Someone already noticed that IQ3_XXS was broken, but we both kind of forgot about the other quants.
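The hypothesis above can be illustrated with a toy experiment. This is only a sketch under strong assumptions: it uses a plain uniform symmetric quantizer, whereas the real IQ2/IQ3 formats in llama.cpp are codebook-based and use an importance matrix, so the numbers say nothing about actual model quality. The helper names `quantize_roundtrip` and `rms_error` are made up for this example. It shows that weights already sitting on a 4-bit grid survive a 4-bit round trip exactly, while coarser grids introduce error that grows as bits are removed.

```python
import math

SCALE = 2.0 ** -6  # power-of-two scale keeps the float arithmetic exact

# Toy "QAT" weights: every value lies exactly on a 16-level (4-bit) grid.
weights = [((i * 7) % 16 - 8) * SCALE for i in range(32)]


def quantize_roundtrip(values, bits):
    """Quantize to a uniform symmetric grid with the given bit width, then dequantize."""
    qmax = 2 ** (bits - 1)
    step = max(abs(v) for v in values) / qmax
    out = []
    for v in values:
        q = max(-qmax, min(qmax - 1, round(v / step)))
        out.append(q * step)
    return out


def rms_error(values, bits):
    """Root-mean-square difference between the values and their round trip."""
    restored = quantize_roundtrip(values, bits)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(values, restored)) / len(values))


for bits in (4, 3, 2):
    print(bits, rms_error(weights, bits))
```

In this toy setup the 4-bit error is zero and the 3-bit and 2-bit errors are progressively larger, which matches the direction of the hypothesis but does not prove it for the real quants.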

swishgumbo

9 days ago

I tested the IQ2_XXS and IQ2_M variants of the model to compare against the previously reported IQ3_XXS issue.

Environment: llama.cpp b9653, RTX 5070 Ti 16GB, Windows 11
MTP disabled, reasoning off

IQ2_XXS Issue:

The model exhibits complete generation instability.

Issue type:

Severe repetition loops / runaway repetition
Loss of sentence structure and instruction adherence

IQ2_M Issue:

The model exhibits token-level instability and collapse.

Issue type:

Token fragmentation
Multilingual subword noise / incoherent token mixing
Loss of grammatical and semantic structure

Comparison:
IQ3_XS: stable
IQ3_XXS: formatting/control token leakage + repetition instability
IQ2_XXS: full generative collapse (looping)
IQ2_M: token-space fragmentation / incoherent generation

RichardErkhov

9 days ago

So because the model is QAT, all quants below 4 bits will fail. Thank you for letting me know.
