Very bad results with model quant and KV cache quant, only BF16 works well
Hello,
First of all, this model series is great. It's concise and precise when thinking, and it works very well at BF16. However, some community members have been showing that Gemma 4 is very sensitive to model quantization and KV cache quantization. I have found similar results.
Do you have any recommendations on how people can quantize the model and KV cache and still get more favorable results than the ones I show below?
It's very possible that my testing is flawed, so I will provide details. I am using llama.cpp's llama-perplexity to do some perplexity and KL divergence testing, and here are some results. The dataset is a smaller subset of wikitext-2 (half the size). Different datasets produce similar divergence. The model is gemma-4-26B-A4B-it-BF16.gguf running in llama.cpp.
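For readers unfamiliar with the two headline metrics, here is a minimal sketch of how per-token KL divergence and "same top p" can be computed from two sets of logits. It assumes that "same top p" means the share of positions where both runs rank the same token first; the function names and the toy logits are hypothetical, and this is not llama-perplexity's actual implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token position."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(base_logits, test_logits):
    """KL(base || test) in nats for a single token position."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def same_top_token(base_logits, test_logits):
    """True when both runs give the highest logit to the same token."""
    return base_logits.index(max(base_logits)) == test_logits.index(max(test_logits))


def summarize(base_rows, test_rows):
    """Mean KLD and top-token agreement (in percent) over all positions."""
    klds = [kl_divergence(b, t) for b, t in zip(base_rows, test_rows)]
    agree = sum(same_top_token(b, t) for b, t in zip(base_rows, test_rows))
    return {
        "mean_kld": sum(klds) / len(klds),
        "same_top_pct": 100.0 * agree / len(klds),
    }


# Toy data (hypothetical): two positions, vocabulary of three tokens.
base = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.0]]
test = [[2.0, 1.0, 0.1], [2.5, 0.5, 0.0]]  # second position flips the top token
stats = summarize(base, test)
print(stats)
```

In exact arithmetic KL divergence is never negative, so the slightly negative minimum values in the logs below are presumably floating-point noise.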
Note that the day 1 results were much worse because attention rotation wasn't working. These results are from the latest llama.cpp as of 4/26/26, where attention rotation works with q8_0 and below.
I am also uploading screenshots since I can't upload a csv. The screenshots are of these results in a table format in case the pasted text below is hard to read.
Same top P comparison of model and KV cache in various formats:
KL divergence comparison of model and KV cache in various formats:
EDIT 1: It seems that going from BF16 to ANYTHING else tanks the stats. Going from F16 to anything lower shows more typical results. For example, going from F16 KV cache to Q8_0 KV cache is a 1% difference in same top p.
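As background only, here is a small sketch of how the two 16-bit formats differ: BF16 keeps float32's exponent range with 7 mantissa bits, while F16 has 10 mantissa bits and a maximum finite value of 65504. The helper names are hypothetical, and this does not diagnose the results above; it only illustrates that a value can survive one format and not the other.

```python
import math
import struct


def round_to_bf16(x):
    """Round a finite float to the nearest BF16 value (round-half-to-even)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Add the rounding bias, then drop the low 16 bits of the float32 pattern.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def round_to_f16(x):
    """Round a float to the nearest IEEE half-precision value."""
    try:
        return struct.unpack(">e", struct.pack(">e", x))[0]
    except OverflowError:
        # Python raises on overflow; map it to infinity for illustration.
        return math.copysign(math.inf, x)


for value in (1.001, 70000.0):
    print(value, "bf16:", round_to_bf16(value), "f16:", round_to_f16(value))
```

A value near 1 keeps more digits in F16, while a large value stays finite only in BF16.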
Here is the model in BF16 with KV cache at BF16:
perplexity: calculating perplexity over 210 chunks, n_ctx=512, batch_size=2048, n_seq=4
Final estimate: PPL = 18485.4149 +/- 614.44801
Here's the model in BF16 with KV cache at F16 (note the same top p score):
====== Perplexity statistics ======
Mean PPL(Q) : 18152.525879 ± 603.267310
Mean PPL(base) : 7058.138346 ± 189.564955
Cor(ln(PPL(Q)), ln(PPL(base))): 93.80%
Mean ln(PPL(Q)/PPL(base)) : 0.944628 ± 0.012298
Mean PPL(Q)/PPL(base) : 2.571857 ± 0.031629
Mean PPL(Q)-PPL(base) : 11094.387533 ± 430.486841
====== KL divergence statistics ======
Mean KLD: 0.485541 ± 0.006790
Maximum KLD: 29.637039
99.9% KLD: 16.962774
99.0% KLD: 8.318924
95.0% KLD: 2.527373
90.0% KLD: 1.109702
Median KLD: 0.025509
10.0% KLD: 0.000095
5.0% KLD: 0.000010
1.0% KLD: 0.000000
0.1% KLD: -0.000001
Minimum KLD: -0.000238
====== Token probability statistics ======
Mean Δp: 0.065 ± 0.037 %
Maximum Δp: 99.995%
99.9% Δp: 90.877%
99.0% Δp: 26.504%
95.0% Δp: 3.657%
90.0% Δp: 0.588%
75.0% Δp: 0.002%
Median Δp: -0.000%
25.0% Δp: -0.002%
10.0% Δp: -0.575%
5.0% Δp: -3.570%
1.0% Δp: -25.089%
0.1% Δp: -88.887%
Minimum Δp: -100.000%
RMS Δp : 8.515 ± 0.148 %
Same top p: 81.675 ± 0.167 %
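As a quick sanity check on how the perplexity figures in this block relate to each other, the sketch below recomputes the ratio and difference from the two mean PPL values reported above. The variable names are mine, and the interpretation (that the reported ratio matches the exponential of the mean log ratio) is an observation from these numbers, not a statement about how llama-perplexity computes them.

```python
import math

# Values copied from the "BF16 model, F16 KV cache" block above.
ppl_q = 18152.525879
ppl_base = 7058.138346
mean_log_ratio = 0.944628

ratio_of_means = ppl_q / ppl_base          # compare with the reported 2.571857
ratio_from_log = math.exp(mean_log_ratio)  # exp of the reported mean ln ratio
difference = ppl_q - ppl_base              # compare with the reported 11094.387533

print(f"ratio of means:     {ratio_of_means:.6f}")
print(f"exp(mean ln ratio): {ratio_from_log:.6f}")
print(f"difference:         {difference:.6f}")
```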
Here's the model at BF16 with KV cache at Q8_0.
====== Perplexity statistics ======
Mean PPL(Q) : 18279.907685 ± 607.276902
Mean PPL(base) : 7058.138346 ± 189.564955
Cor(ln(PPL(Q)), ln(PPL(base))): 93.59%
Mean ln(PPL(Q)/PPL(base)) : 0.951621 ± 0.012447
Mean PPL(Q)/PPL(base) : 2.589905 ± 0.032237
Mean PPL(Q)-PPL(base) : 11221.769338 ± 435.026501
====== KL divergence statistics ======
Mean KLD: 0.544280 ± 0.007331
Maximum KLD: 33.094250
99.9% KLD: 18.138201
99.0% KLD: 9.032127
95.0% KLD: 2.870639
90.0% KLD: 1.278298
Median KLD: 0.032150
10.0% KLD: 0.000118
5.0% KLD: 0.000013
1.0% KLD: 0.000000
0.1% KLD: -0.000001
Minimum KLD: -0.000309
====== Token probability statistics ======
Mean Δp: -0.016 ± 0.038 %
Maximum Δp: 99.997%
99.9% Δp: 90.437%
99.0% Δp: 25.569%
95.0% Δp: 3.978%
90.0% Δp: 0.649%
75.0% Δp: 0.002%
Median Δp: -0.000%
25.0% Δp: -0.002%
10.0% Δp: -0.596%
5.0% Δp: -3.728%
1.0% Δp: -27.902%
0.1% Δp: -88.848%
Minimum Δp: -99.998%
RMS Δp : 8.851 ± 0.150 %
Same top p: 80.196 ± 0.172 %
Next, here's the model at Q8_0 with KV cache at BF16.
====== Perplexity statistics ======
Mean PPL(Q) : 18851.154285 ± 627.063932
Mean PPL(base) : 7058.138346 ± 189.564955
Cor(ln(PPL(Q)), ln(PPL(base))): 92.92%
Mean ln(PPL(Q)/PPL(base)) : 0.982393 ± 0.012943
Mean PPL(Q)/PPL(base) : 2.670839 ± 0.034568
Mean PPL(Q)-PPL(base) : 11793.015939 ± 456.326011
====== KL divergence statistics ======
Mean KLD: 0.731529 ± 0.008635
Maximum KLD: 38.478340
99.9% KLD: 19.972658
99.0% KLD: 10.503417
95.0% KLD: 3.859527
90.0% KLD: 1.866741
Median KLD: 0.077997
10.0% KLD: 0.000327
5.0% KLD: 0.000036
1.0% KLD: 0.000001
0.1% KLD: -0.000001
Minimum KLD: -0.000042
====== Token probability statistics ======
Mean Δp: -0.110 ± 0.045 %
Maximum Δp: 100.000%
99.9% Δp: 97.303%
99.0% Δp: 33.279%
95.0% Δp: 5.334%
90.0% Δp: 0.846%
75.0% Δp: 0.002%
Median Δp: -0.000%
25.0% Δp: -0.003%
10.0% Δp: -1.022%
5.0% Δp: -6.211%
1.0% Δp: -36.593%
0.1% Δp: -96.947%
Minimum Δp: -99.995%
RMS Δp : 10.494 ± 0.154 %
Same top p: 75.501 ± 0.186 %
And finally, here's the model at Q8_0 and KV cache at Q8_0.
====== Perplexity statistics ======
Mean PPL(Q) : 19075.672627 ± 634.444853
Mean PPL(base) : 7058.138346 ± 189.564955
Cor(ln(PPL(Q)), ln(PPL(base))): 92.78%
Mean ln(PPL(Q)/PPL(base)) : 0.994233 ± 0.013040
Mean PPL(Q)/PPL(base) : 2.702649 ± 0.035244
Mean PPL(Q)-PPL(base) : 12017.534280 ± 464.000151
====== KL divergence statistics ======
Mean KLD: 0.756754 ± 0.008831
Maximum KLD: 42.110668
99.9% KLD: 19.646534
99.0% KLD: 10.752383
95.0% KLD: 4.045067
90.0% KLD: 1.934487
Median KLD: 0.082384
10.0% KLD: 0.000336
5.0% KLD: 0.000039
1.0% KLD: 0.000001
0.1% KLD: -0.000001
Minimum KLD: -0.000015
====== Token probability statistics ======
Mean Δp: -0.178 ± 0.045 %
Maximum Δp: 100.000%
99.9% Δp: 96.734%
99.0% Δp: 32.610%
95.0% Δp: 5.019%
90.0% Δp: 0.722%
75.0% Δp: 0.002%
Median Δp: -0.000%
25.0% Δp: -0.004%
10.0% Δp: -1.149%
5.0% Δp: -6.318%
1.0% Δp: -36.662%
0.1% Δp: -96.659%
Minimum Δp: -99.998%
RMS Δp : 10.388 ± 0.152 %
Same top p: 75.261 ± 0.186 %
I ran some quick tests on Gemma 4 E2B with llama.cpp commit 8701bf075, but I couldn't replicate your results. Here is what I did:
- Convert the Safetensor model to GGUF in both BF16 and FP16 formats
- Calculate the logits generated by the BF16 and FP16 models against a corpus
  - Wikipedia's full Wolf article copied and pasted into a text file
  - Ran at 4096 tokens of context
- Quantize Q8_0 versions from both BF16 and FP16 models
- Calculate the KL divergence of the logits generated by these models compared to the logits generated previously by the source models.
| Source | KV | KLD 99 | KLD 99.9 |
|---|---|---|---|
| BF16 | FP16 | 0.000037 | 0.000052 |
| BF16 | Q8_0 | 0.019366 | 0.075579 |
| FP16 | FP16 | 0.000038 | 0.000052 |
| FP16 | Q8_0 | 0.009556 | 0.045717 |
| Diff¹ | Q8_0 | 0.015760 | 0.088068 |
| Diff¹ | FP16 | 0.011851 | 0.055280 |
¹: These rows compare the BF16 model to the FP16 logits, essentially showing the divergence between both formats.
Both BF16 and FP16 seem equally good at retaining quality when a Q8_0 is quantized from them and run at KV FP16, but FP16 shows a minuscule advantage when its Q8_0 is run at KV Q8_0. The difference is so small that I wouldn't even consider it. Remember that these are KLD 99 and KLD 99.9 metrics, so they showcase worst-case scenarios.
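To make the "worst-case" reading of KLD 99 and KLD 99.9 concrete, here is a minimal sketch that summarizes a list of per-token KL divergences by percentile. It assumes linear interpolation between ranks, which may differ from what llama-perplexity does, and the per-token values are made up for illustration.

```python
import math


def percentile(values, pct):
    """Percentile with linear interpolation between neighboring ranks."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = (pct / 100.0) * (len(ordered) - 1)
    low = int(math.floor(rank))
    high = min(low + 1, len(ordered) - 1)
    frac = rank - low
    return ordered[low] * (1.0 - frac) + ordered[high] * frac


# Hypothetical per-token KLDs: mostly tiny, with a few large outliers.
klds = [0.001] * 990 + [0.5] * 9 + [5.0]

mean_kld = sum(klds) / len(klds)
median_kld = percentile(klds, 50.0)
kld_99 = percentile(klds, 99.0)
kld_999 = percentile(klds, 99.9)

print(f"mean={mean_kld:.5f} median={median_kld:.5f} "
      f"KLD99={kld_99:.5f} KLD99.9={kld_999:.5f}")
```

The high percentiles are dominated by the few tokens where the two runs disagree most, which is why they can be large even when the median is close to zero.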
I'd love to test using the 26B A4B IT model but I don't have the memory to calculate the logits from the 16-bit GGUFs... either I use swap and take forever, or I generate logits from the Q8_0 models resulting from the BF16 and FP16 GGUFs and compare the two. If the behavior you describe is consistent, the KLD divergences from the logits of both Q8_0 versions should be astronomical!
Ended up doing exactly that with the 26B A4B model:
- Convert Safetensor model to GGUF in both BF16 and FP16 formats
- Quantize Q8_0 versions from both BF16 and FP16 models
- Calculate the logits generated by the FP16 -> Q8_0 model against a corpus
- Calculate the KL divergence of the logits generated by the BF16 -> Q8_0 model against the previously calculated logits
This should demonstrate the direct difference between both versions of the Q8_0 models, one version being quantized from the BF16 and the other being quantized from the FP16.
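The steps above can be sketched as a command plan. This is only an illustration: all paths are hypothetical, the commands are printed rather than executed, and the flags (`--outtype`, `--kl-divergence-base`, `--kl-divergence`, `-ctk`, `-ctv`) are the ones I believe current llama.cpp tools accept, so check them against your build. The KV cache type used when saving the reference logits is not stated in the post; F16 is assumed here.

```python
import shlex


def convert_cmd(hf_dir, out_gguf, outtype):
    """Convert a Safetensors checkpoint to GGUF in the given 16-bit type."""
    return ["python", "convert_hf_to_gguf.py", hf_dir,
            "--outfile", out_gguf, "--outtype", outtype]


def quantize_cmd(src_gguf, dst_gguf, qtype="Q8_0"):
    """Quantize a 16-bit GGUF to the requested type."""
    return ["llama-quantize", src_gguf, dst_gguf, qtype]


def save_logits_cmd(model, corpus, logits_file, kv_type="f16"):
    """Run the reference model and save its logits (KV type is an assumption)."""
    return ["llama-perplexity", "-m", model, "-f", corpus,
            "-ctk", kv_type, "-ctv", kv_type,
            "--kl-divergence-base", logits_file]


def compare_cmd(model, logits_file, kv_type):
    """Compare another model against the saved logits at a given KV type."""
    return ["llama-perplexity", "-m", model,
            "-ctk", kv_type, "-ctv", kv_type,
            "--kl-divergence-base", logits_file, "--kl-divergence"]


# Hypothetical file names, used only for illustration.
plan = [
    convert_cmd("gemma-4-26B-A4B-it", "bf16.gguf", "bf16"),
    convert_cmd("gemma-4-26B-A4B-it", "f16.gguf", "f16"),
    quantize_cmd("bf16.gguf", "q8_from_bf16.gguf"),
    quantize_cmd("f16.gguf", "q8_from_f16.gguf"),
    save_logits_cmd("q8_from_f16.gguf", "corpus.txt", "ref_logits.bin"),
    compare_cmd("q8_from_bf16.gguf", "ref_logits.bin", "f16"),
    compare_cmd("q8_from_bf16.gguf", "ref_logits.bin", "q8_0"),
]

for cmd in plan:
    print(shlex.join(cmd))
```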
| Source | KV | KLD 99 | KLD 99.9 |
|---|---|---|---|
| Diff | FP16 | 0.088909 | 0.592613 |
| Diff | Q8_0 | 8.338683 | 17.427561 |
Even though memory constraints don't allow me to test to the extent you did, the difference in logits between the Q8_0 GGUFs quantized from BF16 and from FP16 is massive at KV FP16 and gargantuan at KV Q8_0.
I can confirm there is something VERY WRONG going on here.
I had a reminder to re-test and respond today, so I'm glad you saved me some time by confirming the issue on 26B. Something is very strange here. After doing some more research, it seems that architecture choices in Gemma are causing this. This Reddit post describes someone who wrote an inference engine and ran into many precision issues with Gemma 4: https://www.reddit.com/r/LocalLLaMA/comments/1sebwz2/got_gemma_4_running_locally_on_cuda_both_float/
Qwen3.5 and 3.6 don't suffer from these same issues: https://techstat.net/qwen3-5-27b-q8-kv-cache-benchmarks-bf16-vs-f16-vs-q8_0/
I would love to see a response from the Google team here. I spent quite a bit of time making my original post and curating the data. I am just looking for some insight.


