Hugging Face | GitHub | Launch Blog | Documentation
License: Apache 2.0 | Authors: Google DeepMind

Gemma 4-E4B

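This is a custom quantized version of the Gemma 4-E4B model: the weights are quantized to Q4_0, with tensor types overridden through a custom OVERRIDE file. It is designed for fast inference on the Qualcomm Hexagon NPU (HTP) while keeping overall MMLU-Pro accuracy between the Unsloth PTQ and Google QAT GGUF baselines reported below.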

How the model is generated

Built with llama.cpp commit 7c158fb.

All three steps are run from the llama.cpp repo root.

Step 1 — download the unquantized HF model

hf download google/gemma-4-E4B-it-qat-q4_0-unquantized --local-dir ./hf-model

Step 2 — convert HF → F16 GGUF

python convert_hf_to_gguf.py ./hf-model --outfile model-f16.gguf --outtype f16

Step 3 — follow the OVERRIDE file and quantize to Q4_0

build/bin/llama-quantize --tensor-type-file <override-file> \
    model-f16.gguf model-q4_0-override.gguf q4_0

Performance Measurement Commands

CPU runs use --device none -ngl 0; HTP runs use --device HTP0 -ngl 99. For each combination of model, backend, and CTX ∈ {512, 1024, 4096}, two llama-bench runs were issued, one for prefill and one for decode:

# environment on device
export LD_LIBRARY_PATH=./lib
export ADSP_LIBRARY_PATH=./lib

# Prefill (reports prefill tok/s; TTFT in ms = CTX / prefill tok/s × 1000)
./bin/llama-bench --device <none|HTP0> -m <model.gguf> \
  --poll 1000 -t 6 --cpu-mask 0xfc --cpu-strict 1 --ubatch-size 1024 -fa on \
  -ngl <0|99> -p <CTX> -n 0

# Decode at depth = CTX (Decode tok/s)
./bin/llama-bench --device <none|HTP0> -m <model.gguf> \
  --poll 1000 -t 6 --cpu-mask 0xfc --cpu-strict 1 --ubatch-size 1024 -fa on \
  -ngl <0|99> -p 0 -n 128 -d <CTX>
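
The full sweep described above (two backends, three context sizes, two runs each) can be sketched as a small loop. This is a minimal sketch rather than the script actually used for the measurements: `BENCH`, `MODEL`, `DRY_RUN`, `bench_cmd`, and `run_sweep` are hypothetical names introduced here, and the default model filename is only the output name from Step 3. The llama-bench flags are copied from the commands above.

```shell
#!/bin/sh
# Sketch: print (or run) both llama-bench commands for every backend/CTX pair.
# BENCH, MODEL and DRY_RUN are illustrative variables, not part of llama.cpp.
BENCH=${BENCH:-./bin/llama-bench}
MODEL=${MODEL:-model-q4_0-override.gguf}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, anything else = run them

# Flags shared by every run, copied from the commands above.
COMMON="--poll 1000 -t 6 --cpu-mask 0xfc --cpu-strict 1 --ubatch-size 1024 -fa on"

# usage: bench_cmd <cpu|htp> <ctx> <prefill|decode>
bench_cmd() {
  case "$1" in
    cpu) dev=none; ngl=0 ;;
    htp) dev=HTP0; ngl=99 ;;
    *) echo "unknown backend: $1" >&2; return 1 ;;
  esac
  case "$3" in
    prefill) phase="-p $2 -n 0" ;;
    decode)  phase="-p 0 -n 128 -d $2" ;;
    *) echo "unknown phase: $3" >&2; return 1 ;;
  esac
  echo "$BENCH --device $dev -m $MODEL $COMMON -ngl $ngl $phase"
}

run_sweep() {
  for backend in cpu htp; do
    for ctx in 512 1024 4096; do
      for phase in prefill decode; do
        cmd=$(bench_cmd "$backend" "$ctx" "$phase") || return 1
        # Unquoted expansion splits the command on spaces, so this assumes
        # BENCH and MODEL contain no whitespace.
        if [ "$DRY_RUN" = 1 ]; then echo "$cmd"; else $cmd; fi
      done
    done
  done
}

run_sweep
```

With the default `DRY_RUN=1` the script only prints the twelve commands for one model. Setting `DRY_RUN=0` on the device, after exporting `LD_LIBRARY_PATH` and `ADSP_LIBRARY_PATH` as shown above, would execute them in sequence.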

Performance Metrics

Performance on IQ9 (QCS9075M)

Each cell is prefill tok/s / decode tok/s.

| Compute | CTX | Unsloth PTQ GGUF | Google QAT GGUF | Ours |
|---|---|---|---|---|
| CPU | 512 | 40.4 / 11.93 | 54.7 / 12.37 | 53.5 / 12.97 |
| CPU | 1024 | 38.7 / 11.58 | 51.6 / 11.93 | 50.9 / 12.55 |
| CPU | 4096 | 34.4 / 9.45 | 44.2 / 9.75 | 44.3 / 10.14 |
| HTP | 512 | 149.9 / 11.32 | 357.1 / 9.82 | 355.9 / 10.57 |
| HTP | 1024 | 147.8 / 11.16 | 346.6 / 9.73 | 345.3 / 10.41 |
| HTP | 4096 | 143.4 / 10.70 | 322.8 / 9.41 | 321.5 / 10.05 |
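
The TTFT formula noted with the prefill command (TTFT = CTX / Prefill × 1000) can be applied to the prefill rates in this table. The following is a minimal sketch, assuming the first value in each cell is the prefill rate in tok/s and that the result is wanted in milliseconds; `ttft_ms` is a hypothetical helper name, not part of llama.cpp.

```shell
#!/bin/sh
# ttft_ms is a hypothetical helper, not part of llama.cpp.
# usage: ttft_ms <ctx_tokens> <prefill_tok_per_s>
# Prints the estimated time to first token in milliseconds.
ttft_ms() {
  awk -v ctx="$1" -v rate="$2" 'BEGIN {
    if (rate <= 0) { print "prefill rate must be positive" > "/dev/stderr"; exit 1 }
    printf "%.1f\n", ctx / rate * 1000
  }'
}

# Round illustrative numbers: 512 tokens at 256 tok/s.
ttft_ms 512 256   # prints 2000.0
```

Under the same assumption, `ttft_ms 512 355.9` (this model on HTP at CTX 512) gives roughly 1.44 s. The estimate covers prompt processing only and ignores model load time.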

Accuracy Metrics

MMLU-Pro accuracy, overall (mmlu_pro) and per subject:

| Subject | Unsloth PTQ GGUF | Google QAT GGUF | Ours |
|---|---|---|---|
| mmlu_pro | 0.5711 | 0.5940 | 0.5761 |
| biology | 0.7531 ± 0.0161 | 0.8173 ± 0.0144 | 0.7671 ± 0.0158 |
| business | 0.6261 ± 0.0172 | 0.6527 ± 0.0170 | 0.6274 ± 0.0172 |
| chemistry | 0.6148 ± 0.0145 | 0.6396 ± 0.0143 | 0.6140 ± 0.0145 |
| computer_science | 0.6829 ± 0.0230 | 0.6976 ± 0.0227 | 0.7000 ± 0.0227 |
| economics | 0.6836 ± 0.0160 | 0.7002 ± 0.0158 | 0.6896 ± 0.0159 |
| engineering | 0.4221 ± 0.0159 | 0.4221 ± 0.0159 | 0.4138 ± 0.0158 |
| health | 0.5672 ± 0.0173 | 0.5611 ± 0.0174 | 0.5770 ± 0.0173 |
| history | 0.3780 ± 0.0249 | 0.4724 ± 0.0256 | 0.4226 ± 0.0253 |
| law | 0.3224 ± 0.0141 | 0.2997 ± 0.0138 | 0.3079 ± 0.0139 |
| math | 0.8194 ± 0.0105 | 0.7972 ± 0.0109 | 0.8142 ± 0.0106 |
| other | 0.4416 ± 0.0163 | 0.4935 ± 0.0165 | 0.4481 ± 0.0164 |
| philosophy | 0.4890 ± 0.0224 | 0.4729 ± 0.0224 | 0.4770 ± 0.0224 |
| physics | 0.6105 ± 0.0135 | 0.6097 ± 0.0135 | 0.6305 ± 0.0134 |
| psychology | 0.5840 ± 0.0175 | 0.6805 ± 0.0165 | 0.5764 ± 0.0175 |

License

Apache 2.0
