Atomic Chat Join Discord GitHub

Gemma 4 E4B

Gemma 4 E4B, self-quantized to GGUF by Atomic Chat. Built straight from Google's original weights with a per-tensor importance matrix. Runs fully offline.

Highlights

  • Natively multimodal — handles text, image, and audio input and generates text output.
  • 4.5B effective parameters (8B with embeddings) — the "E" stands for "effective", using Per-Layer Embeddings (PLE) for on-device efficiency.
  • 128K-token context window built on a hybrid local/global attention mechanism.
  • Built-in thinking mode — configurable step-by-step reasoning, triggered with the <|think|> token.
  • Native function calling for structured tool use and agentic workflows.
  • Multilingual — out-of-the-box support for 35+ languages, pre-trained on 140+ languages.

These GGUFs are self-quantized from the original weights, not a repack. The importance matrix keeps low-bit quants closer to the full-precision model.

Always pass --jinja so the Gemma 4 E4B chat template is applied. Without it the model can emit malformed turns.

Model Overview

Property Value
Base model google/gemma-4-E4B-it
Parameters 4.5B effective (8B with embeddings); uses Per-Layer Embeddings (PLE)
Layers 42
Context length 128K tokens
Vocabulary 262K
Modalities Text, Image, Audio
Architecture Dense, hybrid local sliding-window (512) + global attention with p-RoPE
This repo GGUF quants (imatrix) + vision mmproj

Gemma 4 E4B is multimodal. This repo ships the mmproj-gemma4-e4b-it-f16.gguf vision projector. With -hf it is pulled automatically; otherwise pass --mmproj. Use llama-mtmd-cli or llama-server to feed images.

Gemma 4 E4B benchmark scores

Scores are Google's published results for the base google/gemma-4-E4B-it. Quantization preserves the large majority of this; Q4_K_M and up sit within a point or two of full precision.

Choosing a quant

Quant Size Notes
Q2_K 4.4 GB Smallest. Minimal RAM, clear quality drop.
IQ3_M 4.7 GB Beats Q3 at similar size thanks to imatrix. Best low-RAM pick.
Q3_K_M 4.9 GB Low quality but usable.
Q3_K_L 5.0 GB A step above Q3_K_M.
IQ4_XS 5.1 GB Excellent quality for size. Recommended low-bit.
Q4_K_S 5.2 GB Compact Q4, fast.
Q4_K_M 5.3 GB Recommended default. Best balance of size, speed and quality.
UD-Q4_K_XL 6.2 GB Dynamic. Embeddings and output kept at Q8_0 for higher quality at a Q4 footprint.
Q5_K_S 5.7 GB Higher quality.
Q5_K_M 5.8 GB Higher quality, low loss.
Q6_K 6.2 GB Near lossless.
Q8_0 8.0 GB Effectively lossless, reference quality.

Pick the largest file that fits your (V)RAM with room for context. Q4_K_M or UD-Q4_K_XL is the sweet spot for most setups; Q6_K or Q8_0 for maximum fidelity.

Get started

Run Gemma 4 E4B locally with:

  • Atomic Chat: the easiest path. Open the app, search AlexAtomic/gemma4-e4b-it-GGUF, pick a quant, hit Use this model.
  • llama.cpp: llama-server -hf AlexAtomic/gemma4-e4b-it-GGUF:Q4_K_M --jinja -c 8192
  • Ollama: ollama run hf.co/AlexAtomic/gemma4-e4b-it-GGUF:Q4_K_M
  • LM Studio / Jan: search the repo id, download any quant.

Best practices

Parameter Value
temperature 1.0
top_p 0.95
top_k 64

Google's standardized sampling configuration recommended across all use cases.

Run in llama.cpp

git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --target llama-cli llama-server
./llama.cpp/build/bin/llama-server \
    -hf AlexAtomic/gemma4-e4b-it-GGUF:UD-Q4_K_XL \
    --jinja -ngl 99 -c 8192 -fa on

How these were made

  1. Download google/gemma-4-E4B-it (original weights).
  2. Convert to f16 GGUF with llama.cpp.
  3. Build an importance matrix over calibration_datav3 (100 chunks).
  4. Quantize the full ladder with --imatrix.
  5. UD-Q4_K_XL additionally pins the token-embedding and output tensors to Q8_0.

License

Original model by Google DeepMind, released under the Apache 2.0 license. Quantized by Atomic Chat.

Downloads last month
2,904
GGUF
Model size
8B params
Architecture
gemma4
Hardware compatibility
Log In to add your hardware

2-bit

3-bit

4-bit

5-bit

6-bit

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for AlexAtomic/gemma4-e4b-it-GGUF

Quantized
(241)
this model