Gemma 4 31B, self-quantized to GGUF by Atomic Chat. Built straight from Google's original weights with a per-tensor importance matrix. Runs fully offline.
Highlights
- Multimodal — natively handles text and image input and generates text output.
- Built-in reasoning — designed as a capable reasoner with a configurable thinking mode set via the system prompt.
- 256K context window for long documents and codebases.
- Multilingual — out-of-the-box support for 35+ languages, pre-trained on 140+ languages.
- Coding & agentic — native function-calling support and notable coding-benchmark improvements.
- Native system prompt — native support for the
systemrole for more structured conversations.
These GGUFs are self-quantized from the original weights, not a repack. The importance matrix keeps low-bit quants closer to the full-precision model.
Always pass
--jinjaso the Gemma 4 31B chat template is applied. Without it the model can emit malformed turns.
Model Overview
| Property | Value |
|---|---|
| Base model | google/gemma-4-31B-it |
| Total parameters | 30.7B (Dense) |
| Layers | 60 |
| Context length | 256K tokens |
| Vocabulary | 262K |
| Architecture | Dense decoder, hybrid local/global attention with Proportional RoPE |
| This repo | GGUF quants (imatrix) + vision mmproj |
Gemma 4 31B is multimodal. This repo ships the
mmproj-gemma4-31b-it-f16.ggufvision projector. With-hfit is pulled automatically; otherwise pass--mmproj. Usellama-mtmd-cliorllama-serverto feed images.
Scores are Google's published results for the base google/gemma-4-31B-it. Quantization preserves the large majority of this; Q4_K_M and up sit within a point or two of full precision.
Choosing a quant
| Quant | Size | Notes |
|---|---|---|
Q2_K |
11.9 GB | Smallest. Minimal RAM, clear quality drop. |
IQ3_M |
14.2 GB | Beats Q3 at similar size thanks to imatrix. Best low-RAM pick. |
Q3_K_M |
15.3 GB | Low quality but usable. |
Q3_K_L |
16.6 GB | A step above Q3_K_M. |
IQ4_XS |
16.7 GB | Excellent quality for size. Recommended low-bit. |
Q4_K_S |
17.8 GB | Compact Q4, fast. |
Q4_K_M |
18.7 GB | Recommended default. Best balance of size, speed and quality. |
UD-Q4_K_XL |
19.0 GB | Dynamic. Embeddings and output kept at Q8_0 for higher quality at a Q4 footprint. |
Q5_K_S |
19.6 GB | Higher quality. |
Q5_K_M |
17.4 GB | Higher quality, low loss. |
Q6_K |
7.6 GB | Near lossless. |
Q8_0 |
18.4 GB | Effectively lossless, reference quality. |
Pick the largest file that fits your (V)RAM with room for context.
Q4_K_MorUD-Q4_K_XLis the sweet spot for most setups;Q6_KorQ8_0for maximum fidelity.
Get started
Run Gemma 4 31B locally with:
- Atomic Chat: the easiest path. Open the app, search
AtomicChat/gemma4-31b-it-GGUF, pick a quant, hit Use this model. - llama.cpp:
llama-server -hf AtomicChat/gemma4-31b-it-GGUF:Q4_K_M --jinja -c 8192 - Ollama:
ollama run hf.co/AtomicChat/gemma4-31b-it-GGUF:Q4_K_M - LM Studio / Jan: search the repo id, download any quant.
Best practices
| Parameter | Value |
|---|---|
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 64 |
Google's standardized sampling configuration recommended across all use cases.
Run in llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --target llama-cli llama-server
./llama.cpp/build/bin/llama-server \
-hf AtomicChat/gemma4-31b-it-GGUF:UD-Q4_K_XL \
--jinja -ngl 99 -c 8192 -fa on
How these were made
- Download
google/gemma-4-31B-it(original weights). - Convert to f16 GGUF with llama.cpp.
- Build an importance matrix over
calibration_datav3(100 chunks). - Quantize the full ladder with
--imatrix. UD-Q4_K_XLadditionally pins the token-embedding and output tensors toQ8_0.
License
Original model by Google DeepMind, released under the Apache 2.0 license. Quantized by Atomic Chat.
- Downloads last month
- 1,270
2-bit
3-bit
4-bit
5-bit
6-bit
8-bit


