Instructions to use AtomicChat/gemma-4-12b-it-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use AtomicChat/gemma-4-12b-it-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="AtomicChat/gemma-4-12b-it-GGUF",
	filename="atomic-chat-gemma412-IQ3_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)
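
The call above returns an OpenAI-style completion dictionary. A minimal sketch of pulling the reply text out of it, assuming the usual `choices[0].message.content` layout; `extract_reply` is a hypothetical helper, not part of llama-cpp-python:

```python
from typing import Any, Mapping


def extract_reply(response: Mapping[str, Any]) -> str:
    """Return the assistant text from an OpenAI-style chat completion dict."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    # Assumption: the first choice carries the assistant message.
    message = choices[0].get("message") or {}
    content = message.get("content")
    if content is None:
        raise ValueError("first choice has no message content")
    return content.strip()
```

Pass the return value of `llm.create_chat_completion(...)` to `extract_reply` to get just the generated sentence.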


llama.cpp

How to use AtomicChat/gemma-4-12b-it-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf AtomicChat/gemma-4-12b-it-GGUF:UD-Q4_K_XL
# Run inference directly in the terminal:
llama-cli -hf AtomicChat/gemma-4-12b-it-GGUF:UD-Q4_K_XL

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf AtomicChat/gemma-4-12b-it-GGUF:UD-Q4_K_XL
# Run inference directly in the terminal:
llama-cli -hf AtomicChat/gemma-4-12b-it-GGUF:UD-Q4_K_XL

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf AtomicChat/gemma-4-12b-it-GGUF:UD-Q4_K_XL
# Run inference directly in the terminal:
./llama-cli -hf AtomicChat/gemma-4-12b-it-GGUF:UD-Q4_K_XL

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf AtomicChat/gemma-4-12b-it-GGUF:UD-Q4_K_XL
# Run inference directly in the terminal:
./build/bin/llama-cli -hf AtomicChat/gemma-4-12b-it-GGUF:UD-Q4_K_XL

Use Docker

docker model run hf.co/AtomicChat/gemma-4-12b-it-GGUF:UD-Q4_K_XL


vLLM

How to use AtomicChat/gemma-4-12b-it-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "AtomicChat/gemma-4-12b-it-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AtomicChat/gemma-4-12b-it-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/AtomicChat/gemma-4-12b-it-GGUF:UD-Q4_K_XL

Ollama
How to use AtomicChat/gemma-4-12b-it-GGUF with Ollama:
```
ollama run hf.co/AtomicChat/gemma-4-12b-it-GGUF:UD-Q4_K_XL
```

Unsloth Studio

How to use AtomicChat/gemma-4-12b-it-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for AtomicChat/gemma-4-12b-it-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for AtomicChat/gemma-4-12b-it-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for AtomicChat/gemma-4-12b-it-GGUF to start chatting

Pi

How to use AtomicChat/gemma-4-12b-it-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf AtomicChat/gemma-4-12b-it-GGUF:UD-Q4_K_XL

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "AtomicChat/gemma-4-12b-it-GGUF:UD-Q4_K_XL"
        }
      ]
    }
  }
}
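
If `~/.pi/agent/models.json` already lists other providers, editing it by hand risks overwriting them. The sketch below merges the `llama-cpp` entry shown above into an existing file. `add_llama_cpp_provider` is a hypothetical helper written for this example, not part of Pi, and it assumes the file layout is exactly the one shown above:

```python
import json
from pathlib import Path


def add_llama_cpp_provider(config_path: Path, model_id: str,
                           base_url: str = "http://localhost:8080/v1") -> dict:
    """Merge a llama-cpp provider entry into a Pi models.json file."""
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    # Keep any providers that are already configured.
    providers = config.setdefault("providers", {})
    providers["llama-cpp"] = {
        "baseUrl": base_url,
        "api": "openai-completions",
        "apiKey": "none",
        "models": [{"id": model_id}],
    }
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example: `add_llama_cpp_provider(Path.home() / ".pi/agent/models.json", "AtomicChat/gemma-4-12b-it-GGUF:UD-Q4_K_XL")`.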

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use AtomicChat/gemma-4-12b-it-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf AtomicChat/gemma-4-12b-it-GGUF:UD-Q4_K_XL

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default AtomicChat/gemma-4-12b-it-GGUF:UD-Q4_K_XL

Run Hermes

hermes

Docker Model Runner
How to use AtomicChat/gemma-4-12b-it-GGUF with Docker Model Runner:
```
docker model run hf.co/AtomicChat/gemma-4-12b-it-GGUF:UD-Q4_K_XL
```

Lemonade

How to use AtomicChat/gemma-4-12b-it-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull AtomicChat/gemma-4-12b-it-GGUF:UD-Q4_K_XL

Run and chat with the model

lemonade run user.gemma-4-12b-it-GGUF-UD-Q4_K_XL

List all available models

lemonade list

Base model: google/gemma-4-12b-it

Gemma 4 12B, self-quantized to GGUF by Atomic Chat. Built straight from Google's original bf16 weights with a per-tensor importance matrix, so every file stays close to full precision. Runs fully offline.

Highlights

Gemma 4 12B is Google DeepMind's encoder-free model that projects raw inputs straight into the LLM embedding space. It punches well above its size on reasoning, code and long context while staying small enough for a laptop.

Reasoning and code at a level usually reserved for much larger models.
256K context for long documents and codebases.
Full quant ladder from Q2_K to Q8_0, plus a dynamic UD-Q4_K_XL.
Importance matrix on every quant, computed over the standard calibration_datav3 corpus, so low-bit files lose far less quality.
Open weights, fully offline through Atomic Chat, llama.cpp, Ollama, LM Studio or Jan.

These GGUFs are self-quantized from Google's original bf16 weights, not a repack. The importance matrix keeps low-bit quants closer to the full-precision model.

Always pass --jinja so the Gemma 4 chat template is applied. Without it the model can emit malformed turns.

Model Overview

| Property | Value |
| --- | --- |
| Base model | `google/gemma-4-12b-it` |
| Total parameters | 11.95B |
| Layers | 48 |
| Context length | 256K (262,144) |
| Vocabulary | 262K |
| Architecture | `gemma4` |
| This repo | GGUF quants (imatrix) + vision/audio mmproj |

Gemma 4 is natively multimodal (text, image, audio). This repo ships the mmproj-gemma4-12b-f16.gguf projector for vision and audio. With -hf the projector is pulled automatically; otherwise pass it via --mmproj. Use llama-mtmd-cli or llama-server to feed images and audio.
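
When llama-server is running with the projector loaded, a local image can be sent in the same OpenAI-style message shape used elsewhere on this page. The sketch below builds such a message from raw bytes. It assumes the server accepts a base64 `data:` URI in the `image_url` field; `image_message` is a hypothetical helper name:

```python
import base64


def image_message(prompt: str, image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """Build an OpenAI-style user message with an inline base64 image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # Assumption: the server accepts a data URI in place of an http(s) URL.
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }
```

Read the file with `Path("photo.jpg").read_bytes()` (a hypothetical path) and place the resulting message in the `messages` list of a chat completion request.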

Benchmark scores are Google's published results for the base gemma-4-12b-it. Quantization preserves most of that quality; Q4_K_M and larger quants sit within a point or two of full precision.

Choosing a quant

| Quant | Size | Notes |
| --- | --- | --- |
| `Q2_K` | 4.5 GB | Smallest. Minimal RAM, clear quality drop. |
| `IQ3_M` | 5.4 GB | Beats Q3 at similar size thanks to imatrix. Best low-RAM pick. |
| `Q3_K_M` | 5.7 GB | Low quality but usable. |
| `Q3_K_L` | 6.2 GB | A step above `Q3_K_M`. |
| `IQ4_XS` | 6.2 GB | Excellent quality for size. Recommended low-bit. |
| `Q4_K_S` | 6.6 GB | Compact Q4, fast. |
| `Q4_K_M` | 6.9 GB | Recommended default. Best balance of size, speed and quality. |
| `UD-Q4_K_XL` | 7.2 GB | Dynamic. Embeddings and output kept at `Q8_0` for higher quality at a Q4 footprint. |
| `Q5_K_S` | 7.1 GB | Higher quality. |
| `Q5_K_M` | 8.0 GB | Higher quality, low loss. |
| `Q6_K` | 9.2 GB | Near lossless. |
| `Q8_0` | 12.0 GB | Effectively lossless, reference quality. |

Pick the largest file that fits your (V)RAM with room for context. Q4_K_M or UD-Q4_K_XL is the sweet spot for most setups; Q6_K or Q8_0 for maximum fidelity.
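
The rule of thumb above can be sketched as a small lookup. The sizes are copied from the table; the 2 GB default headroom for context and runtime buffers is an assumption for illustration, not a measured figure, and `pick_quant` is a hypothetical helper:

```python
from typing import Optional

# File sizes in GB, copied from the quant table.
QUANT_SIZES_GB = {
    "Q2_K": 4.5, "IQ3_M": 5.4, "Q3_K_M": 5.7, "Q3_K_L": 6.2, "IQ4_XS": 6.2,
    "Q4_K_S": 6.6, "Q4_K_M": 6.9, "UD-Q4_K_XL": 7.2, "Q5_K_S": 7.1,
    "Q5_K_M": 8.0, "Q6_K": 9.2, "Q8_0": 12.0,
}


def pick_quant(memory_gb: float, headroom_gb: float = 2.0) -> Optional[str]:
    """Return the largest quant whose file fits in memory_gb minus headroom_gb."""
    budget = memory_gb - headroom_gb
    best: Optional[str] = None
    for name, size in QUANT_SIZES_GB.items():
        # ">=" lets a later table row win a tie (e.g. IQ4_XS over Q3_K_L).
        if size <= budget and (best is None or size >= QUANT_SIZES_GB[best]):
            best = name
    return best  # None when nothing fits
```

With 10 GB available and 3 GB reserved, `pick_quant(10.0, headroom_gb=3.0)` selects `Q4_K_M`. Longer contexts need more headroom, so raise `headroom_gb` when running with a large `-c` value.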

Get started

Run Gemma 4 12B locally with:

Atomic Chat: the easiest path. Open the app, search AtomicChat/gemma-4-12b-it-GGUF, pick a quant, hit Use this model.
llama.cpp: llama-server -hf AtomicChat/gemma-4-12b-it-GGUF:Q4_K_M --jinja -c 8192 (build steps in the section below).
Ollama: ollama run hf.co/AtomicChat/gemma-4-12b-it-GGUF:Q4_K_M
LM Studio: search the repo id, download any quant.
Jan: search the repo id, download any quant.

Best practices

Gemma 4 works well with its standard sampling defaults:

| Parameter | Value |
| --- | --- |
| temperature | 1.0 |
| top_k | 64 |
| top_p | 0.95 |
| min_p | 0.0 |
| repeat_penalty | 1.0 |

Drop temperature to 0.6 or 0.7 for code and math, where you want more consistent, less varied output.
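
As a rough illustration, these defaults can be attached to a request against a local llama-server. The sketch assumes the server listens on `http://localhost:8080/v1` and accepts `top_k`, `min_p` and `repeat_penalty` as extra fields on its OpenAI-compatible chat endpoint; `build_chat_request` and `send` are hypothetical helper names:

```python
import json
import urllib.request

# Sampling defaults from the table above.
GEMMA_SAMPLING = {"temperature": 1.0, "top_k": 64, "top_p": 0.95,
                  "min_p": 0.0, "repeat_penalty": 1.0}


def build_chat_request(prompt: str, *, precise: bool = False) -> dict:
    """Build a chat completion body; precise=True lowers temperature to 0.6."""
    body = {"messages": [{"role": "user", "content": prompt}], **GEMMA_SAMPLING}
    if precise:
        body["temperature"] = 0.6  # lower end of the range suggested for code and math
    return body


def send(body: dict, base_url: str = "http://localhost:8080/v1") -> dict:
    """POST the body to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Call `send(build_chat_request("Write a binary search in Python.", precise=True))` once the server is running.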

Run in llama.cpp

Build llama.cpp, then point llama-server straight at this repo:

apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

./llama.cpp/llama-server \
    -hf AtomicChat/gemma-4-12b-it-GGUF:UD-Q4_K_XL \
    --jinja -ngl 99 -c 8192 -fa on

Set -DGGML_CUDA=OFF for CPU or Metal builds.

How these were made

Download google/gemma-4-12b-it (bf16).
Convert to f16 GGUF with llama.cpp.
Build an importance matrix over calibration_datav3 (100 chunks).
Quantize the full ladder with --imatrix.
UD-Q4_K_XL additionally pins the token-embedding and output tensors to Q8_0.
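
The steps above map roughly onto llama.cpp's `llama-imatrix` and `llama-quantize` tools. The sketch below only assembles the command lines; it is not the exact pipeline used for this repo. All file names are hypothetical, and the base type used for the pinned variant (`Q4_K_M` in the usage note) is an assumption:

```python
from typing import List

# Hypothetical file names, chosen for illustration only.
F16_GGUF = "gemma-4-12b-it-f16.gguf"
IMATRIX = "imatrix.dat"
CALIBRATION = "calibration_datav3.txt"


def imatrix_command(chunks: int = 100) -> List[str]:
    """Command that builds the importance matrix over the calibration text."""
    return ["llama-imatrix", "-m", F16_GGUF, "-f", CALIBRATION,
            "-o", IMATRIX, "--chunks", str(chunks)]


def quantize_command(quant: str, out_file: str, pin_q8: bool = False) -> List[str]:
    """Command that quantizes the f16 GGUF using the importance matrix."""
    cmd = ["llama-quantize", "--imatrix", IMATRIX]
    if pin_q8:
        # Keep token embeddings and the output tensor at Q8_0.
        cmd += ["--token-embedding-type", "q8_0", "--output-tensor-type", "q8_0"]
    return cmd + [F16_GGUF, out_file, quant]
```

`quantize_command("Q4_K_M", "out-Q4_K_M.gguf")` produces a plain quant, while `pin_q8=True` adds the two pinning flags; run the resulting lists with `subprocess.run(cmd, check=True)`.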

License

These weights are derived from Gemma and stay governed by the Gemma Terms of Use. By downloading you agree to those terms. Original model by Google DeepMind. Quantized by Atomic Chat.
