Gemma 4 12B, self-quantized to GGUF by Atomic Chat. Built straight from Google's original bf16 weights with a per-tensor importance matrix, so every file stays close to full precision. Runs fully offline.
Highlights
Gemma 4 12B is Google DeepMind's encoder-free model that projects raw inputs straight into the LLM embedding space. It punches well above its size on reasoning, code and long context while staying small enough for a laptop.
- Reasoning and code at a level usually reserved for much larger models.
- 256K context for long documents and codebases.
- Full quant ladder from `Q2_K` to `Q8_0`, plus a dynamic `UD-Q4_K_XL`.
- Importance matrix on every quant, computed over the standard `calibration_datav3` corpus, so low-bit files lose far less quality.
- Open weights, fully offline through Atomic Chat, llama.cpp, Ollama, LM Studio or Jan.
These GGUFs are self-quantized from Google's original bf16 weights, not a repack. The importance matrix keeps low-bit quants closer to the full-precision model.
Always pass `--jinja` so the Gemma 4 chat template is applied. Without it the model can emit malformed turns.
Model Overview
| Property | Value |
|---|---|
| Base model | google/gemma-4-12b-it |
| Total parameters | 11.95B |
| Layers | 48 |
| Context length | 256K (262,144) |
| Vocabulary | 262K |
| Architecture | gemma4 |
| This repo | GGUF quants (imatrix) + vision/audio mmproj |
Gemma 4 is natively multimodal (text, image, audio). This repo ships the `mmproj-gemma4-12b-f16.gguf` projector for vision and audio. With `-hf` the projector is pulled automatically; otherwise pass it via `--mmproj`. Use `llama-mtmd-cli` or `llama-server` to feed images and audio.
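As a rough illustration of feeding an image to `llama-server`, the sketch below builds an OpenAI-style chat payload with the image inlined as a base64 data URI. It assumes the server is running locally on its default port 8080 with the projector loaded; the helper names `build_image_request` and `send`, and the `photo.jpg` path in the usage comment, are hypothetical and not part of this repo.

```python
import base64
import json
import urllib.request

# Assumption: llama-server is running locally on its default port (8080).
SERVER_URL = "http://localhost:8080/v1/chat/completions"


def build_image_request(image_bytes: bytes, prompt: str, mime: str = "image/jpeg") -> dict:
    """Build an OpenAI-style chat payload with one text part and one inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime};base64,{encoded}"},
                    },
                ],
            }
        ]
    }


def send(payload: dict, url: str = SERVER_URL) -> str:
    """POST the payload to the server and return the assistant's reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (needs a running server; "photo.jpg" is a placeholder path):
#   with open("photo.jpg", "rb") as handle:
#       payload = build_image_request(handle.read(), "Describe this image in one sentence.")
#   print(send(payload))
```

Building the payload separately from sending it makes the request easy to inspect or log before any network call is made.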
Scores are Google's published results for the base gemma-4-12b-it. Quantization preserves the large majority of this; Q4_K_M and up sit within a point or two of full precision.
Choosing a quant
| Quant | Size | Notes |
|---|---|---|
| `Q2_K` | 4.5 GB | Smallest. Minimal RAM, clear quality drop. |
| `IQ3_M` | 5.4 GB | Beats Q3 at similar size thanks to imatrix. Best low-RAM pick. |
| `Q3_K_M` | 5.7 GB | Low quality but usable. |
| `Q3_K_L` | 6.2 GB | A step above Q3_K_M. |
| `IQ4_XS` | 6.2 GB | Excellent quality for size. Recommended low-bit. |
| `Q4_K_S` | 6.6 GB | Compact Q4, fast. |
| `Q4_K_M` | 6.9 GB | Recommended default. Best balance of size, speed and quality. |
| `UD-Q4_K_XL` | 7.2 GB | Dynamic. Embeddings and output kept at Q8_0 for higher quality at a Q4 footprint. |
| `Q5_K_S` | 7.1 GB | Higher quality. |
| `Q5_K_M` | 8.0 GB | Higher quality, low loss. |
| `Q6_K` | 9.2 GB | Near lossless. |
| `Q8_0` | 12.0 GB | Effectively lossless, reference quality. |
Pick the largest file that fits your (V)RAM with room for context. `Q4_K_M` or `UD-Q4_K_XL` is the sweet spot for most setups; `Q6_K` or `Q8_0` for maximum fidelity.
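The "largest file that fits" rule can be sketched as a small lookup over the sizes in the table. The 2 GB default headroom for context and runtime overhead is an assumption chosen for illustration, not a figure from this repo, and `pick_quant` is a hypothetical helper name.

```python
# File sizes in GB, copied from the quant table in this README (table order kept).
QUANTS = [
    ("Q2_K", 4.5),
    ("IQ3_M", 5.4),
    ("Q3_K_M", 5.7),
    ("Q3_K_L", 6.2),
    ("IQ4_XS", 6.2),
    ("Q4_K_S", 6.6),
    ("Q4_K_M", 6.9),
    ("UD-Q4_K_XL", 7.2),
    ("Q5_K_S", 7.1),
    ("Q5_K_M", 8.0),
    ("Q6_K", 9.2),
    ("Q8_0", 12.0),
]


def pick_quant(memory_gb: float, headroom_gb: float = 2.0):
    """Return the largest quant whose file fits in memory_gb minus headroom_gb.

    headroom_gb is an assumed allowance for the KV cache and runtime overhead.
    Ties on size go to the entry listed later in the table. Returns None when
    nothing fits.
    """
    budget = memory_gb - headroom_gb
    fitting = [
        (size, position, name)
        for position, (name, size) in enumerate(QUANTS)
        if size <= budget
    ]
    if not fitting:
        return None
    return max(fitting)[2]


print(pick_quant(8.0))   # 6.0 GB budget
print(pick_quant(16.0))  # 14.0 GB budget
```

Longer contexts need more headroom, so treat the result as a starting point and step down one quant if the model fails to load.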
Get started
Run Gemma 4 12B locally with:
- Atomic Chat: the easiest path. Open the app, search `AtomicChat/gemma-4-12b-it-GGUF`, pick a quant, hit Use this model.
- llama.cpp: `llama-server -hf AtomicChat/gemma-4-12b-it-GGUF:Q4_K_M --jinja -c 8192` (build steps in the section below).
- Ollama: `ollama run hf.co/AtomicChat/gemma-4-12b-it-GGUF:Q4_K_M`
- LM Studio: search the repo id, download any quant.
- Jan: search the repo id, download any quant.
Best practices
Gemma 4 works well with its standard sampling defaults:
| Parameter | Value |
|---|---|
| temperature | 1.0 |
| top_k | 64 |
| top_p | 0.95 |
| min_p | 0.0 |
| repeat_penalty | 1.0 |
Drop temperature to 0.6 or 0.7 for code and math, where you want more predictable output.
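One way to carry these defaults into a request is sketched below. It assumes the payload will be POSTed to `llama-server`'s OpenAI-compatible `/v1/chat/completions` endpoint and that the server accepts `top_k`, `min_p` and `repeat_penalty` as extra fields alongside the standard ones; `build_chat_request` is a hypothetical helper name.

```python
import json

# Sampling defaults copied from the table above.
GEMMA4_SAMPLING = {
    "temperature": 1.0,
    "top_k": 64,
    "top_p": 0.95,
    "min_p": 0.0,
    "repeat_penalty": 1.0,
}


def build_chat_request(prompt: str, temperature=None) -> dict:
    """Return a chat payload using the defaults, optionally overriding temperature."""
    params = dict(GEMMA4_SAMPLING)  # copy so the shared defaults are never mutated
    if temperature is not None:
        params["temperature"] = temperature
    return {"messages": [{"role": "user", "content": prompt}], **params}


# A coding prompt with the lower temperature suggested for code and math.
payload = build_chat_request("Write a function that reverses a string.", temperature=0.6)
print(json.dumps(payload, indent=2))
```

Keeping the defaults in one dictionary means a per-request override never leaks into later calls.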
Run in llama.cpp
Build llama.cpp, then point llama-server straight at this repo:
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
./llama.cpp/llama-server \
-hf AtomicChat/gemma-4-12b-it-GGUF:UD-Q4_K_XL \
--jinja -ngl 99 -c 8192 -fa on
Set -DGGML_CUDA=OFF for CPU or Metal builds.
How these were made
- Download `google/gemma-4-12b-it` (bf16).
- Convert to f16 GGUF with llama.cpp.
- Build an importance matrix over `calibration_datav3` (100 chunks).
- Quantize the full ladder with `--imatrix`.
- `UD-Q4_K_XL` additionally pins the token-embedding and output tensors to `Q8_0`.
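The steps above might map onto llama.cpp's tools roughly as follows. This is a sketch under stated assumptions, not the exact commands used for this repo: the file names (`model-f16.gguf`, `imatrix.dat`, `calibration_datav3.txt`), the local model directory and the `Q4_K_M` type in the usage loop are all placeholders, and the model is assumed to be downloaded already. The script only prints the commands; it does not run them.

```python
import shlex


def build_commands(model_dir, calibration_file, quant_type, pin_embeddings=False):
    """Return the convert, imatrix and quantize commands as argument lists."""
    f16_path = "model-f16.gguf"       # placeholder intermediate file
    imatrix_path = "imatrix.dat"      # placeholder importance-matrix file
    out_path = f"model-{quant_type}.gguf"

    # Step 2: convert the Hugging Face checkpoint to an f16 GGUF.
    convert = [
        "python", "llama.cpp/convert_hf_to_gguf.py", model_dir,
        "--outtype", "f16", "--outfile", f16_path,
    ]
    # Step 3: compute the importance matrix over 100 chunks of calibration text.
    imatrix = [
        "llama-imatrix", "-m", f16_path, "-f", calibration_file,
        "-o", imatrix_path, "--chunks", "100",
    ]
    # Step 4: quantize with the importance matrix.
    quantize = ["llama-quantize", "--imatrix", imatrix_path]
    if pin_embeddings:
        # Step 5: keep token-embedding and output tensors at Q8_0.
        quantize += ["--token-embedding-type", "q8_0", "--output-tensor-type", "q8_0"]
    quantize += [f16_path, out_path, quant_type]
    return [convert, imatrix, quantize]


# "Q4_K_M" and both paths below are illustrative assumptions.
for command in build_commands("gemma-4-12b-it", "calibration_datav3.txt", "Q4_K_M"):
    print(shlex.join(command))
```

Passing `pin_embeddings=True` adds the two tensor-type overrides that correspond to the last step; the base type used for `UD-Q4_K_XL` is not stated here, so it is left to the caller.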
License
These weights are derived from Gemma and stay governed by the Gemma Terms of Use. By downloading you agree to those terms. Original model by Google DeepMind. Quantized by Atomic Chat.