Gemma 4 12B Instruction-Tuned — GGUF (multimodal)
Community GGUF mirror of google/gemma-4-12B-it for local, encoder-free multimodal AI on consumer hardware (~16 GB VRAM).
Announced June 2026: Google blog · Developer guide
| Property | Value |
|---|---|
| Parameters | ~12B dense |
| Modalities | Text, vision, audio (native in backbone) |
| License | Apache 2.0 |
| Architecture | Encoder-free (no separate vision/audio towers) |
| Context | See upstream config |
| Vision in GGUF | Requires mmproj-*.gguf alongside main weights |
Why this repo exists
- One download hub for all major quants (K-quants, IQ, Q8, mmproj).
- Fast Hub-side sync from bartowski/gemma-4-12B-it-GGUF — no re-upload from your laptop.
- Documented use cases for contributors: gemma-4-12b-local (agents, LiteRT, llama.cpp, MLX).
Available files
See gguf-manifest.json for the live file list.
Essential tier (recommended)
| File | Use |
|---|---|
| gemma-4-12B-it-Q4_K_M.gguf | Best balance for 16 GB laptops |
| gemma-4-12B-it-Q5_K_M.gguf | Higher quality |
| gemma-4-12B-it-Q6_K.gguf / Q8_0 | Max quality |
| gemma-4-12B-it-Q3_K_M.gguf | Tighter VRAM |
| gemma-4-12B-it-Q2_K.gguf | Minimum size |
| gemma-4-12B-it-IQ4_XS.gguf / IQ4_NL.gguf | IQ variants |
| mmproj-gemma-4-12B-it-f16.gguf | Required for images in llama.cpp |
Full tier
All bartowski quants (Q2_K_L, Q3_K_XL, Q4_0, Q4_1, bf16, imatrix, etc.) — run make sync-gemma4-gguf-full.
Download
pip install -U huggingface_hub
# Text + vision (recommended)
huggingface-cli download Edmon02/gemma-4-12B-it-GGUF \
gemma-4-12B-it-Q4_K_M.gguf \
mmproj-gemma-4-12B-it-f16.gguf \
--local-dir ./models/gemma-4-12b
Accept the license on google/gemma-4-12B-it before using weights.
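The same two files can also be fetched from Python. This is a minimal sketch, assuming the `huggingface_hub` package installed above. `hf_hub_download` is that library's function; `REPO_ID`, `FILES`, `target_paths`, and `download_all` are illustrative names introduced here and are not part of this repo.

```python
from pathlib import Path

REPO_ID = "Edmon02/gemma-4-12B-it-GGUF"
# File names taken from the Essential tier table; swap the quant as needed.
FILES = [
    "gemma-4-12B-it-Q4_K_M.gguf",
    "mmproj-gemma-4-12B-it-f16.gguf",
]


def target_paths(local_dir, files=FILES):
    """Return where each file is expected under local_dir (no network access)."""
    return [Path(local_dir) / name for name in files]


def download_all(local_dir="./models/gemma-4-12b", files=FILES):
    """Download each file with huggingface_hub and return the local paths."""
    # Imported lazily so target_paths works without the package installed.
    from huggingface_hub import hf_hub_download

    return [
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=local_dir)
        for name in files
    ]
```

Calling `download_all()` performs the actual download. If the Hub asks for authentication, setting the `HF_TOKEN` environment variable before the call is one way to supply a token.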
Quick start
llama.cpp (text)
llama-cli -m gemma-4-12B-it-Q4_K_M.gguf -p "Explain encoder-free multimodal models in 3 bullets." -n 256
llama.cpp (image + text)
llama-mtmd-cli \
-m gemma-4-12B-it-Q4_K_M.gguf \
--mmproj mmproj-gemma-4-12B-it-f16.gguf \
--image photo.jpg \
-p "Describe this image."
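When the image command is driven from a script, it can help to assemble the arguments in one place and to check the file paths before launching. The sketch below uses only the flags shown above (`-m`, `--mmproj`, `--image`, `-p`); the helper names `build_mtmd_command` and `missing_files` are hypothetical and exist only for this illustration.

```python
import shlex
from pathlib import Path


def build_mtmd_command(model, mmproj, image, prompt):
    """Assemble the llama-mtmd-cli argument list used in the quick start."""
    return [
        "llama-mtmd-cli",
        "-m", model,
        "--mmproj", mmproj,
        "--image", image,
        "-p", prompt,
    ]


def missing_files(*paths):
    """Return the paths that are not existing files, so a bad path fails early."""
    return [p for p in paths if not Path(p).is_file()]


cmd = build_mtmd_command(
    "gemma-4-12B-it-Q4_K_M.gguf",
    "mmproj-gemma-4-12B-it-f16.gguf",
    "photo.jpg",
    "Describe this image.",
)
# Print a shell-quoted version of the command for inspection.
print(shlex.join(cmd))
```

Once `missing_files(...)` returns an empty list for the model, projector, and image, the list can be passed to `subprocess.run(cmd, check=True)`.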
LiteRT-LM (OpenAI-compatible local server)
litert-lm import --from-huggingface-repo=litert-community/gemma-4-12B-it-litert-lm gemma-4-12B-it.litertlm gemma4-12b
litert-lm serve
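Because the server is described as OpenAI-compatible, a standard chat-completions request should work against it. The following is a sketch under stated assumptions: `BASE_URL` is a placeholder to replace with the address your server reports, and `MODEL` assumes the `gemma4-12b` name from the import command is what the server expects. Neither value is confirmed by this README.

```python
import json
import urllib.request

# Assumption: placeholder address; use the one your local server prints.
BASE_URL = "http://localhost:8080/v1"
# Assumption: the name given in the import command above.
MODEL = "gemma4-12b"


def build_chat_request(prompt, model=MODEL, max_tokens=256):
    """Build an OpenAI-style chat-completions payload for a text prompt."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt, base_url=BASE_URL):
    """POST the payload to /chat/completions and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With the server running, `chat("Explain encoder-free multimodal models in 3 bullets.")` would return the model's answer as a string.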
LM Studio / Ollama
Import Edmon02/gemma-4-12B-it-GGUF and select Q4_K_M + mmproj.
Use cases
| Use case | Quant | Tool |
|---|---|---|
| Local coding agent | Q4_K_M | OpenCode, Continue, Aider |
| Voice + vision assistant | Q5_K_M + mmproj | Google AI Edge Gallery / Eloquent (Mac) |
| Armenian + English research | Q4_K_M | Pair with HyVoxPopuli ASR/TTS |
| Low-VRAM laptop | Q3_K_M or IQ4_XS | llama.cpp |
| Fast inference | MTP drafter (upstream) | Google checkpoint + compatible runtime |
Hardware guide
| VRAM | Suggested files |
|---|---|
| 8 GB | IQ4_XS or Q3_K_M (text only, short context) |
| 16 GB | Q4_K_M + mmproj |
| 24 GB+ | Q6_K or Q8_0 + mmproj |
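The table can be read as a simple threshold lookup. The sketch below encodes it that way; treating values between the listed tiers (for example 12 GB) as belonging to the next lower tier is an assumption made here, and `suggest_files` is a hypothetical helper name.

```python
def suggest_files(vram_gb):
    """Map available VRAM (in GB) to the suggestions in the hardware guide."""
    if vram_gb >= 24:
        return ["Q6_K or Q8_0", "mmproj"]
    if vram_gb >= 16:
        return ["Q4_K_M", "mmproj"]
    if vram_gb >= 8:
        # Text only, short context: no mmproj at this tier.
        return ["IQ4_XS or Q3_K_M"]
    # Below the smallest tier listed in the table: no suggestion.
    return []


for vram in (8, 16, 24):
    print(vram, suggest_files(vram))
```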
Provenance
| Item | Source |
|---|---|
| Base model | google/gemma-4-12B-it |
| GGUF quants | Mirrored from bartowski/gemma-4-12B-it-GGUF |
| Maintainer scripts | Edmon02/audio_set scripts/sync_gemma4_gguf_quants.py |
Limitations
- Community quants: validate quality on your tasks against the official BF16 weights.
- Audio in GGUF may require the latest llama.cpp / LM Studio builds.
- Gated upstream: an HF token and license acceptance are required for google/* repos.
Contributing
Add recipes under projects/gemma-4-12b-local/examples/. See CONTRIBUTING.md in that folder.
Citation
@article{gemma_2026,
title={Gemma 4},
author={Google DeepMind},
year={2026},
url={https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12B/}
}