Instructions to use Dhptl/gemma-4-12B-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Dhptl/gemma-4-12B-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="Dhptl/gemma-4-12B-GGUF",
	filename="gemma-4-12B-IQ4_XS.gguf",
)

output = llm(
	"Once upon a time,",
	max_tokens=512,
	echo=True
)
print(output)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use Dhptl/gemma-4-12B-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Dhptl/gemma-4-12B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Dhptl/gemma-4-12B-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Dhptl/gemma-4-12B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Dhptl/gemma-4-12B-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Dhptl/gemma-4-12B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf Dhptl/gemma-4-12B-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Dhptl/gemma-4-12B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Dhptl/gemma-4-12B-GGUF:Q4_K_M

Use Docker

docker model run hf.co/Dhptl/gemma-4-12B-GGUF:Q4_K_M

LM Studio
Jan
Ollama
How to use Dhptl/gemma-4-12B-GGUF with Ollama:
```
ollama run hf.co/Dhptl/gemma-4-12B-GGUF:Q4_K_M
```

Unsloth Studio

How to use Dhptl/gemma-4-12B-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Dhptl/gemma-4-12B-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Dhptl/gemma-4-12B-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Dhptl/gemma-4-12B-GGUF to start chatting

Atomic Chat new
Docker Model Runner
How to use Dhptl/gemma-4-12B-GGUF with Docker Model Runner:
```
docker model run hf.co/Dhptl/gemma-4-12B-GGUF:Q4_K_M
```

Lemonade

How to use Dhptl/gemma-4-12B-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Dhptl/gemma-4-12B-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.gemma-4-12B-GGUF-Q4_K_M

List all available models

lemonade list

gemma-4-12B — GGUF Quantizations

Quantized GGUF versions of google/gemma-4-12B.

These files work with llama.cpp, Ollama, LM Studio, Jan, and any other GGUF-compatible runtime.

Quantized by Dhptl on June 09, 2026

📦 Available Files

Filename	Size	Quant	Use Case
`gemma-4-12B-IQ4_XS.gguf`	6.23 GB	`IQ4_XS`	Minimal RAM usage
`gemma-4-12B-Q4_K_M.gguf`	6.87 GB	`Q4_K_M` ✅ Recommended	General use, everyday inference
`gemma-4-12B-Q5_K_M.gguf`	7.96 GB	`Q5_K_M`	When you want a bit more accuracy
`gemma-4-12B-Q8_0.gguf`	11.80 GB	`Q8_0`	High-quality inference, evaluation

Which file should I download?

If you have...	Download this
8 GB RAM	`IQ4_XS` — Smallest, runs on 8GB
10 GB RAM	`Q4_K_M` — Best choice ✅
12 GB RAM	`Q5_K_M` — Better quality
16 GB+ RAM	`Q8_0` — Near-original quality

🧠 Original Model Quality Benchmarks

Results from Gemma 4 12B (Base) — reported by Google. Results reported by Google on the base model. These benchmarks apply to the original BF16 model. GGUF quantization preserves ~98–99% of quality for Q5/Q8 and ~96–97% for Q4 variants.

Benchmark	Category	Score
MMLU Pro	Text	77.2%
GPQA Diamond	Science	78.8%
AIME 2026 (no tools)	Math	77.5%
LiveCodeBench v6	Coding	72.0%
BigBench Extra Hard	Reasoning	53.0%
MMMLU	Multilingual	83.4%
MMMU Pro	Vision	69.1%
MRCR v2 8-needle 128k	Long Context	43.4%

📊 Speed Benchmarks

Tested on: Intel(R) Core(TM) Ultra 7 258V | 31.5GB RAM | Intel Arc 140V (Vulkan)

Model	Size	Generation	Prompt Processing
`gemma-4-12B-IQ4_XS.gguf`	6.23 GB	8.1 tok/s	249.7 tok/s
`gemma-4-12B-Q4_K_M.gguf`	6.87 GB	10.9 tok/s	232.2 tok/s
`gemma-4-12B-Q5_K_M.gguf`	7.96 GB	9.6 tok/s	244.9 tok/s
`gemma-4-12B-Q8_0.gguf`	11.8 GB	6.8 tok/s	267.2 tok/s

Generation speed = how fast the model outputs tokens (higher = better). Prompt processing = how fast it reads your input (higher = better). Results vary by hardware and system load.

🚀 How to Use

With Ollama

ollama run Dhptl/gemma-4-12b

With llama.cpp

./llama-cli -m gemma-4-12B-Q4_K_M.gguf -p "Your prompt here" -n 512

With LM Studio

Open LM Studio
Search for Dhptl/gemma-4-12B
Download your preferred quant
Load and chat

With Python (llama-cpp-python)

from llama_cpp import Llama

llm = Llama(
    model_path="./gemma-4-12B-Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # -1 = offload all layers to GPU
)

output = llm("Explain quantum computing in simple terms:", max_tokens=256)
print(output["choices"][0]["text"])

🔧 Quantization Details

Format	Bits	Description
`Q4_K_M`	4-bit	K-quantization, medium — Best size/quality balance
`Q5_K_M`	5-bit	K-quantization, medium — Higher quality
`Q8_0`	8-bit	Near-lossless — Largest GGUF file
`IQ4_XS`	~4-bit	Importance-matrix quant — Smallest with good quality

Quantization was done using llama.cpp.

ℹ️ About the Original Model

Original Model: google/gemma-4-12B
Architecture: Gemma 4 Unified (multimodal — text + vision capable)
Parameters: ~12 Billion
Context Length: 128K tokens
License: Gemma Terms of Use

💬 Feedback

If you find issues or have questions, open a discussion.

If these quants are useful to you, please ⭐ the repo!

Downloads last month: 2,049

GGUF

Model size

12B params

Architecture

gemma4

Hardware compatibility

4-bit

5-bit

8-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Dhptl/gemma-4-12B-GGUF

Base model

google/gemma-4-12B

Quantized

(34)

this model