Instructions to use CompressedGemma/gemma-4-26B-A4B-it-compressed with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use CompressedGemma/gemma-4-26B-A4B-it-compressed with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="CompressedGemma/gemma-4-26B-A4B-it-compressed", filename="Gemma-26B-it-quant.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use CompressedGemma/gemma-4-26B-A4B-it-compressed with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf CompressedGemma/gemma-4-26B-A4B-it-compressed # Run inference directly in the terminal: llama-cli -hf CompressedGemma/gemma-4-26B-A4B-it-compressed
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf CompressedGemma/gemma-4-26B-A4B-it-compressed # Run inference directly in the terminal: llama-cli -hf CompressedGemma/gemma-4-26B-A4B-it-compressed
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf CompressedGemma/gemma-4-26B-A4B-it-compressed # Run inference directly in the terminal: ./llama-cli -hf CompressedGemma/gemma-4-26B-A4B-it-compressed
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf CompressedGemma/gemma-4-26B-A4B-it-compressed # Run inference directly in the terminal: ./build/bin/llama-cli -hf CompressedGemma/gemma-4-26B-A4B-it-compressed
Use Docker
docker model run hf.co/CompressedGemma/gemma-4-26B-A4B-it-compressed
- LM Studio
- Jan
- Ollama
How to use CompressedGemma/gemma-4-26B-A4B-it-compressed with Ollama:
ollama run hf.co/CompressedGemma/gemma-4-26B-A4B-it-compressed
- Unsloth Studio new
How to use CompressedGemma/gemma-4-26B-A4B-it-compressed with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for CompressedGemma/gemma-4-26B-A4B-it-compressed to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for CompressedGemma/gemma-4-26B-A4B-it-compressed to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for CompressedGemma/gemma-4-26B-A4B-it-compressed to start chatting
- Pi new
How to use CompressedGemma/gemma-4-26B-A4B-it-compressed with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf CompressedGemma/gemma-4-26B-A4B-it-compressed
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "CompressedGemma/gemma-4-26B-A4B-it-compressed" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use CompressedGemma/gemma-4-26B-A4B-it-compressed with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf CompressedGemma/gemma-4-26B-A4B-it-compressed
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default CompressedGemma/gemma-4-26B-A4B-it-compressed
Run Hermes
hermes
- Docker Model Runner
How to use CompressedGemma/gemma-4-26B-A4B-it-compressed with Docker Model Runner:
docker model run hf.co/CompressedGemma/gemma-4-26B-A4B-it-compressed
- Lemonade
How to use CompressedGemma/gemma-4-26B-A4B-it-compressed with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull CompressedGemma/gemma-4-26B-A4B-it-compressed
Run and chat with the model
lemonade run user.gemma-4-26B-A4B-it-compressed-{{QUANT_TAG}}List all available models
lemonade list
YAML Metadata Warning:empty or missing yaml metadata in repo card
Check out the documentation for more information.
Gemma-4-26B-A4B-it — HPC Q2_K + Q4_0·HPC
The smallest functional quantization of Gemma 4 26B MoE. 10.6 GB. Runs on 12 GB hardware. 35-60 t/s on a single RTX 3060.
No other public quantization of this model fits in 12 GB VRAM. The smallest community quant (Q4_K_M) is ~17 GB and requires 20+ GB at runtime. This fits and runs with headroom to spare — on hardware that costs under $300.
This is not a typical Q2 quant. Standard Q2 quantization destroys reasoning capability.
HPC uses anisotropic error optimization — D₆ vesica gate error shaping + global belief propagation — to push quantization noise into dimensions orthogonal to the computation flow. The reasoning substrate survives intact.
Note: This model is not as powerful as 31B of course, but can still solve complex reasoning problems given enough context like the 25 horses prompt.
Model Details
| Base Model | google/gemma-4-26B-A4B-it |
| Architecture | Gemma 4 MoE — 25.8B total params, 4B active per token, 30 layers, 64 experts |
| Quantization | Mixed Q2_K (2.63 bpw) + Q4_0·HPC (4.5 bpw) |
| File Size | 9.5 GB |
| Format | GGUF v3 — compatible with llama.cpp, LM Studio, Ollama |
| Quantizer | HPC |
| iMatrix | 39 hours of activation sampling on coding benchmarks |
Precision Tiers
| Layer Type | Quantization | BPW | Method |
|---|---|---|---|
| Attention Q/K/V/O | Q4_0·HPC | 4.5 | 24-beam Hensel search + triality BP (16 candidates) |
| FFN gate/up/down | Q2_K·HPC | 2.63 | 24-beam Hensel search + triality BP (16×16 = 256 candidates) |
| MoE expert tensors | Q4_0·HPC | 4.5 | Non-256-aligned dims fallback (704, 2112 inner dims) |
| Embeddings / Norms / Router | F32 | 32 | Preserved |
Tensor Distribution
| Type | Count | Purpose |
|---|---|---|
| Q4_0·HPC | ~200 | Attention projections + MoE expert tensors |
| Q2_K·HPC | ~180 | Dense FFN / MLP weights |
| F32 | ~278 | Embeddings, norms, biases, router gates |
| Total | 658 |
Size Comparison
| Quantization | Size | Fits 12 GB? | Source |
|---|---|---|---|
| BF16 | 48.5 GB | ❌ | |
| Q8_0 | ~27 GB | ❌ | Community |
| Q6_K | ~22 GB | ❌ | Community |
| Q4_K_M | ~17 GB | ❌ | LM Studio / bartowski |
| IQ3_K_XXS | ~12 GB | ⚠️ | Unsloth |
| This | 10.6 GB | ✅ | HPC |
Quick Start
LM Studio
- Download the GGUF
- Place in your LM Studio models directory
- Load and chat — LM Studio auto-detects the Gemma 4 template
llama.cpp Server
# Download the updated Gemma 4 chat template (required for correct output)
curl -L -o gemma4_chat_template.jinja \
"https://huggingface.co/google/gemma-4-26B-A4B-it/raw/main/chat_template.jinja"
# Launch the server
llama-server \
-m Gemma-4-26B-A4B-it-Q2_K.gguf \
-ngl 0 \
-c 4096 \
--host 0.0.0.0 --port 8989 \
--jinja \
--chat-template-file gemma4_chat_template.jinja \
--cache-ram 0 \
-ctxcp 1
Important flags:
--jinja --chat-template-file— Uses Google's latest Gemma 4 template. The template embedded in older GGUFs is broken. Without this, you get garbage output.--cache-ram 0 -ctxcp 1— Prevents the sliding window attention checkpoint RAM explosion that affects all Gemma 4 models.-ngl 0— CPU-only. Increase for GPU offload (e.g.,-ngl 30for partial offload on 12 GB VRAM).
llama.cpp CLI
llama-cli \
-m Gemma-4-26B-A4B-it-Q2_K.gguf \
--jinja \
--chat-template-file gemma4_chat_template.jinja \
-p "Implement a lock-free MPSC queue in C" \
-n 512 --temp 0 --repeat-penalty 1.5 --no-mmap --reasoning-budget 4512
Ollama
⚠️ Ollama has known issues with Gemma 4. If you get garbage output, switch to llama.cpp server or LM Studio. This is an Ollama-side problem, not a model issue.
FROM ./Gemma-4-26B-A4B-it-Q2_K.gguf
PARAMETER temperature 0
PARAMETER num_ctx 4096
PARAMETER repeat_penalty 1.5
PARAMETER top_k 1
PARAMETER mlock true
API Usage
Once the server is running, use the OpenAI-compatible API:
curl http://localhost:8989/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "Write a concurrent hash map in C"}],
"temperature": 0,
"max_tokens": 512,
"repeat_penalty": 1.5
}'
Recommended Settings
| Parameter | Value | Why |
|---|---|---|
temperature |
0 | Deterministic — eliminates sampling noise at low BPW, prevents token repetition loops |
repeat_penalty |
1.5 | High penalty aggressively suppresses repeating tokens, critical for coherent output at 2-bit |
top_k |
1 | Greedy decoding — always pick the highest-probability token |
top_p |
1.0 | Disabled when temp=0 (no effect with greedy decoding) |
context |
2048–4096 | Higher contexts increase RAM usage significantly |
Why temp=0 with high repeat penalty? At 2-bit quantization, the probability distribution over tokens is noisier than the original model. Non-zero temperature amplifies this noise, causing the model to sample low-confidence tokens that trigger self-correction loops. Setting
temperature 0forces greedy decoding — always picking the most likely token — which keeps output on the model's strongest signal. The highrepeat_penalty(1.5) prevents the degenerate case where greedy decoding gets stuck in a loop, penalizing any token the model has already emitted.
How It Works
Standard quantizers use round-to-nearest: for each weight block, compute a scale and round. This uses HPC beam search with triality-enhanced belief propagation — a fundamentally different approach.
The Pipeline
┌─────────────────────────────────────────────────────────────┐
│ For each weight tensor: │
│ │
│ 1. Compute greedy reference scales per block │
│ 2. Generate candidate grid (16×16 = 256 scale variants) │
│ 3. Encode candidates as Z₆ complex amplitudes │
│ 4. Build constraint graph (inter-block coupling) │
│ 5. Run belief propagation in 3 simultaneous views: │
│ Edge × Vertex × Diagonal (triality) │
│ 6. Combine via geometric mean: │
│ marginal[v] = ∛(edge × vertex × diagonal) │
│ 7. 24-beam Hensel search using combined marginals │
│ (6,144 extensions evaluated per block) │
│ 8. D₆ vesica gate error shaping per sub-block │
│ 9. Pack into GGUF blocks with optimal scales │
└─────────────────────────────────────────────────────────────┘
D₆ Vesica Gate Error Shaping
After the beam search selects optimal scales, the vesica gate shapes the final rounding decisions within each 16-element sub-block. Instead of independent rounding, it decomposes the error vector using the D₆ antipodal fold:
vesica[k] = e[k] + e[k+3] → DC-like, propagates in dot products
wave[k] = e[k] - e[k+3] → noise-like, cancels in dot products
The gate greedily flips rounding decisions (floor↔ceil) to minimize vesica energy while allowing wave energy to increase. This exploits the local correlation structure of transformer weights — wave error cancels during inference because nearby weights in a sub-block tend to activate similarly.
Why Attention Gets Q4_0
Quantization noise in attention projections cascades through softmax(Q·K^T/√d)·V. A single bad scale in a Q block shifts dot products enough to promote wrong tokens — manifesting as:
- Korean/Arabic character injection
- Word substitutions
- Self-correction loops
Promoting Q/K/V/O to Q4_0 (16 levels vs 4) eliminates these artifacts at a cost of only ~1 GB.
Why Three Views?
Single-view BP can converge to locally optimal but globally poor configurations.
Running in three simultaneous bases (Edge=computational, Vertex=Fourier, Diagonal=conjugate) and combining via geometric mean prevents this.
The result: zero e-02 RMSE outliers across all attention tensors.
RMSE Quality
| Metric | Value |
|---|---|
| Q4_0·HPC token embedding RMSE | 1.27e-03 |
| Q4_0·HPC attention RMSE range | 2.4–3.1e-03 |
| Q2_K·HPC dense FFN RMSE range | 1.8–2.5e-02 |
| Q4_0·HPC MoE expert RMSE range | 1.8–2.0e-02 |
| e-02 outliers (attention) | 0 |
Reasoning Verification
All tests run at --temp 0 --repeat-penalty 1.5 on a single RTX 3060 12GB. Zero cherry-picking — every result shown is from the first attempt.
Algorithm Implementation
| Test | Difficulty | Result |
|---|---|---|
| Lock-Free MPSC Queue (C) — 1024-slot fixed ring buffer with C11 atomics | Expert | ✅ Correct lock-free algorithm, correct memory ordering, validated with multi-threaded test harness |
| Concurrent Hash Map (C) — thread-safe with fine-grained locking | Hard | ✅ Correct bucket-level locking, correct resize logic |
| Code Generation — coherent C and TypeScript output | Medium | ✅ No garbage tokens, no character injection, structurally correct |
What This Means
Standard Q2 quantization produces models that can barely maintain coherent conversation. This Q2 quant:
- Implements correct concurrent data structures from scratch
- Generates production-quality code without token corruption
- Achieves Q5-equivalent reasoning at Q2 file size
- Fits a 25.8B parameter MoE model in 9.5 GB on $300 hardware
The quantization noise is still there — the RMSE proves it — but the D₆ vesica gate has rotated it into dimensions the transformer doesn't use for reasoning.
Gemma 4 26B MoE Architecture
Unlike the 31B dense variant, the 26B uses Mixture of Experts (MoE) — only a fraction of parameters are active per token, making it faster at inference despite similar total parameter count.
| Property | Value |
|---|---|
| Total parameters | 25.8B |
| Active parameters per token | ~4B |
| Hidden size | 3072 |
| Layers | 30 |
| Attention heads | 32 |
| KV heads | 4 (GQA) |
| Head dim | 256 |
| Experts per layer | 64 |
| Active experts per token | 4 (top-k routing) |
| FFN intermediate | 2112 (per expert) |
| Sliding window | 1024 tokens |
| Full attention | Every 4th layer |
| Max context | 131,072 tokens |
| Vocab size | 262,144 |
| Activation | GeLU (tanh approx) |
MoE Expert Routing
Each MoE layer selects the top-4 experts per token via a learned router. At inference, only 4 of 64 experts fire — giving the model the capacity of 25.8B params with the compute cost of ~4B.
Token → Router (softmax) → Top-4 experts → Weighted sum → Output
64 experts available, 4 selected per token
MoE-Specific Quantization Handling
| Challenge | Solution |
|---|---|
| Expert inner dims (704, 2112) not multiples of 256 | Q4_0·HPC fallback (32 divides both) |
30 × 8 packed expert tensors (ffn_gate_up_exps) |
Chunked processing in C engine |
| Sparse activation (most experts idle per token) | Router weights preserved at F32 |
| Non-uniform weight distributions across experts | iMatrix-weighted per-expert importance |
| 39-hour iMatrix calibration | Coding benchmark data for MoE expert activation coverage |
Known Limitations
Safety alignment degradation — extreme quantization (< 3 BPW) can weaken RLHF guardrails. The model may comply with requests the original would refuse. Evaluate safety properties before deployment.
Ollama compatibility — Ollama's Gemma 4 support is unreliable as of April 2026. Use llama.cpp or LM Studio.
MoE expert tensor precision — expert tensors use Q4_0 instead of Q2_K due to non-256-aligned dimensions. This is a structural constraint of the Q2_K format, not a limitation of HPC.
Long-context (8K+) stress testing — verification suite covers < 5K tokens. Long-context coherence is expected to hold but has not been formally benchmarked.
Technical Details
Q2_K Block Layout (84 bytes / 256 weights)
Offset Size Field
0 16 scales[16] 4-bit scale | 4-bit min per sub-block
16 64 qs[64] packed 2-bit quants (4 per byte)
80 2 d fp16 super-block scale
82 2 dmin fp16 super-block min scale
Q4_0 Block Layout (18 bytes / 32 weights)
Offset Size Field
0 2 d fp16 block scale
2 16 qs[16] packed 4-bit quants (2 per byte)
nibble order: qs[j] = w[j] | (w[j+16] << 4)
License
This quantization inherits the Gemma license from the base model.
HPC is MIT.
Credits
Quantized with HPC — triality-enhanced belief propagation over hexagonal constraint graphs with D₆ vesica gate error shaping.
- Downloads last month
- 2,534
We're not able to determine the quantization variants.