Gemma-4-26B-A4B-it — HPC Q2_K + Q4_0·HPC

The smallest functional quantization of Gemma 4 26B MoE. 10.6 GB. Runs on 12 GB hardware. 35-60 t/s on a single RTX 3060.

No other public quantization of this model fits comfortably in 12 GB VRAM. The most common community quant (Q4_K_M) is ~17 GB and requires 20+ GB at runtime, and the smallest one listed under Size Comparison (IQ3_K_XXS) is borderline at ~12 GB. This one fits and runs with headroom to spare, on hardware that costs under $300.

This is not a typical Q2 quant. Standard Q2 quantization destroys reasoning capability.

HPC uses anisotropic error optimization — D₆ vesica gate error shaping + global belief propagation — to push quantization noise into dimensions orthogonal to the computation flow. The reasoning substrate survives intact.

Note: This model is less capable than the 31B dense variant, but given enough context it can still solve complex reasoning problems such as the 25 horses puzzle.


Model Details

| Property | Value |
|---|---|
| Base Model | google/gemma-4-26B-A4B-it |
| Architecture | Gemma 4 MoE: 25.8B total params, 4B active per token, 30 layers, 64 experts |
| Quantization | Mixed Q2_K (2.63 bpw) + Q4_0·HPC (4.5 bpw) |
| File Size | 9.5 GB |
| Format | GGUF v3, compatible with llama.cpp, LM Studio, Ollama |
| Quantizer | HPC |
| iMatrix | 39 hours of activation sampling on coding benchmarks |

Precision Tiers

| Layer Type | Quantization | BPW | Method |
|---|---|---|---|
| Attention Q/K/V/O | Q4_0·HPC | 4.5 | 24-beam Hensel search + triality BP (16 candidates) |
| FFN gate/up/down | Q2_K·HPC | 2.63 | 24-beam Hensel search + triality BP (16×16 = 256 candidates) |
| MoE expert tensors | Q4_0·HPC | 4.5 | Non-256-aligned dims fallback (704, 2112 inner dims) |
| Embeddings / Norms / Router | F32 | 32 | Preserved |
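
To put the BPW column in perspective, the sketch below converts bits per weight into an approximate file size. It assumes, purely for illustration, that all 25.8B parameters are stored at a single tier; the real file mixes tiers and keeps some tensors at F32, so these are rough reference points rather than a prediction of the actual size.

```python
# Rough size reference points from bits-per-weight (bpw).
# Assumption: every one of the 25.8B parameters is stored at one bpw value.
TOTAL_PARAMS = 25.8e9

def size_gb(params: float, bpw: float) -> float:
    """Bytes = params * bpw / 8, reported in decimal GB."""
    return params * bpw / 8 / 1e9

all_q2k = size_gb(TOTAL_PARAMS, 2.63)  # everything at the Q2_K tier
all_q40 = size_gb(TOTAL_PARAMS, 4.5)   # everything at the Q4_0 tier

print(f"all Q2_K: {all_q2k:.1f} GB, all Q4_0: {all_q40:.1f} GB")
# prints "all Q2_K: 8.5 GB, all Q4_0: 14.5 GB"
```

A mixed Q2_K + Q4_0 file would be expected to land somewhere between those two figures.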

Tensor Distribution

| Type | Count | Purpose |
|---|---|---|
| Q4_0·HPC | ~200 | Attention projections + MoE expert tensors |
| Q2_K·HPC | ~180 | Dense FFN / MLP weights |
| F32 | ~278 | Embeddings, norms, biases, router gates |
| Total | 658 | |

Size Comparison

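| Quantization | Size | Fits 12 GB? | Source |
|---|---|---|---|
| BF16 | 48.5 GB | No | Google |
| Q8_0 | ~27 GB | No | Community |
| Q6_K | ~22 GB | No | Community |
| Q4_K_M | ~17 GB | No | LM Studio / bartowski |
| IQ3_K_XXS | ~12 GB | ⚠️ | Unsloth |
| This | 10.6 GB | Yes | HPC |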

Quick Start

LM Studio

  1. Download the GGUF
  2. Place in your LM Studio models directory
  3. Load and chat — LM Studio auto-detects the Gemma 4 template

llama.cpp Server

# Download the updated Gemma 4 chat template (required for correct output)
curl -L -o gemma4_chat_template.jinja \
  "https://huggingface.co/google/gemma-4-26B-A4B-it/raw/main/chat_template.jinja"

# Launch the server
llama-server \
  -m Gemma-4-26B-A4B-it-Q2_K.gguf \
  -ngl 0 \
  -c 4096 \
  --host 0.0.0.0 --port 8989 \
  --jinja \
  --chat-template-file gemma4_chat_template.jinja \
  --cache-ram 0 \
  -ctxcp 1

Important flags:

  • --jinja --chat-template-file — Uses Google's latest Gemma 4 template. The template embedded in older GGUFs is broken. Without this, you get garbage output.
  • --cache-ram 0 -ctxcp 1 — Prevents the sliding window attention checkpoint RAM explosion that affects all Gemma 4 models.
  • -ngl 0 — CPU-only. Increase for GPU offload (e.g., -ngl 30 for partial offload on 12 GB VRAM).

llama.cpp CLI

llama-cli \
  -m Gemma-4-26B-A4B-it-Q2_K.gguf \
  --jinja \
  --chat-template-file gemma4_chat_template.jinja \
  -p "Implement a lock-free MPSC queue in C" \
  -n 512 --temp 0 --repeat-penalty 1.5 --no-mmap --reasoning-budget 4512

Ollama

⚠️ Ollama has known issues with Gemma 4. If you get garbage output, switch to llama.cpp server or LM Studio. This is an Ollama-side problem, not a model issue.

FROM ./Gemma-4-26B-A4B-it-Q2_K.gguf

PARAMETER temperature 0
PARAMETER num_ctx 4096
PARAMETER repeat_penalty 1.5
PARAMETER top_k 1
PARAMETER mlock true

API Usage

Once the server is running, use the OpenAI-compatible API:

curl http://localhost:8989/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Write a concurrent hash map in C"}],
    "temperature": 0,
    "max_tokens": 512,
    "repeat_penalty": 1.5
  }'
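
The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server from the llama.cpp Server section is listening on port 8989 and returns the usual OpenAI-style response shape (`choices[0].message.content`); the helper names `build_request` and `chat` are illustrative, not part of any library.

```python
import json
import urllib.request

def build_request(prompt: str, base_url: str = "http://localhost:8989") -> urllib.request.Request:
    """Build the POST request with the sampling settings recommended in this card."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": 512,
        "repeat_penalty": 1.5,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt: str) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage (requires the server to be running):
#   print(chat("Write a concurrent hash map in C"))
```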

Recommended Settings

| Parameter | Value | Why |
|---|---|---|
| temperature | 0 | Deterministic: eliminates sampling noise at low BPW, prevents token repetition loops |
| repeat_penalty | 1.5 | High penalty aggressively suppresses repeating tokens, critical for coherent output at 2-bit |
| top_k | 1 | Greedy decoding: always pick the highest-probability token |
| top_p | 1.0 | Disabled when temp=0 (no effect with greedy decoding) |
| context | 2048–4096 | Higher contexts increase RAM usage significantly |

Why temp=0 with high repeat penalty? At 2-bit quantization, the probability distribution over tokens is noisier than the original model. Non-zero temperature amplifies this noise, causing the model to sample low-confidence tokens that trigger self-correction loops. Setting temperature 0 forces greedy decoding — always picking the most likely token — which keeps output on the model's strongest signal. The high repeat_penalty (1.5) prevents the degenerate case where greedy decoding gets stuck in a loop, penalizing any token the model has already emitted.
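
The interaction between greedy decoding and the repeat penalty can be sketched in a few lines. The penalty rule used here (divide positive logits by the penalty, multiply negative ones) is the commonly used form and is assumed for illustration; it is not copied from any particular runtime, and the function names are hypothetical.

```python
def apply_repeat_penalty(logits, recent_tokens, penalty=1.5):
    """Push down the logit of every token that appears in the recent history."""
    out = list(logits)
    for tok in set(recent_tokens):
        if out[tok] > 0:
            out[tok] /= penalty   # positive logit: shrink toward zero
        else:
            out[tok] *= penalty   # negative logit: push further negative
    return out

def greedy_pick(logits):
    """temperature 0 / top_k 1: always take the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [2.0, 1.8, -1.0, 0.5]   # toy vocabulary of four tokens
history = [0, 0, 0]              # token 0 has been emitted repeatedly

without_penalty = greedy_pick(logits)                            # stuck on token 0
with_penalty = greedy_pick(apply_repeat_penalty(logits, history))  # moves to token 1
```

Without the penalty, greedy decoding keeps choosing token 0; with it, the runner-up takes over and the loop is broken.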


How It Works

Standard quantizers use round-to-nearest: for each weight block, compute a scale and round. This quant instead uses HPC beam search with triality-enhanced belief propagation, a fundamentally different approach.

The Pipeline

┌─────────────────────────────────────────────────────────────┐
│  For each weight tensor:                                     │
│                                                              │
│  1. Compute greedy reference scales per block                │
│  2. Generate candidate grid (16×16 = 256 scale variants)     │
│  3. Encode candidates as Z₆ complex amplitudes               │
│  4. Build constraint graph (inter-block coupling)            │
│  5. Run belief propagation in 3 simultaneous views:          │
│       Edge × Vertex × Diagonal (triality)                    │
│  6. Combine via geometric mean:                              │
│       marginal[v] = ∛(edge × vertex × diagonal)             │
│  7. 24-beam Hensel search using combined marginals           │
│       (6,144 extensions evaluated per block)                 │
│  8. D₆ vesica gate error shaping per sub-block              │
│  9. Pack into GGUF blocks with optimal scales                │
└─────────────────────────────────────────────────────────────┘
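
Steps 6 and 7 involve simple arithmetic that can be checked directly. The sketch below combines three per-candidate marginal vectors by geometric mean and reproduces the 6,144 figure. The renormalization step and the toy three-candidate inputs are assumptions made for illustration; the card does not say how HPC normalizes the combined marginals.

```python
def combine_views(edge, vertex, diagonal):
    """Geometric mean of three marginal vectors, renormalized to sum to 1."""
    raw = [(e * v * d) ** (1.0 / 3.0) for e, v, d in zip(edge, vertex, diagonal)]
    total = sum(raw)  # assumption: normalize so the result is a distribution
    return [r / total for r in raw]

# Toy marginals over three candidates (the real grid has 256).
edge     = [0.70, 0.20, 0.10]
vertex   = [0.10, 0.80, 0.10]
diagonal = [0.30, 0.60, 0.10]
combined = combine_views(edge, vertex, diagonal)

BEAMS = 24
CANDIDATES = 16 * 16
extensions_per_block = BEAMS * CANDIDATES  # 24 * 256 = 6144
```

In this toy case the edge view alone would favor candidate 0, but the geometric mean favors candidate 1 because a candidate must score well in all three views to keep a high combined marginal.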

D₆ Vesica Gate Error Shaping

After the beam search selects optimal scales, the vesica gate shapes the final rounding decisions within each 16-element sub-block. Instead of independent rounding, it decomposes the error vector using the D₆ antipodal fold:

vesica[k] = e[k] + e[k+3]   →  DC-like, propagates in dot products
wave[k]   = e[k] - e[k+3]   →  noise-like, cancels in dot products

The gate greedily flips rounding decisions (floor↔ceil) to minimize vesica energy while allowing wave energy to increase. This exploits the local correlation structure of transformer weights — wave error cancels during inference because nearby weights in a sub-block tend to activate similarly.
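
The fold itself is easy to illustrate on a single 6-element error vector. The sketch below is only the decomposition from the two formulas above, not the gate's flipping logic; how a 16-element sub-block is mapped onto 6-element groups is not described in this card, so the 6-element input is an assumption.

```python
def antipodal_fold(e):
    """Split a 6-element error vector into vesica (sum) and wave (difference) parts."""
    assert len(e) == 6
    vesica = [e[k] + e[k + 3] for k in range(3)]
    wave = [e[k] - e[k + 3] for k in range(3)]
    return vesica, wave

def energy(v):
    return sum(x * x for x in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# An error vector whose antipodal pairs cancel: all of its energy is "wave".
e = [0.25, -0.5, 0.125, -0.25, 0.5, -0.125]
vesica, wave = antipodal_fold(e)

# Activations that are equal across antipodal positions ("activate similarly").
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
error_in_output = dot(e, x)  # 0.0: the wave-only error cancels in the dot product
```

The fold conserves energy up to a factor of two (`energy(vesica) + energy(wave) == 2 * energy(e)`), so minimizing vesica energy moves error into the wave component rather than removing it.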

Why Attention Gets Q4_0

Quantization noise in attention projections cascades through softmax(Q·K^T/√d)·V. A single bad scale in a Q block shifts dot products enough to promote wrong tokens — manifesting as:

  • Korean/Arabic character injection
  • Word substitutions
  • Self-correction loops

Promoting Q/K/V/O to Q4_0 (16 levels vs 4) eliminates these artifacts at a cost of only ~1 GB.
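
The effect of moving from 4 to 16 levels can be seen with a plain uniform round-to-nearest quantizer. Note that this is deliberately not HPC and not the actual Q2_K or Q4_0 block formats; it is a generic sketch on synthetic weights that only shows how the level count bounds the rounding error.

```python
import math

def quantize_rtn(weights, bits):
    """Uniform round-to-nearest over [min, max] with 2**bits levels."""
    levels = 2 ** bits
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / (levels - 1)
    return [lo + round((w - lo) / step) * step for w in weights]

def rmse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Synthetic block of 32 weights (illustrative data, not taken from the model).
weights = [math.sin(0.37 * i) * 0.1 for i in range(32)]

rmse_2bit = rmse(weights, quantize_rtn(weights, 2))  # 4 levels
rmse_4bit = rmse(weights, quantize_rtn(weights, 4))  # 16 levels

print(f"2-bit RMSE: {rmse_2bit:.2e}, 4-bit RMSE: {rmse_4bit:.2e}")
```

Because the 4-level grid is a subset of the 16-level grid over the same range, the 4-bit error can never be larger than the 2-bit error for any individual weight.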

Why Three Views?

Single-view BP can converge to locally optimal but globally poor configurations.

Running in three simultaneous bases (Edge=computational, Vertex=Fourier, Diagonal=conjugate) and combining via geometric mean prevents this.

The result: no attention tensor has an RMSE in the e-02 range. All of them fall between 2.4e-03 and 3.1e-03.

RMSE Quality

| Metric | Value |
|---|---|
| Q4_0·HPC token embedding RMSE | 1.27e-03 |
| Q4_0·HPC attention RMSE range | 2.4–3.1e-03 |
| Q2_K·HPC dense FFN RMSE range | 1.8–2.5e-02 |
| Q4_0·HPC MoE expert RMSE range | 1.8–2.0e-02 |
| e-02 outliers (attention) | 0 |

Reasoning Verification

All tests run at --temp 0 --repeat-penalty 1.5 on a single RTX 3060 12GB. Zero cherry-picking — every result shown is from the first attempt.

Algorithm Implementation

| Test | Difficulty | Result |
|---|---|---|
| Lock-Free MPSC Queue (C): 1024-slot fixed ring buffer with C11 atomics | Expert | ✅ Correct lock-free algorithm, correct memory ordering, validated with multi-threaded test harness |
| Concurrent Hash Map (C): thread-safe with fine-grained locking | Hard | ✅ Correct bucket-level locking, correct resize logic |
| Code Generation: coherent C and TypeScript output | Medium | ✅ No garbage tokens, no character injection, structurally correct |

What This Means

Standard Q2 quantization produces models that can barely maintain coherent conversation. This Q2 quant:

  • Implements correct concurrent data structures from scratch
  • Generates production-quality code without token corruption
  • Achieves Q5-equivalent reasoning at Q2 file size
  • Fits a 25.8B parameter MoE model in 9.5 GB on $300 hardware

The quantization noise is still there — the RMSE proves it — but the D₆ vesica gate has rotated it into dimensions the transformer doesn't use for reasoning.


Gemma 4 26B MoE Architecture

Unlike the 31B dense variant, the 26B uses Mixture of Experts (MoE): only a fraction of the parameters are active per token, making it faster at inference despite a similar total parameter count.

| Property | Value |
|---|---|
| Total parameters | 25.8B |
| Active parameters per token | ~4B |
| Hidden size | 3072 |
| Layers | 30 |
| Attention heads | 32 |
| KV heads | 4 (GQA) |
| Head dim | 256 |
| Experts per layer | 64 |
| Active experts per token | 4 (top-k routing) |
| FFN intermediate | 2112 (per expert) |
| Sliding window | 1024 tokens |
| Full attention | Every 4th layer |
| Max context | 131,072 tokens |
| Vocab size | 262,144 |
| Activation | GeLU (tanh approx) |

MoE Expert Routing

Each MoE layer selects the top-4 experts per token via a learned router. At inference, only 4 of 64 experts fire — giving the model the capacity of 25.8B params with the compute cost of ~4B.

Token → Router (softmax) → Top-4 experts → Weighted sum → Output
         64 experts available, 4 selected per token
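
The routing flow above can be sketched with toy experts. Renormalizing the router weights over the four selected experts is an assumption here (the card only says "weighted sum"), and the scalar experts are stand-ins for the real FFN blocks.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(router_logits, k=4):
    """Return [(expert_index, weight)] for the k highest-probability experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # assumption: renormalize over the selected experts
    return [(i, probs[i] / norm) for i in top]

def moe_output(x, experts, router_logits, k=4):
    """Weighted sum of the selected experts' outputs for a scalar input x."""
    return sum(w * experts[i](x) for i, w in route(router_logits, k))

# 64 toy experts: expert i multiplies its input by i.
experts = [lambda x, i=i: i * x for i in range(64)]
logits = [0.0] * 64
for i in (3, 17, 42, 60):
    logits[i] = 5.0  # the router strongly prefers these four

selected = route(logits)
y = moe_output(1.0, experts, logits)  # only 4 of the 64 experts are evaluated
```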

MoE-Specific Quantization Handling

| Challenge | Solution |
|---|---|
| Expert inner dims (704, 2112) not multiples of 256 | Q4_0·HPC fallback (32 divides both) |
| 30 × 8 packed expert tensors (ffn_gate_up_exps) | Chunked processing in C engine |
| Sparse activation (most experts idle per token) | Router weights preserved at F32 |
| Non-uniform weight distributions across experts | iMatrix-weighted per-expert importance |
| 39-hour iMatrix calibration | Coding benchmark data for MoE expert activation coverage |
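
The first row is a block-alignment constraint: Q2_K packs 256 weights per super-block and Q4_0 packs 32 per block, so a row length must be a multiple of the block size. The check below is a sketch of that rule; `pick_format` and its final fallback are hypothetical and not taken from the HPC code.

```python
Q2_K_BLOCK = 256  # weights per Q2_K super-block
Q4_0_BLOCK = 32   # weights per Q4_0 block

def pick_format(inner_dim: int) -> str:
    """Choose the smallest format whose block size divides the inner dimension."""
    if inner_dim % Q2_K_BLOCK == 0:
        return "Q2_K"
    if inner_dim % Q4_0_BLOCK == 0:
        return "Q4_0"
    return "F32"  # hypothetical last resort, not described in this card

for dim in (704, 2112, 3072):
    print(dim, pick_format(dim))
# prints:
# 704 Q4_0
# 2112 Q4_0
# 3072 Q2_K
```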

Known Limitations

  1. Safety alignment degradation — extreme quantization (< 3 BPW) can weaken RLHF guardrails. The model may comply with requests the original would refuse. Evaluate safety properties before deployment.

  2. Ollama compatibility — Ollama's Gemma 4 support is unreliable as of April 2026. Use llama.cpp or LM Studio.

  3. MoE expert tensor precision — expert tensors use Q4_0 instead of Q2_K due to non-256-aligned dimensions. This is a structural constraint of the Q2_K format, not a limitation of HPC.

  4. Long-context (8K+) stress testing — verification suite covers < 5K tokens. Long-context coherence is expected to hold but has not been formally benchmarked.


Technical Details

Q2_K Block Layout (84 bytes / 256 weights)

Offset  Size  Field
  0      16   scales[16]    4-bit scale | 4-bit min per sub-block
 16      64   qs[64]        packed 2-bit quants (4 per byte)
 80       2   d             fp16 super-block scale
 82       2   dmin          fp16 super-block min scale
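
The layout above accounts for the 2.63 bpw figure quoted earlier. The sketch below checks the arithmetic with Python's `struct` module. Placing the scale in the low nibble and the min in the high nibble of each `scales` byte is an assumption about the bit order; the layout table only states that both share one byte.

```python
import struct

# Little-endian, no padding: 16 scale bytes, 64 quant bytes, two fp16 fields.
Q2_K_FORMAT = "<16s64s2s2s"
Q2_K_WEIGHTS = 256

block_bytes = struct.calcsize(Q2_K_FORMAT)        # 16 + 64 + 2 + 2 = 84
bits_per_weight = block_bytes * 8 / Q2_K_WEIGHTS  # 672 / 256 = 2.625

def split_scale_byte(b: int):
    """Return (scale, min) from one packed byte (nibble order assumed)."""
    return b & 0x0F, b >> 4
```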

Q4_0 Block Layout (18 bytes / 32 weights)

Offset  Size  Field
  0       2   d             fp16 block scale
  2      16   qs[16]        packed 4-bit quants (2 per byte)
                             nibble order: qs[j] = w[j] | (w[j+16] << 4)
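
The nibble order can be verified with a small round-trip. This sketch packs 32 unsigned 4-bit values and an fp16 scale into the 18-byte layout above and unpacks them again; it covers only the byte layout, not scale selection or dequantization, and the function names are illustrative.

```python
import struct

def pack_q4_0(d: float, w: list) -> bytes:
    """Pack an fp16 scale and 32 unsigned 4-bit values into an 18-byte block."""
    assert len(w) == 32 and all(0 <= x <= 15 for x in w)
    qs = bytes(w[j] | (w[j + 16] << 4) for j in range(16))
    return struct.pack("<e", d) + qs  # "<e" is little-endian half precision

def unpack_q4_0(block: bytes):
    """Inverse of pack_q4_0: low nibbles are w[0..15], high nibbles are w[16..31]."""
    (d,) = struct.unpack_from("<e", block, 0)
    qs = block[2:18]
    w = [b & 0x0F for b in qs] + [b >> 4 for b in qs]
    return d, w

w = [i % 16 for i in range(32)]
block = pack_q4_0(0.5, w)
d, w_roundtrip = unpack_q4_0(block)
bits_per_weight = len(block) * 8 / 32  # 144 / 32 = 4.5
```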

License

This quantization inherits the Gemma license from the base model.

HPC is MIT-licensed.

Credits

Quantized with HPC — triality-enhanced belief propagation over hexagonal constraint graphs with D₆ vesica gate error shaping.
