Gemma 4 31B QAT: 16-bit Quality Quantization 🔥

First Quantization-Aware Training (QAT) for Gemma 4 31B — achieving quantization quality on par with 16-bit (BF16) inference.

🎯 Project Goal

Produce a 4-bit quantized Gemma 4 31B that is indistinguishable from the original BF16 model in output quality.

🏆 Hackathon Impact

Novelty: No QAT models exist for Gemma 4 (Google only released QAT for Gemma 3)
Impact: Enables running Gemma 4 31B on consumer hardware (~20GB VRAM vs 62GB)
Technical depth: Combines cutting-edge QAT research with practical deployment

🔬 Technical Approach

Method: QAT + LoRA via Unsloth + TorchAO

Load Gemma-4-31B-it in 4-bit (base model)
Apply QAT with LoRA adapters (qat_scheme="int4")
Fine-tune on high-quality instruction data to adapt weights to quantization noise
Export to TorchAO Int4WeightOnlyConfig for inference
Convert to GGUF for broad llama.cpp deployment

Why QAT Beats PTQ

Post-training quantization (PTQ) simply rounds weights to lower precision, causing accuracy loss. QAT simulates quantization during training, allowing the model to learn how to compensate for quantization noise. Results from Unsloth on Gemma 3:

Gemma 3 4B: Recovered 66.9% of lost accuracy, +1.0% raw improvement
Gemma 3 12B: Recovered 45.5% of lost accuracy, +2.1% raw improvement

For Gemma 4 31B, we expect even better recovery due to larger model capacity (scaling law for QAT shows larger models → better quantization recovery).

📁 Repository Structure

File	Purpose
`train_gemma4_qat_32gb.py`	Main training script (optimized for RTX 5090 / 32GB VRAM)
`gemma4_qat_train.py`	Full-featured training with Trackio monitoring
`evaluate_gemma4_qat.py`	KL divergence evaluation vs BF16 baseline
`benchmark_qat.py`	Comprehensive benchmark suite (KL, PPL, speed, generation)
`convert_to_gguf.py`	Convert trained model to GGUF format
`gemma4_advanced_ptq.py`	Advanced PTQ with imatrix calibration (no training required)

🚀 Quick Start

Requirements

pip install --upgrade unsloth unsloth_zoo
pip install torchao==0.14.0 transformers datasets trl accelerate

Training

python train_gemma4_qat_32gb.py

Hardware: RTX 5090 (32GB) or A100 (80GB) Time: ~4-8 hours for 2000 steps VRAM: ~28GB peak with load_in_4bit=True

Evaluation

python evaluate_gemma4_qat.py

Convert to GGUF

python convert_to_gguf.py

📊 Evaluation Metrics

Metric	Target	Description
KL Divergence	< 0.001	Output distribution match vs BF16
Perplexity	Within 2% of BF16	WikiText-2 benchmark
MMLU-Pro	Within 1-2% of BF16	Knowledge benchmark
Generation Quality	Side-by-side	Human/LLM-as-judge

🔧 Hyperparameters

QAT_SCHEME = "int4"           # int4 weight-only quantization
LORA_R = 16                   # LoRA rank
LORA_ALPHA = 32               # LoRA alpha
LEARNING_RATE = 2e-4          # AdamW 8-bit
BATCH_SIZE = 1                # Per device
GRAD_ACCUMULATION = 8         # Effective batch = 8
MAX_STEPS = 2000              # Training steps
MAX_SEQ_LENGTH = 1024         # Training context

📝 Dataset

FineTome-100k (mlabonne/FineTome-100k) — high-quality instruction dataset with diverse reasoning, coding, math, and conversation data.

🖥️ Hardware Requirements

Stage	Hardware	VRAM	Notes
Inference (BF16)	A100 / H100	62 GB	Baseline
Inference (4-bit QAT)	RTX 4090 / 5090	20 GB	Target
QAT Training	RTX 5090 / A100 80GB	28-80 GB	With load_in_4bit=True

🔄 Export Formats

After training:

TorchAO (PyTorch native, vLLM compatible)
GGUF (llama.cpp, Ollama)
EXL2 (exllamav2)

📚 Research Background

Key Papers

Scaling Law for Quantization-Aware Training (2025)
- arXiv:2505.14302
- QAT error depends on model size, data size, and quantization granularity
- Larger models recover better from quantization
AWQ: Activation-aware Weight Quantization (2023)
- arXiv:2306.00978
- Protects "salient" weights based on activation magnitudes
Google Gemma 3 QAT (2025)
- Blog post
- Google used QAT to bring Gemma 3 to consumer GPUs

📎 Links

Base Model: google/gemma-4-31B-it
Unsloth QAT Docs: Quantization-Aware Training
TorchAO: GitHub

📄 License

Apache 2.0 (same as Gemma 4)

Author: lew96123
Hackathon: Gemma 4 Good Hackathon 2026

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Papers for lew96123/gemma-4-31b-qat

Scaling Law for Quantization-Aware Training

Paper • 2505.14302 • Published May 20, 2025 • 79

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Paper • 2306.00978 • Published Jun 1, 2023 • 13