Gemma 4 31B QAT: 16-bit Quality Quantization
First Quantization-Aware Training (QAT) for Gemma 4 31B, targeting quantization quality on par with 16-bit (BF16) inference.
Project Goal
Produce a 4-bit quantized Gemma 4 31B that is indistinguishable from the original BF16 model in output quality.
Hackathon Impact
- Novelty: No QAT models exist for Gemma 4 (Google only released QAT for Gemma 3)
- Impact: Enables running Gemma 4 31B on consumer hardware (~20GB VRAM vs 62GB for BF16)
- Technical depth: Combines cutting-edge QAT research with practical deployment
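The VRAM figures above follow from simple weight-size arithmetic. The sketch below is a rough estimate, assuming exactly 31 billion parameters and counting weights only. KV cache, activations, and quantization scales are ignored, which is one reason the ~20GB target sits above the raw 4-bit figure.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 31e9  # assumption: "31B" taken as exactly 31 billion parameters

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)

print(f"BF16 weights: {bf16_gb:.1f} GB")  # BF16 weights: 62.0 GB
print(f"INT4 weights: {int4_gb:.1f} GB")  # INT4 weights: 15.5 GB
```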
Technical Approach
Method: QAT + LoRA via Unsloth + TorchAO
- Load Gemma-4-31B-it in 4-bit (base model)
- Apply QAT with LoRA adapters (`qat_scheme="int4"`)
- Fine-tune on high-quality instruction data to adapt weights to quantization noise
- Export to TorchAO `Int4WeightOnlyConfig` for inference
- Convert to GGUF for broad llama.cpp deployment
Why QAT Beats PTQ
Post-training quantization (PTQ) simply rounds weights to lower precision, causing accuracy loss. QAT simulates quantization during training, allowing the model to learn how to compensate for quantization noise. Results from Unsloth on Gemma 3:
- Gemma 3 4B: Recovered 66.9% of lost accuracy, +1.0% raw improvement
- Gemma 3 12B: Recovered 45.5% of lost accuracy, +2.1% raw improvement
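One common way to read "recovered X% of lost accuracy" is as the share of the BF16-to-PTQ accuracy gap that QAT wins back. The sketch below assumes that definition, which is not confirmed here as the exact formula behind the Unsloth figures. The scores in it are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def recovered_fraction(bf16_acc: float, ptq_acc: float, qat_acc: float) -> float:
    """Share of the accuracy lost by PTQ (relative to BF16) that QAT regains."""
    gap = bf16_acc - ptq_acc
    if gap <= 0:
        raise ValueError("PTQ did not lose accuracy; recovery is undefined")
    return (qat_acc - ptq_acc) / gap


# Hypothetical benchmark scores (not measured results).
bf16_acc, ptq_acc, qat_acc = 60.0, 56.0, 59.0

print(f"recovered: {recovered_fraction(bf16_acc, ptq_acc, qat_acc):.1%}")  # recovered: 75.0%
print(f"raw improvement: {qat_acc - ptq_acc:+.1f} points")                 # raw improvement: +3.0 points
```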
For Gemma 4 31B, we expect even better recovery due to larger model capacity (the scaling law for QAT indicates that larger models recover better from quantization).
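The rounding step that both PTQ and QAT rely on can be illustrated with a quantize-then-dequantize round trip ("fake quantization"). PTQ applies it once to the finished weights. QAT applies it in every training forward pass, commonly with a straight-through estimator for the gradient, so the weights can shift to values that survive rounding. The sketch below is conceptual plain Python, not the TorchAO implementation; the symmetric scaling scheme and the group size of 4 are assumptions made for illustration.

```python
def fake_quantize_int4(weights, group_size=4):
    """Round-trip weights through a signed 4-bit grid, one scale per group."""
    qmin, qmax = -8, 7  # representable signed int4 range
    dequantized = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Symmetric scaling: the largest magnitude in the group maps to qmax.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        for w in group:
            q = max(qmin, min(qmax, round(w / scale)))  # integer code
            dequantized.append(q * scale)               # value the model sees
    return dequantized


# Hypothetical weights, two groups of four.
weights = [0.70, -0.35, 0.10, 0.02, -1.20, 0.45, 0.33, -0.08]
seen_by_model = fake_quantize_int4(weights)
errors = [abs(w - d) for w, d in zip(weights, seen_by_model)]
print("max rounding error:", max(errors))
```

Within each group the rounding error is bounded by half a quantization step, so a single large weight in a group coarsens the grid for its neighbours.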
Repository Structure
| File | Purpose |
|---|---|
| `train_gemma4_qat_32gb.py` | Main training script (optimized for RTX 5090 / 32GB VRAM) |
| `gemma4_qat_train.py` | Full-featured training with Trackio monitoring |
| `evaluate_gemma4_qat.py` | KL divergence evaluation vs BF16 baseline |
| `benchmark_qat.py` | Comprehensive benchmark suite (KL, PPL, speed, generation) |
| `convert_to_gguf.py` | Convert trained model to GGUF format |
| `gemma4_advanced_ptq.py` | Advanced PTQ with imatrix calibration (no training required) |
Quick Start
Requirements
pip install --upgrade unsloth unsloth_zoo
pip install torchao==0.14.0 transformers datasets trl accelerate
Training
python train_gemma4_qat_32gb.py
- Hardware: RTX 5090 (32GB) or A100 (80GB)
- Time: ~4-8 hours for 2000 steps
- VRAM: ~28GB peak with load_in_4bit=True
Evaluation
python evaluate_gemma4_qat.py
Convert to GGUF
python convert_to_gguf.py
Evaluation Metrics
| Metric | Target | Description |
|---|---|---|
| KL Divergence | < 0.001 | Output distribution match vs BF16 |
| Perplexity | Within 2% of BF16 | WikiText-2 benchmark |
| MMLU-Pro | Within 1-2% of BF16 | Knowledge benchmark |
| Generation Quality | Side-by-side | Human/LLM-as-judge |
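The first two metrics in the table can be sketched in plain Python. This is a minimal illustration, not the logic of `evaluate_gemma4_qat.py`. It assumes KL is taken as KL(BF16 || quantized) per token position, in nats, and that perplexity is the exponential of the mean negative log-likelihood. The logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) in nats for one token position."""
    ref_lp = log_softmax(ref_logits)
    test_lp = log_softmax(test_logits)
    return sum(math.exp(r) * (r - t) for r, t in zip(ref_lp, test_lp))


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over the evaluated tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical next-token logits over a 3-token vocabulary.
bf16_logits = [2.0, 1.0, 0.1]
quant_logits = [2.1, 0.9, 0.1]

kl = kl_divergence(bf16_logits, quant_logits)
print("KL divergence:", kl)
print("meets < 0.001 target:", kl < 0.001)
```

In practice the per-position KL would be averaged over many tokens of held-out text before being compared against the target.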
Hyperparameters
QAT_SCHEME = "int4" # int4 weight-only quantization
LORA_R = 16 # LoRA rank
LORA_ALPHA = 32 # LoRA alpha
LEARNING_RATE = 2e-4 # Learning rate (optimizer: AdamW 8-bit)
BATCH_SIZE = 1 # Per device
GRAD_ACCUMULATION = 8 # Effective batch = 8
MAX_STEPS = 2000 # Training steps
MAX_SEQ_LENGTH = 1024 # Training context
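A few derived quantities follow directly from these settings. The sketch below assumes a single-GPU run (`NUM_DEVICES` is a hypothetical name, not part of the training script) and the standard LoRA scaling factor of alpha / r. The token count is an upper bound, since sequences shorter than `MAX_SEQ_LENGTH` contribute fewer tokens.

```python
# Values copied from the hyperparameter list above.
LORA_R = 16
LORA_ALPHA = 32
BATCH_SIZE = 1
GRAD_ACCUMULATION = 8
MAX_STEPS = 2000
MAX_SEQ_LENGTH = 1024

NUM_DEVICES = 1  # assumption: single-GPU run

effective_batch = BATCH_SIZE * GRAD_ACCUMULATION * NUM_DEVICES
sequences_seen = effective_batch * MAX_STEPS
max_tokens_seen = sequences_seen * MAX_SEQ_LENGTH
lora_scale = LORA_ALPHA / LORA_R  # standard LoRA scaling (alpha / r)

print(effective_batch)   # 8
print(sequences_seen)    # 16000
print(max_tokens_seen)   # 16384000
print(lora_scale)        # 2.0
```

If FineTome-100k holds roughly 100k examples, as its name suggests, 16,000 sequences cover well under one epoch.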
Dataset
FineTome-100k (mlabonne/FineTome-100k): a high-quality instruction dataset with diverse reasoning, coding, math, and conversation data.
Hardware Requirements
| Stage | Hardware | VRAM | Notes |
|---|---|---|---|
| Inference (BF16) | A100 / H100 | 62 GB | Baseline |
| Inference (4-bit QAT) | RTX 4090 / 5090 | 20 GB | Target |
| QAT Training | RTX 5090 / A100 80GB | 28-80 GB | With load_in_4bit=True |
Export Formats
After training:
- TorchAO (PyTorch native, vLLM compatible)
- GGUF (llama.cpp, Ollama)
- EXL2 (exllamav2)
Research Background
Key Papers
Scaling Law for Quantization-Aware Training (2025)
- arXiv:2505.14302
- QAT error depends on model size, data size, and quantization granularity
- Larger models recover better from quantization
AWQ: Activation-aware Weight Quantization (2023)
- arXiv:2306.00978
- Protects "salient" weights based on activation magnitudes
Google Gemma 3 QAT (2025)
- Blog post
- Google used QAT to bring Gemma 3 to consumer GPUs
Links
- Base Model: google/gemma-4-31B-it
- Unsloth QAT Docs: Quantization-Aware Training
- TorchAO: GitHub
License
Apache 2.0 (same as Gemma 4)
Author: lew96123
Hackathon: Gemma 4 Good Hackathon 2026