Gemma 4 31B QAT: 16-bit Quality Quantization 🔥

First Quantization-Aware Training (QAT) for Gemma 4 31B, aiming for 4-bit quantization quality on par with 16-bit (BF16) inference.

🎯 Project Goal

Produce a 4-bit quantized Gemma 4 31B that is indistinguishable from the original BF16 model in output quality.

🏆 Hackathon Impact

  • Novelty: No QAT models exist for Gemma 4 (Google has released QAT checkpoints only for Gemma 3)
  • Impact: Enables running Gemma 4 31B on consumer hardware (~20 GB VRAM vs. 62 GB for BF16)
  • Technical depth: Combines cutting-edge QAT research with practical deployment
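
The 62 GB figure above is consistent with simple weight-size arithmetic. The sketch below is only a back-of-the-envelope estimate: it assumes "31B" means exactly 31 billion parameters and counts weights only. The helper name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: "31B" is taken as exactly 31 billion parameters.
N_PARAMS = 31e9

bf16_gb = weight_memory_gb(N_PARAMS, 16)  # 16 bits per weight
int4_gb = weight_memory_gb(N_PARAMS, 4)   # 4 bits per weight

print(f"BF16 weights: {bf16_gb:.1f} GB")
print(f"INT4 weights: {int4_gb:.1f} GB")
```

Under these assumptions the BF16 weights come to 62 GB and the 4-bit weights to 15.5 GB. The gap up to the ~20 GB target would plausibly be taken by quantization scales, any layers kept in higher precision, the KV cache, and activations, though this README does not give that breakdown.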

🔬 Technical Approach

Method: QAT + LoRA via Unsloth + TorchAO

  1. Load Gemma-4-31B-it in 4-bit (base model)
  2. Apply QAT with LoRA adapters (qat_scheme="int4")
  3. Fine-tune on high-quality instruction data to adapt weights to quantization noise
  4. Export to TorchAO Int4WeightOnlyConfig for inference
  5. Convert to GGUF for broad llama.cpp deployment

Why QAT Beats PTQ

Post-training quantization (PTQ) rounds already-trained weights to lower precision, which costs accuracy. QAT simulates quantization during training, so the model learns to compensate for quantization noise. Results reported by Unsloth for Gemma 3:

  • Gemma 3 4B: Recovered 66.9% of lost accuracy, +1.0% raw improvement
  • Gemma 3 12B: Recovered 45.5% of lost accuracy, +2.1% raw improvement
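
One way to read "recovered X% of lost accuracy" is as the share of the BF16-to-PTQ accuracy gap that QAT wins back. The sketch below assumes that definition and assumes the "raw improvement" is measured in accuracy points; neither is confirmed by the source. The function names and the 70/66/68 scores are hypothetical.

```python
def recovery_fraction(baseline: float, ptq: float, qat: float) -> float:
    """Share of the accuracy lost to PTQ that QAT wins back (assumed definition)."""
    lost = baseline - ptq
    if lost <= 0:
        raise ValueError("PTQ did not lose accuracy, so recovery is undefined")
    return (qat - ptq) / lost


def implied_loss_points(raw_gain: float, recovered: float) -> float:
    """Accuracy gap implied by a raw gain and a recovery fraction."""
    return raw_gain / recovered


# Hypothetical scores, not measurements: BF16 = 70, PTQ = 66, QAT = 68.
print(recovery_fraction(70.0, 66.0, 68.0))

# Figures quoted above for Gemma 3, under the assumed definition.
print(implied_loss_points(1.0, 0.669))  # Gemma 3 4B
print(implied_loss_points(2.1, 0.455))  # Gemma 3 12B
```

Read this way, the quoted numbers would imply that PTQ cost roughly 1.5 points on the 4B model and roughly 4.6 points on the 12B model.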

For Gemma 4 31B, we expect even better recovery because of the larger model capacity: the scaling law for QAT (arXiv:2505.14302) indicates that larger models recover better from quantization.
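
To make "simulates quantization during training" concrete, here is a minimal pure-Python sketch of symmetric int4 fake quantization: weights are rounded to a 4-bit grid and immediately mapped back to floats, so the forward pass sees the rounding error. It is not the TorchAO or Unsloth implementation. The function name, the tiny group size, and the sample weights are illustrative assumptions, and real int4 schemes typically use much larger groups.

```python
def fake_quantize_int4(weights, group_size=4):
    """Quantize to signed int4 per group, then dequantize back to floats."""
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Symmetric scale: the largest magnitude in the group maps to level 7.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        for w in group:
            q = max(-8, min(7, round(w / scale)))  # clamp to the int4 range
            out.append(q * scale)
    return out


# Illustrative weights only.
weights = [0.31, -0.12, 0.05, 0.70, -0.44, 0.02, 0.18, -0.09]
dequantized = fake_quantize_int4(weights)
errors = [abs(w - d) for w, d in zip(weights, dequantized)]

for w, d, e in zip(weights, dequantized, errors):
    print(f"{w:+.3f} -> {d:+.3f} (error {e:.3f})")
```

Each error is bounded by half the group's scale. PTQ applies this rounding once, after training. QAT typically applies it inside every forward pass while gradients still update the underlying high-precision weights, which is how the model can adapt to the noise.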

📁 Repository Structure

| File | Purpose |
| --- | --- |
| train_gemma4_qat_32gb.py | Main training script (optimized for RTX 5090 / 32GB VRAM) |
| gemma4_qat_train.py | Full-featured training with Trackio monitoring |
| evaluate_gemma4_qat.py | KL divergence evaluation vs BF16 baseline |
| benchmark_qat.py | Comprehensive benchmark suite (KL, PPL, speed, generation) |
| convert_to_gguf.py | Convert trained model to GGUF format |
| gemma4_advanced_ptq.py | Advanced PTQ with imatrix calibration (no training required) |

🚀 Quick Start

Requirements

pip install --upgrade unsloth unsloth_zoo
pip install torchao==0.14.0 transformers datasets trl accelerate

Training

python train_gemma4_qat_32gb.py

Hardware: RTX 5090 (32GB) or A100 (80GB)
Time: ~4-8 hours for 2000 steps
VRAM: ~28GB peak with load_in_4bit=True

Evaluation

python evaluate_gemma4_qat.py

Convert to GGUF

python convert_to_gguf.py

📊 Evaluation Metrics

| Metric | Target | Description |
| --- | --- | --- |
| KL Divergence | < 0.001 | Output distribution match vs BF16 |
| Perplexity | Within 2% of BF16 | WikiText-2 benchmark |
| MMLU-Pro | Within 1-2% of BF16 | Knowledge benchmark |
| Generation Quality | Side-by-side | Human/LLM-as-judge |
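
The first two metrics can be sketched in plain Python. This is a minimal illustration, not the code in evaluate_gemma4_qat.py. It assumes the BF16 model is the reference distribution in KL(P || Q) and that the result is in nats for a single token position; the README does not state the direction or how positions are averaged. The logits and log-probabilities are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(ref_logits, test_logits):
    """KL(P || Q) in nats for one token position; P is the reference model."""
    p = softmax(ref_logits)
    q = softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) over tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Made-up logits over a 3-token vocabulary.
bf16_logits = [2.0, 1.0, 0.1]
qat_logits = [1.98, 1.02, 0.11]
kl = kl_divergence(bf16_logits, qat_logits)
print(f"KL(BF16 || QAT) = {kl:.6f} nats")

# Made-up per-token log-probabilities.
print(f"Perplexity = {perplexity([-1.2, -0.4, -2.3, -0.7]):.3f}")
```

In a full evaluation, the per-position KL would be averaged over many tokens of held-out text before comparing it with the < 0.001 target.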

🔧 Hyperparameters

QAT_SCHEME = "int4"           # int4 weight-only quantization
LORA_R = 16                   # LoRA rank
LORA_ALPHA = 32               # LoRA alpha
LEARNING_RATE = 2e-4          # Learning rate (AdamW 8-bit optimizer)
BATCH_SIZE = 1                # Per device
GRAD_ACCUMULATION = 8         # Effective batch = 8
MAX_STEPS = 2000              # Training steps
MAX_SEQ_LENGTH = 1024         # Training context
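
A few derived quantities follow from these settings. The sketch below assumes a single GPU, the standard LoRA scaling factor alpha / r (a variant such as rsLoRA would scale differently), and that every sequence fills the full context, so the token count is an upper bound.

```python
# Values copied from the hyperparameters above.
BATCH_SIZE = 1
GRAD_ACCUMULATION = 8
MAX_STEPS = 2000
MAX_SEQ_LENGTH = 1024
LORA_R = 16
LORA_ALPHA = 32

# Assumption: one GPU, so no data-parallel multiplier.
effective_batch = BATCH_SIZE * GRAD_ACCUMULATION

# Sequences consumed over the whole run.
sequences_seen = effective_batch * MAX_STEPS

# Upper bound on tokens: assumes every sequence is MAX_SEQ_LENGTH long.
max_tokens_seen = sequences_seen * MAX_SEQ_LENGTH

# Standard LoRA scaling applied to the adapter update.
lora_scaling = LORA_ALPHA / LORA_R

print(f"Effective batch size: {effective_batch}")
print(f"Sequences seen:       {sequences_seen:,}")
print(f"Max tokens seen:      {max_tokens_seen:,}")
print(f"LoRA scaling:         {lora_scaling}")
```

With these values the run sees 16,000 sequences and at most about 16.4 million tokens, so a dataset of roughly 100k examples would not be covered in full.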

📝 Dataset

FineTome-100k (mlabonne/FineTome-100k): a high-quality instruction dataset with diverse reasoning, coding, math, and conversation data.

🖥️ Hardware Requirements

| Stage | Hardware | VRAM | Notes |
| --- | --- | --- | --- |
| Inference (BF16) | A100 / H100 | 62 GB | Baseline |
| Inference (4-bit QAT) | RTX 4090 / 5090 | 20 GB | Target |
| QAT Training | RTX 5090 / A100 80GB | 28-80 GB | With load_in_4bit=True |

🔄 Export Formats

After training:

  1. TorchAO (PyTorch native, vLLM compatible)
  2. GGUF (llama.cpp, Ollama)
  3. EXL2 (exllamav2)

📚 Research Background

Key Papers

  1. Scaling Law for Quantization-Aware Training (2025)
    • arXiv:2505.14302
    • QAT error depends on model size, data size, and quantization granularity
    • Larger models recover better from quantization
  2. AWQ: Activation-aware Weight Quantization (2023)
  3. Google Gemma 3 QAT (2025)
    • Blog post
    • Google used QAT to bring Gemma 3 to consumer GPUs

📎 Links

📄 License

Apache 2.0 (same as Gemma 4)


Author: lew96123
Hackathon: Gemma 4 Good Hackathon 2026
