gemma-4-12B-it-fp8

FP8-quantized version of google/gemma-4-12B-it (12B parameters, unified multimodal model). Produced and maintained by vrfai.

Quantization Details

This model was quantized with NVIDIA ModelOpt using the following configuration:

| Property | Value |
|---|---|
| Base model | google/gemma-4-12B-it |
| Quant method | NVIDIA ModelOpt (FP8 E4M3 - num_bits: (4, 3)) |
| Weight scheme | Per-channel (axis: 0) |
| Input activation | Dynamic per-token (type: dynamic) |
| Calibration dataset | CNN DailyMail (512 samples, max_seq_len 1024) |
| Calibration algorithm | max |
| Size | ~15 GB (vs ~23 GB BF16) |
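
To make the scheme in the table concrete, here is a minimal pure-Python sketch of row-wise FP8 E4M3 scaling. It is an illustration under stated assumptions, not the ModelOpt implementation: it assumes the E4M3 "fn" variant (largest finite value 448), the common convention scale = amax / 448, and a weight layout of [out_features, in_features], so that axis 0 gives one scale per output row. The helper names (`round_to_e4m3`, `row_scales`, `fake_quantize`) and the sample values are hypothetical.

```python
import math

E4M3_MAX = 448.0  # largest finite E4M3 value, assuming the "fn" variant


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest representable E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    # frexp gives a = m * 2**ex with m in [0.5, 1), so floor(log2(a)) == ex - 1.
    _, ex = math.frexp(a)
    e = max(ex - 1, -6)    # -6 is the smallest normal exponent; smaller values are subnormal
    step = 2.0 ** (e - 3)  # 3 mantissa bits give 8 steps per power of two
    return sign * round(a / step) * step  # Python's round() is round-half-to-even


def row_scales(matrix):
    """One scale per row, from the row's absolute maximum (assumed amax / 448)."""
    scales = []
    for row in matrix:
        amax = max(abs(v) for v in row)
        scales.append(amax / E4M3_MAX if amax > 0 else 1.0)
    return scales


def fake_quantize(matrix):
    """Quantize each row with its own scale, then dequantize back to float."""
    scales = row_scales(matrix)
    dequantized = [
        [round_to_e4m3(v / s) * s for v in row] for row, s in zip(matrix, scales)
    ]
    return dequantized, scales


# Hypothetical weight matrix [out_features, in_features]: one static scale per output channel.
weights = [[0.02, -0.5, 0.11], [3.0, -1.2, 0.4]]
w_deq, w_scales = fake_quantize(weights)

# Hypothetical activations [tokens, hidden]: one scale per token, recomputed for every input.
activations = [[1.5, -0.25, 0.75], [12.0, 0.1, -6.0]]
a_deq, a_scales = fake_quantize(activations)
```

The same row-wise routine serves both cases. The difference is when the scale is computed: weight scales are fixed once at quantization time, while dynamic per-token activation scales are derived from each input at inference time.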

Excluded from Quantization

The following modules are left unquantized in BF16 to preserve accuracy:

  • lm_head
  • model.embed_vision*
  • model.embed_audio*
  • All self_attn layers (layers 0–47)
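
As a rough illustration of how wildcard entries like those above select modules, the sketch below matches module names against glob patterns with Python's standard `fnmatch` module. It is not ModelOpt's matching code. The `*self_attn*` pattern is an assumed spelling of "all self_attn layers", and the module names are hypothetical, not read from the checkpoint.

```python
from fnmatch import fnmatchcase

# Patterns mirroring the exclusion list; "*self_attn*" is an assumed glob for the attention modules.
EXCLUDE_PATTERNS = [
    "lm_head",
    "model.embed_vision*",
    "model.embed_audio*",
    "*self_attn*",
]


def is_excluded(module_name: str) -> bool:
    """Return True if the module name matches any exclusion pattern."""
    return any(fnmatchcase(module_name, pattern) for pattern in EXCLUDE_PATTERNS)


# Hypothetical module names, for illustration only.
names = [
    "lm_head",
    "model.embed_vision.embedding_projection",
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.mlp.gate_proj",
]

kept_bf16 = [n for n in names if is_excluded(n)]
quantized = [n for n in names if not is_excluded(n)]
```

With these inputs only the MLP projection ends up in `quantized`. The other three names match an exclusion pattern and stay in BF16.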

Quantization Script

The recipes and scripts used to quantize this model can be found in the following repository:
