gemma-4-31B-it-qat-FP8

google/gemma-4-31B-it-qat-q4_0-unquantized is a 31-billion-parameter instruction-tuned multimodal model from Google DeepMind. It was optimized with Quantization-Aware Training (QAT) for Q4_0 and is released as an unquantized checkpoint for research, custom compilation, and downstream quantization workflows. The model accepts text and image inputs and generates text. It has a 256K-token context window, native reasoning ("thinking") capabilities, function calling, and multilingual support across 140+ languages, with strong performance in coding, reasoning, document understanding, and long-context tasks. Unlike the GGUF release, this checkpoint preserves the QAT-trained weights before final deployment quantization. That makes it particularly suitable for experimentation with custom inference engines, FP8/NVFP4 quantization, and production optimization frameworks, while keeping quality close to the original high-precision model.

recipe.yaml

default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head, 're:.*vision_tower.*', 're:.*embed_vision.*']
      scheme: FP8_DYNAMIC
      bypass_divisibility_checks: false
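As a rough illustration of what the recipe's `targets` and `ignore` lists select, the sketch below approximates the matching in plain Python: entries prefixed with `re:` are treated as regular expressions on the module name, and other entries are compared against the module name or its class name. This is a simplified stand-in, not llm-compressor's actual matching code, and the module names used in the usage note are hypothetical placeholders rather than names taken from the real checkpoint.

```python
import re

# Values copied from recipe.yaml.
TARGETS = ["Linear"]
IGNORE = ["lm_head", "re:.*vision_tower.*", "re:.*embed_vision.*"]


def matches(pattern: str, module_name: str, class_name: str) -> bool:
    """Approximate matching rule (an assumption, simplified):
    're:' entries are regexes matched against the module name;
    plain entries must equal the module name or the class name."""
    if pattern.startswith("re:"):
        return re.match(pattern[3:], module_name) is not None
    return pattern in (module_name, class_name)


def is_quantized(module_name: str, class_name: str) -> bool:
    """A module is quantized if it hits a target and no ignore entry."""
    targeted = any(matches(p, module_name, class_name) for p in TARGETS)
    ignored = any(matches(p, module_name, class_name) for p in IGNORE)
    return targeted and not ignored
```

With hypothetical module names, `is_quantized("model.language_model.layers.0.self_attn.q_proj", "Linear")` is true, while `is_quantized("model.vision_tower.encoder.layers.0.mlp.fc1", "Linear")` and `is_quantized("lm_head", "Linear")` are false. Under this reading, the language-model Linear layers are converted to FP8 and the vision components and output head keep their original precision.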

llm-compressor

An open-source library developed by the vLLM team, designed to optimize Large Language Models (LLMs) for production deployment: https://github.com/vllm-project/llm-compressor
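For readers unfamiliar with the `FP8_DYNAMIC` scheme named in the recipe, the following is a minimal numeric sketch of the scaling step. It assumes the common reading of that preset: weights get one static scale per output channel, and activations get one scale per token computed at runtime. Both map the largest magnitude in a row onto the largest finite FP8 E4M3 value (448). The helper names are hypothetical, and the sketch computes scales and clamps only; it does not round values to the actual FP8 grid.

```python
FP8_E4M3_MAX = 448.0  # largest finite value of the E4M3 format
MIN_SCALE = 1e-12     # floor to avoid dividing by zero on all-zero rows


def per_row_scales(matrix):
    """One scale per row: per output channel for weights,
    or per token for activations (computed at runtime)."""
    return [
        max(max(abs(v) for v in row) / FP8_E4M3_MAX, MIN_SCALE)
        for row in matrix
    ]


def scale_and_clamp(row, scale):
    """Divide by the scale and clamp into the representable range.
    A real kernel would then round each value to the nearest FP8 code."""
    return [
        min(max(v / scale, -FP8_E4M3_MAX), FP8_E4M3_MAX)
        for v in row
    ]


weights = [[0.5, -2.24], [0.0, 0.1]]
scales = per_row_scales(weights)
scaled = [scale_and_clamp(r, s) for r, s in zip(weights, scales)]
```

Dequantization multiplies the stored values by the same per-row scale, so the scale tensors have to be stored alongside the FP8 weights.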

Safetensors
Model size: 33B params
Tensor type: BF16 · F8_E4M3
