Gemma 4 12B IT NVFP4 — r0b0tlab v0 release

v0 — quantization artifact, no engine-side verification yet. This release contains the NVFP4 (W4A4) quantization of google/gemma-4-12B-it produced with NVIDIA Model Optimizer. The artifact is complete and self-consistent; we have not yet verified a full inference-engine run end-to-end on this checkpoint (see "Engine support" below). A v0.1 follow-up will ship with throughput, latency, and wikitext-2 perplexity numbers once the engine side is wired up.

Engine support (status as of 2026-06-03)

Engine | Status
transformers (≥ dev main) | Loads the BF16 base model. Cannot load NVFP4 packed weights (uint8 FP4).
vLLM (≥ 0.22.0) | Blocked: Gemma4UnifiedForConditionalGeneration is not in vLLM's model registry, so it falls back to TransformersMultiModalForCausalLM, which crashes inside flashinfer_scaled_fp4_mm with a 3D→2D activation shape mismatch. We are working on a custom registry registration.
SGLang (dev image) | Blocked: same registry gap as vLLM, plus a deeper issue: SGLang's Gemma4DecoderLayer does not match the 12B Unified's full-attention layer shape (head_dim=512, no v_proj because attention_k_eq_v=True).
TensorRT-LLM | Not yet evaluated.
llama.cpp / GGUF | Not yet evaluated.

Practical advice right now: if you want to use this checkpoint today, the cleanest path is to use transformers (dev main) for the model architecture, dequantize the packed FP4 weights to BF16 yourself, and then run inference. This gives up the speed benefit of FP4 but lets you validate the model. A v0.1 follow-up will publish a working engine path.

Credits and Attribution

This checkpoint was produced by r0b0tlab (@mr-r0b0t on X). It is a derived work built on top of the following projects, models, datasets, and tools, all of which deserve direct credit:

Base model

  • google/gemma-4-12B-it — Google DeepMind. The Gemma 4 12B Unified instruction-tuned multimodal model. The architecture is Gemma4UnifiedForConditionalGeneration, a 48-layer dense 11.96B-parameter model with hybrid sliding-window + global attention, raw-patch image and raw-waveform audio projection, and 256K context.

Quantization tool

  • NVIDIA Model Optimizer (formerly TensorRT Model Optimizer). The PTQ (post-training quantization) library used to convert the BF16 weights and activations to NVFP4. Version used: 0.44.0. The library is part of NVIDIA's inference optimization stack and is integrated with vLLM, SGLang, TensorRT-LLM, and the Megatron training frameworks.

Calibration data

  • abisee/cnn_dailymail — Abigail See, Peter J. Liu, Christopher D. Manning. Get To The Point: Summarization with Pointer-Generator Networks. arXiv:1704.04368, 2017. ~300,000 unique English news articles from CNN and the Daily Mail. Licensed under Apache 2.0. This is the de-facto standard calibration set for NVIDIA's NVFP4 checkpoints (used for nvidia/Gemma-4-31B-IT-NVFP4 and most other NVIDIA-published NVFP4 models).

Prior art (the patterns we adapted)

  • bg-digitalservices/quantize_gemma4_moe.py — the quantization script that this work adapts. The 6-step pipeline (load → apply exclusion → calibrate → quantize → export → copy auxiliary files) is borrowed directly. The MoE plugin classes are removed because the 12B Unified is dense (no MoE). The multimodal exclusion pattern is the intellectual seed of the exclusion list below.
  • bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4 — the published Gemma 4 26B MoE NVFP4 checkpoint that demonstrated ModelOpt NVFP4 + vLLM is viable.

Inference engine (planned)

  • vLLM — the target inference engine. vLLM 0.22.0+ natively supports modelopt_fp4 quantization via --quantization modelopt_fp4. We are working on a custom model registration for Gemma4UnifiedForConditionalGeneration.

Model loading and multimodal processor

  • Hugging Face transformers (≥ 5.10.0.dev0) — the loader for Gemma4UnifiedForConditionalGeneration via AutoModelForImageTextToText and the multimodal processor via AutoProcessor.

Quantization format

  • NVFP4: NVIDIA's 4-bit floating-point format. Weights are stored as FP4 (E2M1) values in blocks of 16, each block carries an FP8 E4M3 scale, and each tensor carries one FP32 global scale. Specified in hf_quant_config.json as quant_algo: NVFP4.
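To make the layout concrete, here is a minimal sketch of decoding one 16-element NVFP4 block in plain Python. The E2M1 value table (0, 0.5, 1, 1.5, 2, 3, 4, 6, plus a sign bit) is the standard FP4 encoding. The nibble order (low nibble first), the function names, and the convention that the dequantized value is the FP4 value times the block scale times the global scale are assumptions for illustration; the block scale is passed in as an already-decoded float rather than raw FP8 bits. Check against ModelOpt's own export code before relying on it.

```python
from typing import List, Sequence

# Magnitudes representable by FP4 E2M1 (low 3 bits); bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # matches group_size in hf_quant_config.json


def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit E2M1 code (0..15) to a float."""
    magnitude = E2M1_MAGNITUDES[nibble & 0b0111]
    return -magnitude if nibble & 0b1000 else magnitude


def unpack_bytes(packed: Sequence[int]) -> List[int]:
    """Split each uint8 into two 4-bit codes (ASSUMED: low nibble first)."""
    codes: List[int] = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append((byte >> 4) & 0x0F)
    return codes


def dequantize_block(packed: Sequence[int], block_scale: float,
                     global_scale: float) -> List[float]:
    """Dequantize one block (ASSUMED: fp4 * block_scale * global_scale)."""
    codes = unpack_bytes(packed)
    if len(codes) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} codes, got {len(codes)}")
    scale = block_scale * global_scale
    return [decode_e2m1(code) * scale for code in codes]
```

Eight packed bytes make up one block, so a full weight row of width 3,840 would be 240 such blocks, each with its own FP8 scale.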

Model overview

Property | Value
Base model | google/gemma-4-12B-it
Architecture | Gemma4UnifiedForConditionalGeneration (encoder-free multimodal)
Parameters | 11.96B total
Active parameters | 11.96B (dense, no MoE)
Context length | 256K tokens (config)
Modalities | Text, Image, Audio
Vocabulary | 262,144 tokens
Layers | 48 (40 sliding-window + 8 global attention)
Hidden size | 3,840
Intermediate size | 15,360
Attention heads | 16 query, 8 KV (head_dim 256 sliding, 512 global)
Sliding window | 1,024 tokens (5:1 sliding:global ratio)
Positional encoding | Standard RoPE (sliding) + Proportional RoPE (global)
Multimodal design | Raw image patches and audio waveforms are projected into the LLM embedding space via small linear layers (no separate vision/audio encoders)
Quantization | NVFP4 (W4A4), NVIDIA Model Optimizer v0.44.0
Quantized layers | All LLM attention (Q, K, O) + LLM MLP (gate, up, down) = 11.0B params
Excluded layers | Vision embedder, vision projection, audio projection, vocab embedding, all norms, per-layer scalars = 1.0B params (mostly the vocab embedding)
Compression | BF16 23.95 GB → NVFP4 8.28 GB (2.89× smaller)
Tensor types | FP4 (weights) + FP8 E4M3 (per-block scales) + FP32 (per-tensor global scale) + BF16 (excluded layers)
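The compression figure can be sanity-checked with a back-of-envelope estimate. The sketch below assumes 4 bits per FP4 weight plus one 8-bit FP8 scale per block of 16 (4.5 bits per quantized weight) and 16 bits per preserved BF16 weight. The parameter split is rounded from the table above, and the estimate ignores per-tensor global scales, any other auxiliary tensors, and file metadata, so it is expected to land somewhat below the reported file size.

```python
FP4_BITS = 4
FP8_SCALE_BITS = 8
BF16_BITS = 16
BLOCK_SIZE = 16  # NVFP4 group size


def nvfp4_bits_per_weight(block_size: int = BLOCK_SIZE) -> float:
    """Effective storage per quantized weight, including its share of the block scale."""
    return FP4_BITS + FP8_SCALE_BITS / block_size


def estimate_size_gb(quantized_params: float, preserved_params: float) -> float:
    """Rough checkpoint size in GB (1e9 bytes) for a mixed NVFP4 + BF16 model."""
    quantized_bytes = quantized_params * nvfp4_bits_per_weight() / 8
    preserved_bytes = preserved_params * BF16_BITS / 8
    return (quantized_bytes + preserved_bytes) / 1e9


# Rounded figures from the table: 11.96B total, 11.0B quantized.
estimate = estimate_size_gb(quantized_params=11.0e9,
                            preserved_params=11.96e9 - 11.0e9)
print(f"estimated size: {estimate:.2f} GB")
```

With these rounded inputs the estimate comes out at roughly 8.1 GB, in the same range as the 8.28 GB reported above.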

What's quantized vs preserved

Quantized to NVFP4 (W4A4, FP4 weights and FP4 activations)

  • model.language_model.layers.{0-47}.self_attn.q_proj.weight (48 tensors)
  • model.language_model.layers.{0-47}.self_attn.k_proj.weight (no separate v_proj; attention_k_eq_v=True means V is a copy of K)
  • model.language_model.layers.{0-47}.self_attn.o_proj.weight (48 tensors)
  • model.language_model.layers.{0-47}.mlp.gate_proj.weight (48 tensors)
  • model.language_model.layers.{0-47}.mlp.up_proj.weight (48 tensors)
  • model.language_model.layers.{0-47}.mlp.down_proj.weight (48 tensors)

Total quantized: 328 weight tensors (~11.0B params).

Preserved in BF16 (excluded from quantization)

Module | Tensors | Reason
model.embed_vision.* (patch_dense, patch_ln1, patch_ln2, pos_norm, pos_embedding) | 9 | Patch tokenizer; high numerical sensitivity
model.embed_vision.embedding_projection.weight | 1 | Vision→LLM projection (6912→3840)
model.embed_audio.embedding_projection.weight | 1 | Audio→LLM projection (640→3840)
model.language_model.embed_tokens.weight | 1 | Vocab embedding [262144, 3840]; excluded by ModelOpt's default config
model.language_model.layers.*.layer_scalar | 48 | Per-layer scalar (1D)
model.language_model.layers.*.input_layernorm.weight | 48 | RMS norm (1D)
model.language_model.layers.*.post_attention_layernorm.weight | 48 | RMS norm (1D)
model.language_model.layers.*.pre_feedforward_layernorm.weight | 48 | RMS norm (1D)
model.language_model.layers.*.post_feedforward_layernorm.weight | 48 | RMS norm (1D)
model.language_model.layers.*.self_attn.k_norm.weight | 48 | RMS norm on K (1D)
model.language_model.layers.*.self_attn.q_norm.weight | 48 | RMS norm on Q (1D)
model.language_model.norm.weight | 1 | Final norm (1D)

The explicit exclusion list recorded in hf_quant_config.json is:

"exclude_modules": [
  "lm_head",
  "model.embed_audio*",
  "model.embed_vision*"
]

ModelOpt's default config also excludes norms, biases, and the vocab embedding; the three patterns above are the entries added on top of those defaults.
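As an illustration of how wildcard entries like these select modules, the sketch below matches module names against the three patterns with Python's glob-style fnmatch. This is an assumption about the matching semantics, not a copy of ModelOpt's or any engine's implementation, and the helper name is_excluded is hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Sequence

EXCLUDE_PATTERNS = ("lm_head", "model.embed_audio*", "model.embed_vision*")


def is_excluded(module_name: str,
                patterns: Sequence[str] = EXCLUDE_PATTERNS) -> bool:
    """Return True if the module name matches any glob-style exclusion pattern."""
    return any(fnmatchcase(module_name, pattern) for pattern in patterns)


for name in (
    "lm_head",
    "model.embed_vision.embedding_projection",
    "model.embed_audio.embedding_projection",
    "model.language_model.layers.0.self_attn.q_proj",
):
    print(f"{name}: {'BF16 (excluded)' if is_excluded(name) else 'NVFP4'}")
```

Under this reading, the trailing * covers every sub-module of the vision and audio embedders, while the language-model projections fall through to quantization.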

Calibration details

  • Calibration set: abisee/cnn_dailymail (3.0.0)
  • Number of samples: 512 (text-only forward pass)
  • Sequence length: 1,024 tokens
  • Batch size: 4
  • Forward loop: model(input_ids=batch) only
  • Why text-only calibration: the multimodal pipeline (vision embedder + projection, audio projection) is excluded from quantization, so the calibration data does not need to be multimodal. This is the same approach used by all NVIDIA-published NVFP4 checkpoints.
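The settings above can be sketched as a small, framework-free calibration loop. This is not the script used for this release: the helper names are hypothetical, token IDs are plain Python lists rather than padded tensors on a device, and the model is any callable that accepts input_ids.

```python
from typing import Callable, List, Sequence

NUM_SAMPLES = 512
SEQ_LEN = 1024
BATCH_SIZE = 4

Batch = List[List[int]]


def make_batches(samples: Sequence[Sequence[int]],
                 batch_size: int = BATCH_SIZE,
                 seq_len: int = SEQ_LEN) -> List[Batch]:
    """Truncate each tokenized sample to seq_len and group into batches."""
    truncated = [list(ids[:seq_len]) for ids in samples]
    return [truncated[i:i + batch_size]
            for i in range(0, len(truncated), batch_size)]


def make_forward_loop(batches: Sequence[Batch]) -> Callable[[Callable], None]:
    """Build a text-only forward loop: one model(input_ids=batch) call per batch."""
    def forward_loop(model: Callable) -> None:
        for batch in batches:
            # In a real run this would be a tensor on the model's device,
            # executed without gradients.
            model(input_ids=batch)
    return forward_loop
```

In a ModelOpt pipeline, a loop of this shape is what gets passed as the forward_loop argument of modelopt.torch.quantization.quantize so that the quantizers can observe activation ranges. With 512 samples and a batch size of 4, the loop runs 128 forward passes.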

Quantization config (exact)

The hf_quant_config.json in this repo records:

{
  "producer": {"name": "modelopt", "version": "0.44.0"},
  "quant_method": "modelopt_fp4",
  "quantization": {
    "quant_algo": "NVFP4",
    "kv_cache_quant_algo": null,
    "group_size": 16,
    "exclude_modules": [
      "lm_head",
      "model.embed_audio*",
      "model.embed_vision*"
    ]
  }
}
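Before pointing an engine at a checkpoint, it can help to read this file and confirm the fields match expectations. The sketch below parses a config of the shape shown above using only the standard library; the function name summarize_quant_config is hypothetical, and the JSON is embedded as a string so the example runs without the repository on disk.

```python
import json
from typing import Any, Dict

CONFIG_TEXT = """
{
  "producer": {"name": "modelopt", "version": "0.44.0"},
  "quant_method": "modelopt_fp4",
  "quantization": {
    "quant_algo": "NVFP4",
    "kv_cache_quant_algo": null,
    "group_size": 16,
    "exclude_modules": ["lm_head", "model.embed_audio*", "model.embed_vision*"]
  }
}
"""


def summarize_quant_config(text: str) -> Dict[str, Any]:
    """Extract the fields a loader would typically care about."""
    config = json.loads(text)
    quant = config["quantization"]
    return {
        "producer": f'{config["producer"]["name"]} {config["producer"]["version"]}',
        "algo": quant["quant_algo"],
        "group_size": quant["group_size"],
        "kv_cache_quantized": quant["kv_cache_quant_algo"] is not None,
        "excluded": list(quant["exclude_modules"]),
    }


summary = summarize_quant_config(CONFIG_TEXT)
print(summary)
```

The null kv_cache_quant_algo means the KV cache is left unquantized in this release; only the weights and activations of the listed linear layers use NVFP4.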

Quality (deferred to v0.1)

We have not run a full PPL or benchmark comparison for this v0 release. Based on NVIDIA's published NVFP4 model cards (e.g. nvidia/Gemma-4-31B-IT-NVFP4, which reports a 0.2–0.4 pp loss across GPQA Diamond, AIME 2025, MMLU Pro, LiveCodeBench, Scicode, and Terminal-Bench Hard), we expect NVFP4 to retain more than 99% of BF16 accuracy. The 12B Unified is a different architecture from the 31B, so we do not claim parity; a wikitext-2 PPL comparison and a small multimodal smoke test are planned for v0.1.
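For reference, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows that definition and a relative-increase helper of the kind a BF16 vs NVFP4 comparison would use. Both function names are hypothetical, the inputs are natural-log token probabilities, and no measured numbers are implied.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood), natural-log inputs."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def relative_ppl_increase(ppl_reference: float, ppl_quantized: float) -> float:
    """Fractional perplexity increase of the quantized model over the reference."""
    return (ppl_quantized - ppl_reference) / ppl_reference
```

A model that assigns probability 0.25 to every token has a perplexity of 4, which is a convenient check that an evaluation harness is wired correctly.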

How to use

With transformers (for direct use / research)

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "r0b0tlab/gemma-4-12B-it-nvfp4"
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# Text
msgs = [{"role": "user", "content": [{"type": "text",
          "text": "What is the capital of France?"}]}]
inputs = processor.apply_chat_template(
    msgs, tokenize=True, return_dict=True, return_tensors="pt",
    add_generation_prompt=True).to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(processor.decode(output[0][inputs.input_ids.shape[-1]:],
                       skip_special_tokens=True))

Caveat: this snippet instantiates the BF16 architecture. transformers cannot interpret the packed NVFP4 weights (uint8 FP4), so running the quantized checkpoint as intended requires an engine with FP4 support. See "Engine support" above.

With vLLM (planned, not yet working)

# The command we expect to work once the engine is fixed:
vllm serve r0b0tlab/gemma-4-12B-it-nvfp4 \
  --quantization modelopt_fp4 \
  --tensor-parallel-size 1 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.85

Model lineage

google/gemma-4-12B (Base, BF16)
    └── google/gemma-4-12B-it (Instruction-tuned, BF16)
            └── r0b0tlab/gemma-4-12B-it-nvfp4 (this model, NVFP4)

License

This is a derived work.

  • Base model: Gemma 4 12B IT, © Google DeepMind, licensed under the Gemma Terms of Use and the Apache License 2.0.
  • Quantization: © 2026 r0b0tlab (@mr-r0b0t). The quantization script and configuration choices are released under Apache 2.0.
  • Calibration data: CNN/Daily Mail, © Abigail See et al., licensed under Apache 2.0.
  • Distributed under: Apache License 2.0.

Notes and limitations

  • Engine support is incomplete. See the status table at the top of this card. v0 ships the quantization artifact only; v0.1 will ship with a working engine path and benchmark numbers.
  • Multimodal sub-modules are preserved in BF16. The vision embedder (35M), vision projection (15M), and audio projection (~2.5M) are not quantized. This is a conservative choice; quantizing them would save < 100 MB and we judged the numerical risk of degrading multimodal understanding unacceptable.
  • Calibration is text-only. Following the NVIDIA NVFP4 standard, the calibration forward loop is text-only.
  • No fine-tuning was performed. This is a pure PTQ (post-training quantization) checkpoint; no QAT (quantization-aware training) or LoRA adapters are included.
  • Hardware requirements. NVFP4 reaches its intended speed only on NVIDIA GPUs with native FP4 tensor cores (Blackwell generation). On GPUs without native FP4, an engine may fall back to an emulation backend, which is significantly slower.
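A quick way to see which side of the hardware requirement a machine falls on is to inspect the CUDA compute capability. The sketch below assumes that native FP4 tensor cores begin with the Blackwell generation, that is, compute capability major version 10 or higher. That threshold and the helper names are assumptions for illustration, and an engine may apply stricter checks of its own.

```python
from typing import Tuple

# ASSUMPTION: native FP4 tensor cores start at compute capability 10.x (Blackwell).
MIN_NATIVE_FP4_MAJOR = 10


def has_native_fp4(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability implies native FP4."""
    major, _minor = capability
    return major >= MIN_NATIVE_FP4_MAJOR


def describe_current_gpu() -> str:
    """Report FP4 support for GPU 0, if PyTorch and CUDA are available."""
    try:
        import torch
    except ImportError:
        return "PyTorch not installed; cannot query the GPU"
    if not torch.cuda.is_available():
        return "no CUDA device visible"
    capability = torch.cuda.get_device_capability(0)
    kind = "native FP4" if has_native_fp4(capability) else "no native FP4"
    name = torch.cuda.get_device_name(0)
    return f"{name} (sm_{capability[0]}{capability[1]}): {kind}"


print(describe_current_gpu())
```

On Hopper (9.0) or Ada (8.9) cards this reports no native FP4, which is the case where an emulation path and its slowdown should be expected.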

How to cite this model

@misc{r0b0tlab_gemma4_12b_nvfp4_2026,
  title={Gemma 4 12B IT NVFP4 (r0b0tlab native optimization, v0)},
  author={r0b0tlab},
  year={2026},
  howpublished={Hugging Face},
  note={NVFP4 quantization of google/gemma-4-12B-it via NVIDIA Model Optimizer v0.44.0},
  url={https://huggingface.co/r0b0tlab/gemma-4-12B-it-nvfp4}
}

@misc{google_gemma4_12b_2026,
  title={Gemma 4 12B (Unified)},
  author={Google DeepMind},
  year={2026},
  howpublished={Hugging Face},
  url={https://huggingface.co/google/gemma-4-12B}
}

@software{nvidia_modelopt_2026,
  title={TensorRT Model Optimizer},
  author={NVIDIA},
  year={2026},
  url={https://github.com/NVIDIA/TensorRT-Model-Optimizer}
}

@misc{cnn_dailymail_2017,
  title={Get To The Point: Summarization with Pointer-Generator Networks},
  author={Abigail See and Peter J. Liu and Christopher D. Manning},
  year={2017},
  eprint={1704.04368},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/1704.04368}
}

Contact
