Gemma 4 12B IT NVFP4 — r0b0tlab v0 release
v0: quantization artifact, no engine-side verification yet. This release contains the NVFP4 (W4A4) quantization of google/gemma-4-12B-it, produced with NVIDIA Model Optimizer. The artifact is complete and self-consistent; we have not yet verified a full inference-engine run end-to-end on this checkpoint (see "Engine support" below). A v0.1 follow-up will ship with throughput, latency, and wikitext-2 perplexity numbers once the engine side is wired up.
Engine support (status as of 2026-06-03)
| Engine | Status |
|---|---|
| transformers (≥ dev main) | Loads the BF16 base model. Cannot load NVFP4 packed weights (uint8 FP4). |
| vLLM (≥ 0.22.0) | Blocked: Gemma4UnifiedForConditionalGeneration is not in vLLM's model registry; it falls back to TransformersMultiModalForCausalLM which crashes inside flashinfer_scaled_fp4_mm with a 3D→2D activation shape mismatch. We are working on a custom registry registration. |
| SGLang (dev image) | Blocked: same registry gap as vLLM, plus a deeper issue — SGLang's Gemma4DecoderLayer does not match the 12B Unified's full-attention layer shape (head_dim=512, no v_proj because attention_k_eq_v=True). |
| TensorRT-LLM | Not yet evaluated. |
| llama.cpp / GGUF | Not yet evaluated. |
Practical advice right now: if you want to use this
checkpoint, the cleanest path is to load it in
transformers (dev main) and dequantize the FP4 weights to
BF16 yourself, then run inference. This loses the speed
benefit of FP4 but lets you validate the model. A v0.1
follow-up will publish a working engine path.
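Dequantizing by hand comes down to unpacking two FP4 codes per byte and applying the two scale levels. The pure-Python sketch below illustrates that arithmetic on a single row; it is written for clarity, not speed. The function names, the low-nibble-first packing order, and the way the scales are passed in are assumptions for illustration and should be checked against the actual tensors in the checkpoint.

```python
# FP4 E2M1 magnitudes for the low 3 bits of a code; bit 3 is the sign.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_fp4(code: int) -> float:
    """Decode one 4-bit E2M1 code to a Python float."""
    mag = E2M1[code & 0x7]
    return -mag if code & 0x8 else mag


def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 byte (bias 7, no infinities) to a float."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    mant = byte & 0x7
    if exp == 0xF and mant == 0x7:
        return float("nan")
    if exp == 0:  # subnormal
        return sign * (mant / 8.0) * 2.0 ** -6
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)


def dequantize_row(packed: bytes, block_scales: bytes,
                   global_scale: float, block_size: int = 16) -> list[float]:
    """Unpack FP4 pairs and apply per-block and per-tensor scales."""
    values = []
    for b in packed:
        values.append(decode_fp4(b & 0xF))  # assumed: low nibble comes first
        values.append(decode_fp4(b >> 4))
    return [
        v * decode_e4m3(block_scales[i // block_size]) * global_scale
        for i, v in enumerate(values)
    ]


# One block of 16 weights: 8 packed bytes, one FP8 scale (0x38 encodes 1.0).
row = dequantize_row(bytes([0x21] * 8), bytes([0x38]), global_scale=2.0)
```

In practice the same steps would be vectorized over whole tensors and the result cast to BF16 before being loaded into the model.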
Credits and Attribution
This checkpoint was produced by r0b0tlab (@mr-r0b0t on X). It is derived work built on top of the following projects, models, datasets, and tools — all of which deserve direct credit:
Base model
- google/gemma-4-12B-it (Google DeepMind): the Gemma 4 12B Unified instruction-tuned multimodal model. The architecture is Gemma4UnifiedForConditionalGeneration, a 48-layer dense 11.96B-parameter model with hybrid sliding-window + global attention, raw-patch image and raw-waveform audio projection, and 256K context.
Quantization tool
- NVIDIA Model Optimizer (formerly TensorRT Model Optimizer). The PTQ (post-training quantization) library used to convert the BF16 weights and activations to NVFP4. Version used: 0.44.0. The library is part of NVIDIA's inference optimization stack and is integrated with vLLM, SGLang, TensorRT-LLM, and the Megatron training frameworks.
Calibration data
- abisee/cnn_dailymail: Abigail See, Peter J. Liu, Christopher D. Manning. Get To The Point: Summarization with Pointer-Generator Networks. arXiv:1704.04368, 2017. ~300,000 unique English news articles from CNN and the Daily Mail. Licensed under Apache 2.0. This is the de-facto standard calibration set for NVIDIA's NVFP4 checkpoints (used for nvidia/Gemma-4-31B-IT-NVFP4 and most other NVIDIA-published NVFP4 models).
Prior art (the patterns we adapted)
- bg-digitalservices/quantize_gemma4_moe.py: the quantization script that this work adapts. The 6-step pipeline (load → apply exclusion → calibrate → quantize → export → copy auxiliary files) is borrowed directly. The MoE plugin classes are removed because the 12B Unified is dense (no MoE). The multimodal exclusion pattern is the intellectual seed of the exclusion list below.
- bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4: the published Gemma 4 26B MoE NVFP4 checkpoint that demonstrated ModelOpt NVFP4 + vLLM is viable.
Inference engine (planned)
- vLLM: the target inference engine. vLLM 0.22.0+ natively supports modelopt_fp4 quantization via --quantization modelopt_fp4. We are working on a custom model registration for Gemma4UnifiedForConditionalGeneration.
Model loading and multimodal processor
- Hugging Face transformers (≥ 5.10.0.dev0): the loader for Gemma4UnifiedForConditionalGeneration via AutoModelForImageTextToText, and the multimodal processor via AutoProcessor.
Quantization format
- NVFP4: NVIDIA's 4-bit floating-point format, which stores FP4 weights with FP8 E4M3 per-block scales and an FP32 per-tensor global scale. Specified in hf_quant_config.json as quant_algo: NVFP4.
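To make the block-scaling idea concrete, the sketch below quantizes one block of weights to the FP4 E2M1 grid using a single scale per block. It is a simplified illustration rather than ModelOpt's implementation: the scale is kept as a plain Python float instead of being rounded to FP8 E4M3 relative to a per-tensor FP32 scale, ties go to the smaller magnitude, and the function names are hypothetical.

```python
# Magnitudes representable in FP4 E2M1 (the sign is a separate bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = 6.0


def quantize_block(block: list[float]) -> tuple[float, list[float]]:
    """Return (scale, values on the E2M1 grid) for one block of weights."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest FP4 value.
    scale = amax / FP4_MAX if amax > 0 else 1.0
    quantized = []
    for x in block:
        target = abs(x) / scale
        # Round to the nearest representable magnitude.
        mag = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        quantized.append(-mag if x < 0 else mag)
    return scale, quantized


def dequantize_block(scale: float, quantized: list[float]) -> list[float]:
    """Reconstruct approximate weights from the grid values and the scale."""
    return [scale * q for q in quantized]


# A block of 16 values, matching the group_size of 16 used by NVFP4.
block = [0.0, 1.5, -3.0, 6.0, 2.4, 5.2] + [0.0] * 10
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
```

Values that already sit on the grid survive unchanged, while 2.4 and 5.2 snap to their nearest grid points; that rounding is the error calibration tries to keep small.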
Model overview
| Property | Value |
|---|---|
| Base model | google/gemma-4-12B-it |
| Architecture | Gemma4UnifiedForConditionalGeneration (encoder-free multimodal) |
| Parameters | 11.96B total |
| Active parameters | 11.96B (dense, no MoE) |
| Context length | 256K tokens (config) |
| Modalities | Text, Image, Audio |
| Vocabulary | 262,144 tokens |
| Layers | 48 (40 sliding-window + 8 global attention) |
| Hidden size | 3,840 |
| Intermediate size | 15,360 |
| Attention heads | 16 query, 8 KV (head_dim 256 sliding, 512 global) |
| Sliding window | 1,024 tokens (5:1 sliding:global ratio) |
| Positional encoding | Standard RoPE (sliding) + Proportional RoPE (global) |
| Multimodal design | Raw image patches and audio waveforms are projected into the LLM embedding space via small linear layers (no separate vision/audio encoders) |
| Quantization | NVFP4 (W4A4), NVIDIA Model Optimizer v0.44.0 |
| Quantized layers | All LLM attention (Q, K, O) + LLM MLP (gate, up, down) = 11.0B params |
| Excluded layers | Vision embedder, vision projection, audio projection, vocab embedding, all norms, per-layer scalars = 1.0B params (mostly the vocab embedding) |
| Compression | BF16 23.95 GB → NVFP4 8.28 GB (2.89× smaller) |
| Tensor types | FP4 (weights) + FP8 E4M3 (per-block scales) + FP32 (per-tensor global scale) + BF16 (excluded layers) |
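The compression figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes each quantized weight costs 4 bits plus one 8-bit FP8 scale shared across a block of 16, and that excluded layers stay at 2 bytes per parameter; it ignores per-tensor scales and file metadata, and uses the rounded parameter counts from the table.

```python
def nvfp4_bits_per_weight(block_size: int = 16, scale_bits: int = 8) -> float:
    """Effective storage per weight: 4-bit value plus its share of the block scale."""
    return 4 + scale_bits / block_size


def estimate_size_gb(quantized_params: float, bf16_params: float) -> float:
    """Rough checkpoint size in GB (10^9 bytes)."""
    quant_bytes = quantized_params * nvfp4_bits_per_weight() / 8
    bf16_bytes = bf16_params * 2  # BF16 is 2 bytes per parameter
    return (quant_bytes + bf16_bytes) / 1e9


# Rounded counts from the table: ~11.0B quantized, ~1.0B kept in BF16.
estimate = estimate_size_gb(11.0e9, 1.0e9)
print(f"{nvfp4_bits_per_weight():.1f} bits/weight, ~{estimate:.2f} GB")
```

With these inputs the estimate comes out at roughly 8.19 GB, slightly under the 8.28 GB reported above; rounded parameter counts and file metadata could plausibly account for the gap.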
What's quantized vs preserved
Quantized to NVFP4 (W4A4, FP4 weights and FP4 activations)
- model.language_model.layers.{0-47}.self_attn.q_proj.weight (48 tensors)
- model.language_model.layers.{0-47}.self_attn.k_proj.weight (no separate v_proj; attention_k_eq_v=True means V is a copy of K)
- model.language_model.layers.{0-47}.self_attn.o_proj.weight (48 tensors)
- model.language_model.layers.{0-47}.mlp.gate_proj.weight (48 tensors)
- model.language_model.layers.{0-47}.mlp.up_proj.weight (48 tensors)
- model.language_model.layers.{0-47}.mlp.down_proj.weight (48 tensors)
Total quantized: 328 weight tensors (~11.0B params).
Preserved in BF16 (excluded from quantization)
| Module | Tensors | Reason |
|---|---|---|
| model.embed_vision.* (patch_dense, patch_ln1, patch_ln2, pos_norm, pos_embedding) | 9 | Patch tokenizer; high numerical sensitivity |
| model.embed_vision.embedding_projection.weight | 1 | Vision→LLM projection (6912→3840) |
| model.embed_audio.embedding_projection.weight | 1 | Audio→LLM projection (640→3840) |
| model.language_model.embed_tokens.weight | 1 | Vocab embedding [262144, 3840]; excluded by ModelOpt's default config |
| model.language_model.layers.*.layer_scalar | 48 | Per-layer scalar (1D) |
| model.language_model.layers.*.input_layernorm.weight | 48 | RMS norm (1D) |
| model.language_model.layers.*.post_attention_layernorm.weight | 48 | RMS norm (1D) |
| model.language_model.layers.*.pre_feedforward_layernorm.weight | 48 | RMS norm (1D) |
| model.language_model.layers.*.post_feedforward_layernorm.weight | 48 | RMS norm (1D) |
| model.language_model.layers.*.self_attn.k_norm.weight | 48 | RMS norm on K (1D) |
| model.language_model.layers.*.self_attn.q_norm.weight | 48 | RMS norm on Q (1D) |
| model.language_model.norm.weight | 1 | Final norm (1D) |
The full exclusion list is in hf_quant_config.json:
"exclude_modules": [
"lm_head",
"model.embed_audio*",
"model.embed_vision*"
]
ModelOpt's default config also excludes norms, biases, and the vocab embedding; the three lines above are the modelopt-specific additions.
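The wildcard entries are easiest to read as glob patterns over module names. The sketch below assumes glob-style matching (as provided by Python's fnmatch) purely for illustration; the helper name is hypothetical, and an engine's own matching rules may differ in detail.

```python
from fnmatch import fnmatchcase

# Patterns copied from exclude_modules in hf_quant_config.json.
EXCLUDE_MODULES = ["lm_head", "model.embed_audio*", "model.embed_vision*"]


def is_excluded(module_name: str, patterns=EXCLUDE_MODULES) -> bool:
    """Return True if the module name matches any exclusion pattern."""
    return any(fnmatchcase(module_name, p) for p in patterns)


for name in [
    "model.embed_vision.embedding_projection",
    "model.embed_audio.embedding_projection",
    "model.language_model.layers.0.self_attn.q_proj",
]:
    print(name, "->", "BF16" if is_excluded(name) else "NVFP4 candidate")
```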
Calibration details
- Calibration set: abisee/cnn_dailymail (3.0.0)
- Number of samples: 512 (text-only forward pass)
- Sequence length: 1,024 tokens
- Batch size: 4
- Forward loop: model(input_ids=batch) only
- Why text-only calibration: the multimodal pipeline (vision embedder + projection, audio projection) is excluded from quantization, so the calibration data does not need to be multimodal. This is the same approach NVIDIA uses for its published NVFP4 checkpoints.
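A minimal sketch of a calibration loop with these settings follows. The helper names are hypothetical and the model is a stand-in callable so the example runs without a GPU; in a real run each batch would be a tensor of token IDs, and a callable of this shape is what ModelOpt's mtq.quantize accepts as its forward_loop argument.

```python
from typing import Callable, List, Sequence


def make_batches(samples: Sequence, batch_size: int = 4) -> List[Sequence]:
    """Split tokenized samples into consecutive batches."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


def build_forward_loop(batches: List[Sequence]) -> Callable:
    """Return a callable that runs every calibration batch through the model."""
    def forward_loop(model) -> None:
        for batch in batches:
            model(input_ids=batch)  # text-only: no pixel or audio inputs
    return forward_loop


# Stand-in data: 512 samples of 1,024 token IDs each (all zeros here).
samples = [[0] * 1024 for _ in range(512)]
batch_sizes_seen = []


def fake_model(input_ids):
    """Placeholder for the real model; records the batch size it was given."""
    batch_sizes_seen.append(len(input_ids))


build_forward_loop(make_batches(samples))(fake_model)
print(len(batch_sizes_seen), "forward passes")
```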
Quantization config (exact)
The hf_quant_config.json in this repo records:
{
  "producer": {"name": "modelopt", "version": "0.44.0"},
  "quant_method": "modelopt_fp4",
  "quantization": {
    "quant_algo": "NVFP4",
    "kv_cache_quant_algo": null,
    "group_size": 16,
    "exclude_modules": [
      "lm_head",
      "model.embed_audio*",
      "model.embed_vision*"
    ]
  }
}
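Before pointing an engine at the checkpoint, it can help to confirm what the config actually declares. The sketch below parses the JSON shown above from an inline string; the summary function and its output keys are hypothetical, and reading the real file from disk would replace the inline text.

```python
import json

CONFIG_TEXT = """
{
  "producer": {"name": "modelopt", "version": "0.44.0"},
  "quant_method": "modelopt_fp4",
  "quantization": {
    "quant_algo": "NVFP4",
    "kv_cache_quant_algo": null,
    "group_size": 16,
    "exclude_modules": ["lm_head", "model.embed_audio*", "model.embed_vision*"]
  }
}
"""


def summarize_quant_config(text: str) -> dict:
    """Extract the fields that matter when choosing an inference engine."""
    cfg = json.loads(text)
    quant = cfg["quantization"]
    return {
        "algo": quant["quant_algo"],
        "group_size": quant["group_size"],
        # JSON null means the KV cache is left unquantized.
        "kv_cache_quantized": quant["kv_cache_quant_algo"] is not None,
        "excluded": list(quant["exclude_modules"]),
    }


summary = summarize_quant_config(CONFIG_TEXT)
print(summary)
```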
Quality (deferred to v0.1)
We have not run a full PPL or benchmark comparison in
this v0 release. The expected behaviour based on NVIDIA's
publicly published NVFP4 model cards (e.g.
nvidia/Gemma-4-31B-IT-NVFP4, which reports 0.2–0.4pp loss
across GPQA Diamond, AIME 2025, MMLU Pro, LiveCodeBench,
Scicode, and Terminal-Bench Hard) is that NVFP4 retains
99% of BF16 accuracy. The 12B Unified is a different architecture than the 31B, so we do not claim parity; a wikitext-2 PPL comparison and a small multimodal smoke test are planned for v0.1.
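For reference, perplexity is the exponential of the mean negative log-likelihood per token, so a BF16 vs NVFP4 comparison reduces to two such numbers and their relative difference. The sketch below shows only that arithmetic; the function names are hypothetical, the inputs are made-up log-probabilities, and no measured results are implied.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def relative_ppl_increase(ppl_base: float, ppl_quant: float) -> float:
    """Fractional perplexity increase of the quantized model over the baseline."""
    return (ppl_quant - ppl_base) / ppl_base


# Toy input: every token predicted with probability 0.25.
toy_ppl = perplexity([math.log(0.25)] * 4)
```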
How to use
With transformers (for direct use / research)
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "r0b0tlab/gemma-4-12B-it-nvfp4"
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# Text
msgs = [{"role": "user", "content": [{"type": "text",
         "text": "What is the capital of France?"}]}]
inputs = processor.apply_chat_template(
    msgs, tokenize=True, return_dict=True, return_tensors="pt",
    add_generation_prompt=True).to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt
print(processor.decode(output[0][inputs.input_ids.shape[-1]:],
                       skip_special_tokens=True))
Caveat: this loads the BF16 base architecture. Loading the NVFP4 packed weights requires an engine with FP4 support. See "Engine support" above.
With vLLM (planned, not yet working)
# The command we expect to work once the engine is fixed:
vllm serve r0b0tlab/gemma-4-12B-it-nvfp4 \
--quantization modelopt_fp4 \
--tensor-parallel-size 1 \
--max-model-len 65536 \
--gpu-memory-utilization 0.85
Model lineage
google/gemma-4-12B (Base, BF16)
└── google/gemma-4-12B-it (Instruction-tuned, BF16)
└── r0b0tlab/gemma-4-12B-it-nvfp4 (this model, NVFP4)
License
This is a derived work.
- Base model: Gemma 4 12B IT, © Google DeepMind, licensed under the Gemma Terms of Use and the Apache License 2.0.
- Quantization: © 2026 r0b0tlab (@mr-r0b0t). The quantization script and configuration choices are released under Apache 2.0.
- Calibration data: CNN/Daily Mail, © Abigail See et al., licensed under Apache 2.0.
- Distributed under: Apache License 2.0.
Notes and limitations
- Engine support is incomplete. See the status table at the top of this card. v0 ships the quantization artifact only; v0.1 will ship with a working engine path and benchmark numbers.
- Multimodal sub-modules are preserved in BF16. The vision embedder (35M), vision projection (15M), and audio projection (~2.5M) are not quantized. This is a conservative choice; quantizing them would save < 100 MB and we judged the numerical risk of degrading multimodal understanding unacceptable.
- Calibration is text-only. Following the NVIDIA NVFP4 standard, the calibration forward loop is text-only.
- No fine-tuning was performed. This is a pure PTQ (post-training quantization) checkpoint; no QAT (quantization-aware training) or LoRA adapters are included.
- Hardware requirements. NVFP4 requires an NVIDIA GPU with native FP4 tensor-core execution. On GPUs without native FP4, the engine will fall back to an emulation backend which is significantly slower.
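A quick pre-flight check for native FP4 can be sketched as below. The threshold is an assumption: it treats Blackwell-class GPUs (CUDA compute capability 10.0 and above) as the ones with native FP4 tensor cores, and the helper name is hypothetical. On a machine with PyTorch and a GPU, the (major, minor) pair can be obtained from torch.cuda.get_device_capability().

```python
def has_native_fp4(major: int, minor: int = 0) -> bool:
    """Heuristic: assume native FP4 tensor cores from compute capability 10.0 up."""
    # Assumption for illustration; consult NVIDIA's documentation for your GPU.
    return (major, minor) >= (10, 0)


# Example capabilities: 8.0 (Ampere), 9.0 (Hopper), 10.0 and 12.0 (Blackwell).
for cap in [(8, 0), (9, 0), (10, 0), (12, 0)]:
    status = "native FP4" if has_native_fp4(*cap) else "emulation fallback likely"
    print(cap, "->", status)
```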
How to cite this model
@misc{r0b0tlab_gemma4_12b_nvfp4_2026,
title={Gemma 4 12B IT NVFP4 (r0b0tlab native optimization, v0)},
author={r0b0tlab},
year={2026},
howpublished={Hugging Face},
note={NVFP4 quantization of google/gemma-4-12B-it via NVIDIA Model Optimizer v0.44.0},
url={https://huggingface.co/r0b0tlab/gemma-4-12B-it-nvfp4}
}
@misc{google_gemma4_12b_2026,
title={Gemma 4 12B (Unified)},
author={Google DeepMind},
year={2026},
howpublished={Hugging Face},
url={https://huggingface.co/google/gemma-4-12B}
}
@software{nvidia_modelopt_2026,
title={TensorRT Model Optimizer},
author={NVIDIA},
year={2026},
url={https://github.com/NVIDIA/TensorRT-Model-Optimizer}
}
@misc{cnn_dailymail_2017,
title={Get To The Point: Summarization with Pointer-Generator Networks},
author={Abigail See and Peter J. Liu and Christopher D. Manning},
year={2017},
eprint={1704.04368},
archivePrefix={arXiv},
url={https://arxiv.org/abs/1704.04368}
}
Contact
- r0b0tlab on Hugging Face: huggingface.co/r0b0tlab
- @mr-r0b0t on X: @mr-r0b0t
- Issues / questions: open an issue on the HF repo or contact via X.