Gemma 4 12B IT NVFP4 — r0b0tlab v0 release

v0 — quantization artifact, no engine-side verification yet. This release contains the NVFP4 (W4A4) quantization of google/gemma-4-12B-it produced with NVIDIA Model Optimizer. The artifact is complete and self-consistent; we have not yet verified a full inference-engine run end-to-end on this checkpoint (see "Engine support" below). A v0.1 follow-up will ship with throughput, latency, and wikitext-2 perplexity numbers once the engine side is wired up.

Engine support (status as of 2026-06-03)

| Engine | Status |
|---|---|
| `transformers` (≥ dev main) | Loads the BF16 base model. Cannot load NVFP4 packed weights (uint8 FP4). |
| vLLM (≥ 0.22.0) | Blocked: `Gemma4UnifiedForConditionalGeneration` is not in vLLM's model registry; it falls back to `TransformersMultiModalForCausalLM`, which crashes inside `flashinfer_scaled_fp4_mm` with a 3D→2D activation shape mismatch. We are working on a custom registry registration. |
| SGLang (dev image) | Blocked: same registry gap as vLLM, plus a deeper issue: SGLang's `Gemma4DecoderLayer` does not match the 12B Unified's full-attention layer shape (`head_dim=512`, no `v_proj` because `attention_k_eq_v=True`). |
| TensorRT-LLM | Not yet evaluated. |
| llama.cpp / GGUF | Not yet evaluated. |

Practical advice right now: if you want to use this checkpoint, the cleanest path is to load it in transformers (dev main) and dequantize the FP4 weights to BF16 yourself, then run inference. This loses the speed benefit of FP4 but lets you validate the model. A v0.1 follow-up will publish a working engine path.

Credits and Attribution

This checkpoint was produced by r0b0tlab (@mr-r0b0t on X). It is a derived work built on the following projects, models, datasets, and tools, all of which deserve direct credit:

Base model

google/gemma-4-12B-it — Google DeepMind. The Gemma 4 12B Unified instruction-tuned multimodal model. The architecture is Gemma4UnifiedForConditionalGeneration, a 48-layer dense 11.96B-parameter model with hybrid sliding-window + global attention, raw-patch image and raw-waveform audio projection, and 256K context.

Quantization tool

NVIDIA Model Optimizer (formerly TensorRT Model Optimizer). The PTQ (post-training quantization) library used to convert the BF16 weights and activations to NVFP4. Version used: 0.44.0. The library is part of NVIDIA's inference optimization stack and is integrated with vLLM, SGLang, TensorRT-LLM, and the Megatron training frameworks.

Calibration data

abisee/cnn_dailymail — Abigail See, Peter J. Liu, Christopher D. Manning. Get To The Point: Summarization with Pointer-Generator Networks. arXiv:1704.04368, 2017. ~300,000 unique English news articles from CNN and the Daily Mail. Licensed under Apache 2.0. This is the de-facto standard calibration set for NVIDIA's NVFP4 checkpoints (used for nvidia/Gemma-4-31B-IT-NVFP4 and most other NVIDIA-published NVFP4 models).

Prior art (the patterns we adapted)

bg-digitalservices/quantize_gemma4_moe.py — the quantization script that this work adapts. The 6-step pipeline (load → apply exclusion → calibrate → quantize → export → copy auxiliary files) is borrowed directly. The MoE plugin classes are removed because the 12B Unified is dense (no MoE). The multimodal exclusion pattern is the intellectual seed of the exclusion list below.
bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4 — the published Gemma 4 26B MoE NVFP4 checkpoint that demonstrated ModelOpt NVFP4 + vLLM is viable.

Inference engine (planned)

vLLM — the target inference engine. vLLM 0.22.0+ natively supports modelopt_fp4 quantization via --quantization modelopt_fp4. We are working on a custom model registration for Gemma4UnifiedForConditionalGeneration.

Model loading and multimodal processor

Hugging Face transformers (≥ 5.10.0.dev0) — the loader for Gemma4UnifiedForConditionalGeneration via AutoModelForImageTextToText and the multimodal processor via AutoProcessor.

Quantization format

NVFP4 is NVIDIA's 4-bit floating-point format: FP4 weights with FP8 E4M3 per-block scales and an FP32 per-tensor global scale. It is recorded in hf_quant_config.json as quant_algo: NVFP4.
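To make the layout above concrete, here is a minimal pure-Python sketch of dequantizing one row of NVFP4 weights. The E2M1 magnitude table is the standard FP4 encoding. The nibble order (even element in the low nibble), the helper names `decode_fp4` and `dequantize_row`, and the rule "value × block scale × global scale" are assumptions for illustration; check them against the exporter before relying on them. Block scales are passed as plain floats here rather than as raw FP8 E4M3 bytes.

```python
from typing import List, Sequence

# Magnitudes representable by the low 3 bits of an FP4 E2M1 code.
# Bit 3 of the 4-bit code is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # matches group_size in hf_quant_config.json


def decode_fp4(code: int) -> float:
    """Decode a single 4-bit E2M1 code into a Python float."""
    magnitude = E2M1_MAGNITUDES[code & 0x07]
    return -magnitude if code & 0x08 else magnitude


def dequantize_row(packed: bytes,
                   block_scales: Sequence[float],
                   global_scale: float) -> List[float]:
    """Dequantize one row of packed FP4 weights.

    Assumptions (not confirmed by the model card):
      * each byte holds two codes, even element in the low nibble;
      * weight = fp4_value * block_scale * global_scale.
    """
    values: List[float] = []
    for byte in packed:
        values.append(decode_fp4(byte & 0x0F))  # element 2i
        values.append(decode_fp4(byte >> 4))    # element 2i + 1
    return [v * block_scales[i // BLOCK_SIZE] * global_scale
            for i, v in enumerate(values)]


# One block of 16 weights packed into 8 bytes, with a single block scale.
row = dequantize_row(bytes([0x21, 0xF9] + [0] * 6), [2.0], 0.5)
print(row[:4])
```

A real implementation would vectorize this with NumPy or PyTorch and decode the FP8 scales from the checkpoint tensors; the loop form is only meant to show the arithmetic.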

Model overview

| Property | Value |
|---|---|
| Base model | `google/gemma-4-12B-it` |
| Architecture | `Gemma4UnifiedForConditionalGeneration` (encoder-free multimodal) |
| Parameters | 11.96B total |
| Active parameters | 11.96B (dense, no MoE) |
| Context length | 256K tokens (config) |
| Modalities | Text, Image, Audio |
| Vocabulary | 262,144 tokens |
| Layers | 48 (40 sliding-window + 8 global attention) |
| Hidden size | 3,840 |
| Intermediate size | 15,360 |
| Attention heads | 16 query, 8 KV (head_dim 256 sliding, 512 global) |
| Sliding window | 1,024 tokens (5:1 sliding:global ratio) |
| Positional encoding | Standard RoPE (sliding) + Proportional RoPE (global) |
| Multimodal design | Raw image patches and audio waveforms are projected into the LLM embedding space via small linear layers (no separate vision/audio encoders) |
| Quantization | NVFP4 (W4A4), NVIDIA Model Optimizer v0.44.0 |
| Quantized layers | All LLM attention (Q, K, O) + LLM MLP (gate, up, down) = 11.0B params |
| Excluded layers | Vision embedder, vision projection, audio projection, vocab embedding, all norms, per-layer scalars = 1.0B params (mostly the vocab embedding) |
| Compression | BF16 23.95 GB → NVFP4 8.28 GB (2.89× smaller) |
| Tensor types | FP4 (weights) + FP8 E4M3 (per-block scales) + FP32 (per-tensor global scale) + BF16 (excluded layers) |
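The compression figure can be sanity-checked from the parameter counts in the table. The sketch below is a back-of-the-envelope estimate, not the exporter's accounting: it assumes 4 bits per quantized weight, one FP8 byte per 16-weight block, BF16 for everything excluded, and it ignores the FP32 per-tensor scales and file metadata. The function name `estimate_sizes_gb` is hypothetical.

```python
GB = 1e9  # decimal gigabytes

TOTAL_PARAMS = 11.96e9     # "Parameters" row
QUANTIZED_PARAMS = 11.0e9  # "Quantized layers" row (rounded in the card)
GROUP_SIZE = 16            # NVFP4 block size


def estimate_sizes_gb(total_params: float,
                      quantized_params: float,
                      group_size: int = GROUP_SIZE) -> tuple:
    """Return (bf16_gb, nvfp4_gb) under the assumptions stated above."""
    excluded_params = total_params - quantized_params
    bf16_bytes = total_params * 2                 # 2 bytes per BF16 weight
    fp4_bytes = quantized_params * 0.5            # 4 bits per FP4 weight
    scale_bytes = quantized_params / group_size   # 1 FP8 byte per block
    nvfp4_bytes = fp4_bytes + scale_bytes + excluded_params * 2
    return bf16_bytes / GB, nvfp4_bytes / GB


bf16_gb, nvfp4_gb = estimate_sizes_gb(TOTAL_PARAMS, QUANTIZED_PARAMS)
print(f"BF16 ~{bf16_gb:.2f} GB, NVFP4 ~{nvfp4_gb:.2f} GB, "
      f"ratio ~{bf16_gb / nvfp4_gb:.2f}x")
```

With the rounded counts from the table, the estimate lands within a few percent of the reported 23.95 GB and 8.28 GB; the gap is expected because the inputs are rounded and the small tensors are ignored.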

What's quantized vs preserved

Quantized to NVFP4 (W4A4, FP4 weights and FP4 activations)

model.language_model.layers.{0-47}.self_attn.q_proj.weight (48 tensors)
model.language_model.layers.{0-47}.self_attn.k_proj.weight (no separate v_proj; attention_k_eq_v=True means V is a copy of K)
model.language_model.layers.{0-47}.self_attn.o_proj.weight (48 tensors)
model.language_model.layers.{0-47}.mlp.gate_proj.weight (48 tensors)
model.language_model.layers.{0-47}.mlp.up_proj.weight (48 tensors)
model.language_model.layers.{0-47}.mlp.down_proj.weight (48 tensors)

Total quantized: 328 weight tensors (~11.0B params).

Preserved in BF16 (excluded from quantization)

| Module | Tensors | Reason |
|---|---|---|
| `model.embed_vision.*` (patch_dense, patch_ln1, patch_ln2, pos_norm, pos_embedding) | 9 | Patch tokenizer; high numerical sensitivity |
| `model.embed_vision.embedding_projection.weight` | 1 | Vision→LLM projection (6912→3840) |
| `model.embed_audio.embedding_projection.weight` | 1 | Audio→LLM projection (640→3840) |
| `model.language_model.embed_tokens.weight` | 1 | Vocab embedding [262144, 3840]; excluded by ModelOpt's default config |
| `model.language_model.layers.*.layer_scalar` | 48 | Per-layer scalar (1D) |
| `model.language_model.layers.*.input_layernorm.weight` | 48 | RMS norm (1D) |
| `model.language_model.layers.*.post_attention_layernorm.weight` | 48 | RMS norm (1D) |
| `model.language_model.layers.*.pre_feedforward_layernorm.weight` | 48 | RMS norm (1D) |
| `model.language_model.layers.*.post_feedforward_layernorm.weight` | 48 | RMS norm (1D) |
| `model.language_model.layers.*.self_attn.k_norm.weight` | 48 | RMS norm on K (1D) |
| `model.language_model.layers.*.self_attn.q_norm.weight` | 48 | RMS norm on Q (1D) |
| `model.language_model.norm.weight` | 1 | Final norm (1D) |

The full exclusion list is in hf_quant_config.json:

"exclude_modules": [
  "lm_head",
  "model.embed_audio*",
  "model.embed_vision*"
]

ModelOpt's default config also excludes norms, biases, and the vocab embedding, so those do not appear in the list; the three patterns above are the entries written explicitly to the exported config.
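As an illustration of how these patterns select modules, the sketch below matches module names against the three entries with shell-style globbing from Python's standard library. Treating the entries as `fnmatch` globs is an assumption; ModelOpt and the inference engines may apply their own matching rules (for example prefix matching). The helper name `is_excluded` is hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Sequence

# Patterns copied from hf_quant_config.json.
EXCLUDE_MODULES = ["lm_head", "model.embed_audio*", "model.embed_vision*"]


def is_excluded(module_name: str,
                patterns: Sequence[str] = EXCLUDE_MODULES) -> bool:
    """Return True if the module name matches any exclusion pattern.

    Assumption: patterns are case-sensitive shell-style globs.
    """
    return any(fnmatchcase(module_name, pattern) for pattern in patterns)


for name in (
    "model.embed_vision.embedding_projection",
    "model.embed_audio.embedding_projection",
    "model.language_model.layers.0.self_attn.q_proj",
):
    print(f"{name}: {'BF16 (excluded)' if is_excluded(name) else 'NVFP4'}")
```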

Calibration details

Calibration set: abisee/cnn_dailymail (3.0.0)
Number of samples: 512 (text-only forward pass)
Sequence length: 1,024 tokens
Batch size: 4
Forward loop: model(input_ids=batch) only
Why text-only calibration: the multimodal pipeline (vision embedder + projection, audio projection) is excluded from quantization, so the calibration data does not need to be multimodal. This is the same approach used by all NVIDIA-published NVFP4 checkpoints.
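The settings above can be sketched as a small batching helper plus a forward loop. This is an illustration under stated assumptions, not the script used for this release: `make_batches` and `make_forward_loop` are hypothetical names, the inputs are already-tokenized lists of token IDs, and a real run would pad each batch into a tensor on the model's device before the forward pass.

```python
from typing import Callable, List, Sequence

NUM_SAMPLES = 512   # calibration samples
SEQ_LEN = 1024      # tokens kept per sample
BATCH_SIZE = 4


def make_batches(tokenized: Sequence[Sequence[int]],
                 num_samples: int = NUM_SAMPLES,
                 seq_len: int = SEQ_LEN,
                 batch_size: int = BATCH_SIZE) -> List[List[List[int]]]:
    """Keep the first num_samples samples, truncate each to seq_len tokens,
    and group them into batches of batch_size."""
    selected = [list(ids[:seq_len]) for ids in tokenized[:num_samples]]
    return [selected[i:i + batch_size]
            for i in range(0, len(selected), batch_size)]


def make_forward_loop(batches: List[List[List[int]]]) -> Callable:
    """Build a text-only forward loop: model(input_ids=batch) and nothing else."""
    def forward_loop(model) -> None:
        for batch in batches:
            model(input_ids=batch)
    return forward_loop


# Stand-in data and model so the sketch runs without a GPU or dataset.
fake_tokenized = [list(range(2000)) for _ in range(600)]
batches = make_batches(fake_tokenized)
seen = []
make_forward_loop(batches)(lambda input_ids: seen.append(len(input_ids)))
print(len(batches), "batches of", seen[0], "sequences")
```

In ModelOpt, a loop of this shape is what gets passed as the `forward_loop` argument of `modelopt.torch.quantization.quantize`; the exact call and config object used for this checkpoint are not shown in the card.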

Quantization config (exact)

The hf_quant_config.json in this repo records:

{
  "producer": {"name": "modelopt", "version": "0.44.0"},
  "quant_method": "modelopt_fp4",
  "quantization": {
    "quant_algo": "NVFP4",
    "kv_cache_quant_algo": null,
    "group_size": 16,
    "exclude_modules": [
      "lm_head",
      "model.embed_audio*",
      "model.embed_vision*"
    ]
  }
}
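Before pointing an engine at the checkpoint, it can help to confirm that the config on disk matches what this card describes. The sketch below parses the JSON and reports any mismatches. The function name `check_quant_config` and the set of checks are assumptions chosen for illustration; the config text is embedded as a string so the example runs on its own, whereas in practice it would be read from hf_quant_config.json.

```python
import json
from typing import List

# Same content as the config shown above, embedded for a self-contained run.
CONFIG_TEXT = """
{
  "producer": {"name": "modelopt", "version": "0.44.0"},
  "quant_method": "modelopt_fp4",
  "quantization": {
    "quant_algo": "NVFP4",
    "kv_cache_quant_algo": null,
    "group_size": 16,
    "exclude_modules": ["lm_head", "model.embed_audio*", "model.embed_vision*"]
  }
}
"""


def check_quant_config(text: str) -> List[str]:
    """Return a list of human-readable mismatches (empty if all checks pass)."""
    config = json.loads(text)
    quant = config.get("quantization", {})
    problems: List[str] = []
    if config.get("quant_method") != "modelopt_fp4":
        problems.append(f"unexpected quant_method: {config.get('quant_method')!r}")
    if quant.get("quant_algo") != "NVFP4":
        problems.append(f"unexpected quant_algo: {quant.get('quant_algo')!r}")
    if quant.get("group_size") != 16:
        problems.append(f"unexpected group_size: {quant.get('group_size')!r}")
    if quant.get("kv_cache_quant_algo") is not None:
        problems.append("KV cache quantization is unexpectedly enabled")
    return problems


print(check_quant_config(CONFIG_TEXT) or "config matches the card")
```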

Quality (deferred to v0.1)

We have not run a full PPL or benchmark comparison in this v0 release. Based on NVIDIA's published NVFP4 model cards (e.g. nvidia/Gemma-4-31B-IT-NVFP4, which reports 0.2–0.4pp loss across GPQA Diamond, AIME 2025, MMLU Pro, LiveCodeBench, Scicode, and Terminal-Bench Hard), the expectation is that NVFP4 retains >99% of BF16 accuracy. The 12B Unified is a different architecture from the 31B, so we do not claim parity; a wikitext-2 PPL comparison and a small multimodal smoke test are planned for v0.1.

How to use

With transformers (for direct use / research)

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "r0b0tlab/gemma-4-12B-it-nvfp4"
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

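# Text-only prompt in the chat-message format the processor expects
msgs = [{"role": "user", "content": [{"type": "text",
          "text": "What is the capital of France?"}]}]
# Apply the chat template and tokenize in one step, then move to the model's device
inputs = processor.apply_chat_template(
    msgs, tokenize=True, return_dict=True, return_tensors="pt",
    add_generation_prompt=True).to(model.device)
# Greedy decoding (do_sample=False) for a deterministic answer
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt
print(processor.decode(output[0][inputs.input_ids.shape[-1]:],
                       skip_special_tokens=True))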

Caveat: this loads the BF16 base architecture. Loading the NVFP4 packed weights requires an engine with FP4 support. See "Engine support" above.

With vLLM (planned, not yet working)

# The command we expect to work once the engine is fixed:
vllm serve r0b0tlab/gemma-4-12B-it-nvfp4 \
  --quantization modelopt_fp4 \
  --tensor-parallel-size 1 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.85

Model lineage

google/gemma-4-12B (Base, BF16)
    └── google/gemma-4-12B-it (Instruction-tuned, BF16)
            └── r0b0tlab/gemma-4-12B-it-nvfp4 (this model, NVFP4)

License

This is a derived work.

Quantization: © 2026 r0b0tlab (@mr-r0b0t). The quantization script and configuration choices are released under Apache 2.0.
Distributed under: Apache License 2.0.

Notes and limitations

Engine support is incomplete. See the status table at the top of this card. v0 ships the quantization artifact only; v0.1 will ship with a working engine path and benchmark numbers.
Multimodal sub-modules are preserved in BF16. The vision embedder (~35M), vision projection (~15M), and audio projection (~2.5M) are not quantized. This is a conservative choice; quantizing them would save < 100 MB and we judged the numerical risk of degrading multimodal understanding unacceptable.
Calibration is text-only. Following the NVIDIA NVFP4 standard, the calibration forward loop is text-only.
No fine-tuning was performed. This is a pure PTQ (post-training quantization) checkpoint; no QAT (quantization-aware training) or LoRA adapters are included.
Hardware requirements. NVFP4 requires an NVIDIA GPU with native FP4 tensor-core execution. On GPUs without native FP4, the engine will fall back to an emulation backend which is significantly slower.
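A quick way to tell which side of the hardware requirement a machine falls on is to look at the GPU's CUDA compute capability. The sketch below assumes that native FP4 tensor cores correspond to a compute-capability major version of 10 or higher (Blackwell-class GPUs); that threshold and the helper names are assumptions not stated in this card, so verify them against NVIDIA's documentation for your device.

```python
from typing import Tuple

# Assumption: GPUs with native FP4 tensor cores report compute capability
# 10.x or higher; Hopper (9.0) and older do not.
MIN_FP4_MAJOR = 10


def has_native_fp4(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability meets the assumed threshold."""
    major, _minor = capability
    return major >= MIN_FP4_MAJOR


def current_gpu_has_native_fp4() -> bool:
    """Check the first visible CUDA device; False if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    return has_native_fp4(torch.cuda.get_device_capability())


print("native FP4 expected:", current_gpu_has_native_fp4())
```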

How to cite this model

@misc{r0b0tlab_gemma4_12b_nvfp4_2026,
  title={Gemma 4 12B IT NVFP4 (r0b0tlab native optimization, v0)},
  author={r0b0tlab},
  year={2026},
  howpublished={Hugging Face},
  note={NVFP4 quantization of google/gemma-4-12B-it via NVIDIA Model Optimizer v0.44.0},
  url={https://huggingface.co/r0b0tlab/gemma-4-12B-it-nvfp4}
}

@misc{google_gemma4_12b_2026,
  title={Gemma 4 12B (Unified)},
  author={Google DeepMind},
  year={2026},
  howpublished={Hugging Face},
  url={https://huggingface.co/google/gemma-4-12B}
}

@software{nvidia_modelopt_2026,
  title={TensorRT Model Optimizer},
  author={NVIDIA},
  year={2026},
  url={https://github.com/NVIDIA/TensorRT-Model-Optimizer}
}

@misc{cnn_dailymail_2017,
  title={Get To The Point: Summarization with Pointer-Generator Networks},
  author={Abigail See and Peter J. Liu and Christopher D. Manning},
  year={2017},
  eprint={1704.04368},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/1704.04368}
}