AxionML Gemma-4-12B-NVFP4
Developed by AxionML for open-source serving and deployment use cases. Part of AxionML's effort to provide ready-to-serve quantized models for the community.
This is an NVFP4-quantized version of google/gemma-4-12B-it (11.95B params). It follows NVIDIA's own dense-Gemma-4 NVFP4 recipe (nvidia/Gemma-4-31B-IT-NVFP4): the MLP / feed-forward linear layers are quantized to NVFP4, while attention is kept in BF16. Gemma's attention activations carry large per-channel outliers that 4-bit activation quantization cannot represent, so, exactly as NVIDIA does for Gemma-4, only the FFN is taken to FP4. The result matches the BF16 baseline on GSM8K within run-to-run noise (0.9612 vs 0.9636; see Accuracy) while shrinking the checkpoint from ~24 GB (BF16) to ~11 GB.
Quantization Details
This model was quantized by applying NVFP4 to the weights and activations of the MLP (feed-forward) linear operators within the transformer blocks. Attention (q/k/v/o) is kept in BF16. The KV-cache is quantized to FP8 (E4M3). Embeddings, lm_head, and the multimodal (vision/audio) embedders are kept in their original BF16 precision.
| Property | Value |
|---|---|
| Quantization format | NVFP4, MLP-only (W4A4 on FFN, attention BF16), MSE weight calibration |
| Weight micro-block / group size | 16 (FP8 E4M3 block scales + per-tensor FP32 global) |
| KV-cache | FP8 (E4M3), calibrated |
| Calibration dataset | cnn_dailymail + nvidia/Nemotron-Post-Training-Dataset-v2 (ModelOpt cnn_nemotron_v2_mix default, 2048 samples) |
| Quantized checkpoint size | ~11 GB (vs ~24 GB BF16) |
| Tool | NVIDIA TensorRT Model Optimizer (0.45.0.dev158+gf9423c0d3, built from source) |
| Target hardware | Blackwell (B100/B200/B300, sm_100/103/120), native FP4 Tensor Cores |
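The size figures in the table can be sanity-checked with a little arithmetic. The sketch below is a back-of-envelope estimate, not a measurement: it assumes decimal gigabytes, treats every non-NVFP4 parameter as BF16, and ignores the per-tensor FP32 scales and KV-cache scales. The function names are illustrative only.

```python
# Back-of-envelope checkpoint-size arithmetic (illustrative, not a measurement).
PARAMS = 11.95e9           # parameter count quoted in this card
BF16_BITS = 16
# NVFP4 stores a 4-bit E2M1 value plus one 8-bit E4M3 scale per 16 elements.
NVFP4_BITS = 4 + 8 / 16    # 4.5 bits per weight (per-tensor FP32 scale ignored)


def size_gb(params: float, bits_per_param: float) -> float:
    """Size in decimal gigabytes for `params` values at `bits_per_param`."""
    return params * bits_per_param / 8 / 1e9


bf16_gb = size_gb(PARAMS, BF16_BITS)


def implied_fp4_fraction(total_gb: float) -> float:
    """Fraction of parameters that would have to be NVFP4 to reach `total_gb`.

    Assumption: all remaining parameters stay in BF16.
    """
    saved_bits_per_param = BF16_BITS - NVFP4_BITS
    return (bf16_gb - total_gb) * 8e9 / (PARAMS * saved_bits_per_param)


print(f"BF16 size: {bf16_gb:.1f} GB")
print(f"NVFP4 bits per weight: {NVFP4_BITS}")
print(f"Implied NVFP4 fraction for ~11 GB: {implied_fp4_fraction(11.0):.2f}")
```

Under these assumptions, the ~11 GB figure is consistent with roughly three quarters of the parameters sitting in the quantized MLP layers; the exact split is not stated in this card.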
Usage
Deploy with SGLang
Requires the SGLang branch described under "SGLang support" below, which adds transformers≥5.10 multimodal weight-name handling for Gemma-4. The FP8 sister checkpoint additionally needs that branch's fp8_pb_wo block-FP8 support.
sglang serve --model-path AxionML/Gemma-4-12B-NVFP4 \
--quantization modelopt_fp4 \
--kv-cache-dtype fp8_e4m3 \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--mem-fraction-static 0.85 \
--host 0.0.0.0 --port 30000
Speculative decoding (MTP / NEXTN)
Multi-Token Prediction with the paired google/gemma-4-12B-it-assistant draft
works on this quantized target with the SGLang branch below. Use the Triton
attention backend and load the draft unquantized:
sglang serve --model-path AxionML/Gemma-4-12B-NVFP4 \
--quantization modelopt_fp4 \
--kv-cache-dtype fp8_e4m3 \
--attention-backend triton \
--speculative-algorithm NEXTN \
--speculative-draft-model-path google/gemma-4-12B-it-assistant \
--speculative-draft-model-quantization unquant \
--speculative-num-steps 5 --speculative-num-draft-tokens 6 --speculative-eagle-topk 1 \
--reasoning-parser gemma4 --tool-call-parser gemma4 \
--mem-fraction-static 0.85 --host 0.0.0.0 --port 30000
MTP is lossless on GSM8K (see Accuracy). Earlier SGLang versions mis-loaded
ModelOpt's attention-projection scales (self_attn.{k,v}_proj.{k,v}_scale) as
the RadixAttention KV-cache scales, which corrupted the spec-decode verify
forward on quantized targets (degenerate output) while BF16 targets were fine.
The branch fix leaves gemma-4's KV scales at their identity default (1.0) —
correct, because gemma-4 writes K/V to the cache after q/k-norm and RoPE, so
the projection-output scales are the wrong descale factor. (The related
trtllm_mha SWA-pool crash,
sgl-project/sglang#26957,
is already fixed on main.)
Sampling defaults for Gemma 4: temperature=1.0, top_p=0.95, top_k=64. Thinking mode is off by default; enable with extra_body={"chat_template_kwargs": {"enable_thinking": True}}.
Smoke test:
curl http://localhost:30000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "default",
"messages": [{"role": "user", "content": "What is C. elegans?"}],
"temperature": 1.0, "top_p": 0.95, "top_k": 64, "max_tokens": 256
}'
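The same request can be built from Python. The sketch below uses only the standard library and applies the sampling defaults listed above. It assumes the server accepts chat_template_kwargs as a top-level field of the request body, which is where the OpenAI client's extra_body would place it; the helper names build_chat_request and send are hypothetical.

```python
import json
import urllib.request

# Sampling defaults quoted in this card for Gemma 4.
GEMMA4_SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 64}


def build_chat_request(prompt: str, enable_thinking: bool = False,
                       max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completions body for the server above."""
    body = {
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        **GEMMA4_SAMPLING,
    }
    if enable_thinking:
        # Assumption: chat_template_kwargs is read from the top level of the body.
        body["chat_template_kwargs"] = {"enable_thinking": True}
    return body


def send(body: dict, base_url: str = "http://localhost:30000") -> dict:
    """POST the body to /v1/chat/completions and return the parsed JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Printing the payload needs no running server; call send(...) once one is up.
print(json.dumps(build_chat_request("What is C. elegans?", enable_thinking=True), indent=2))
```

Keeping payload construction separate from the network call makes it easy to inspect or log the exact body before sending it.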
Reproduce with ModelOpt
python examples/llm_ptq/hf_ptq.py \
--pyt_ckpt_path google/gemma-4-12B-it \
--qformat nvfp4_mlp_only \
--weight_calib_algorithm mse \
--kv_cache_qformat fp8 \
--export_path ./gemma-4-12B-it-NVFP4 \
--trust_remote_code
(--weight_calib_algorithm mse is a small local addition to ModelOpt's hf_ptq.py that overrides the qformat's weight calibration to MSE; ModelOpt's stock NVFP4 uses max.)
About NVFP4
NVFP4 on Blackwell couples a compact E2M1 FP4 codebook with blockwise FP8 (E4M3) scaling over 16-element micro-blocks, so that 4-bit stored values stay numerically useful. The E2M1 codebook provides a small, nonuniform set of representable magnitudes up to ±6 and relies on saturating behavior rather than IEEE NaN/Inf encodings to maximize usable range per bit. Using an FP8 block scale (rather than power-of-two-only E8M0) enables fractional scales and error-minimizing scale selection (e.g. "map max to 6" vs "map max to 4 with clipping"). On Blackwell Tensor Cores, native FP4 multipliers exploit E2M1 simplicity to shrink multiplier area while FP32 accumulation protects dot-product accuracy.
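To make the codebook-plus-block-scale idea concrete, here is a minimal pure-Python sketch of fake-quantizing one 16-element micro-block. It is a simplification under stated assumptions: the block scale stays a Python float (real NVFP4 rounds it to FP8 E4M3 and applies a per-tensor FP32 scale on top), and ties are broken toward the smaller level rather than by IEEE rounding. The function names are illustrative and do not come from ModelOpt.

```python
# Positive magnitudes representable in E2M1 (the sign is a separate bit).
E2M1_LEVELS = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # NVFP4 micro-block size


def quantize_block(values, max_to=6.0):
    """Fake-quantize one micro-block, mapping its largest magnitude to `max_to`.

    Returns (scale, dequantized_values).
    """
    assert len(values) == BLOCK
    amax = max(abs(v) for v in values)
    scale = amax / max_to if amax > 0 else 1.0
    deq = []
    for v in values:
        target = min(abs(v) / scale, 6.0)  # saturate; E2M1 has no Inf/NaN
        level = min(E2M1_LEVELS, key=lambda q: abs(q - target))
        deq.append((level if v >= 0 else -level) * scale)
    return scale, deq


def sq_error(values, deq):
    """Sum of squared differences between original and dequantized values."""
    return sum((v - d) ** 2 for v, d in zip(values, deq))


block = [0.1 * i - 0.8 for i in range(BLOCK)]  # values from -0.8 to 0.7
for max_to in (6.0, 4.0):
    scale, deq = quantize_block(block, max_to)
    print(f"max -> {max_to}: scale={scale:.4f}, squared error={sq_error(block, deq):.5f}")
```

Because the block scale may take fractional values, both scale choices are expressible, and a calibrator can pick whichever gives the lower error for a given block.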
Why MLP-only on Gemma-4: unlike Llama/Qwen (where NVIDIA quantizes all linears to NVFP4), gemma-4's attention input — the residual stream — carries persistent per-channel activation outliers far larger than ±6×block-scale, so 4-bit activation quantization of q/k/v/o collapses the normal channels and destroys output quality. NVIDIA's shipped Gemma-4-31B-IT-NVFP4 and Gemma-4-26B-A4B-NVFP4 both keep attention BF16 and quantize only the FFN/experts; this checkpoint applies that same recipe to the dense 12B. The MLP is calibrated with MSE (sweeping the per-block scale to minimize ‖W − dequant(quant(W))‖²) rather than max-of-abs, for tighter weight tails.
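Both effects described above can be reproduced on toy data. The sketch below first shows how a single outlier in a block forces a coarse scale that rounds the remaining values to zero, then compares max-of-abs calibration with a simple scale sweep that minimizes squared error. The sweep range (0.5× to 1.0× of the max scale, 65 candidates) and the toy numbers are assumptions for illustration, not ModelOpt's actual search.

```python
E2M1_LEVELS = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quant(values, scale):
    """Round values/scale to the nearest E2M1 level (saturating at 6), then rescale."""
    out = []
    for v in values:
        target = min(abs(v) / scale, 6.0)
        level = min(E2M1_LEVELS, key=lambda q: abs(q - target))
        out.append((level if v >= 0 else -level) * scale)
    return out


def sq_error(values, scale):
    """Squared reconstruction error of `values` at the given block scale."""
    return sum((v - q) ** 2 for v, q in zip(values, fake_quant(values, scale)))


def max_scale(values):
    """Max-of-abs calibration: the largest magnitude maps to 6."""
    return max(abs(v) for v in values) / 6.0


def mse_scale(values, steps=64, lo=0.5):
    """Sweep scales in [lo, 1.0] x max-scale and keep the lowest-error candidate."""
    base = max_scale(values)
    candidates = [base * (lo + (1.0 - lo) * i / steps) for i in range(steps + 1)]
    return min(candidates, key=lambda s: sq_error(values, s))


# 1) Outlier collapse: one large channel sets the scale for the whole block.
acts = [0.3] * 15 + [60.0]
collapsed = fake_quant(acts, max_scale(acts))
print("with outlier:", collapsed)

# 2) MSE versus max calibration on a toy weight block.
weights = [0.9, -0.7, 0.5, 0.45, -0.4, 0.35, 0.3, -0.25,
           0.2, 0.15, -0.1, 0.1, 0.05, -0.05, 0.02, 0.0]
s_max, s_mse = max_scale(weights), mse_scale(weights)
print(f"max scale error: {sq_error(weights, s_max):.5f}")
print(f"mse scale error: {sq_error(weights, s_mse):.5f}")
```

The sweep always includes the max-of-abs scale as a candidate, so the selected scale can never do worse than max calibration on the block it was fitted to.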
About FP8 (sister checkpoint)
A companion AxionML/Gemma-4-12B-FP8 ships an FP8 variant: per-block 128×128 weight-only FP8 (E4M3) with bf16 activations and an FP8 KV-cache, MSE-calibrated. Weight-only is deliberate — for the same activation-outlier reason, per-tensor W8A8 (quantized activations) also degrades on gemma-4, so the FP8 model leaves activations in bf16 and quantizes only the weights. It serves on Hopper (H100/H200) and Blackwell via --quantization modelopt_fp8.
Accuracy
GSM8K (1319 questions, sgl-eval, greedy, served on SGLang):
| Model | GSM8K |
|---|---|
| google/gemma-4-12B-it (BF16) | 0.9636 |
| AxionML/Gemma-4-12B-NVFP4 (MLP-only, MSE) | 0.9612 |
| AxionML/Gemma-4-12B-NVFP4 + MTP (NEXTN) | 0.9644 |
| AxionML/Gemma-4-12B-FP8 (weight-only, MSE) | 0.9666 |
| AxionML/Gemma-4-12B-FP8 + MTP (NEXTN) | 0.9598 |
MTP (greedy, exact verify) is lossless within GSM8K run-to-run noise — accuracy holds with and without speculative decoding.
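One rough way to judge whether these differences are meaningful is to compare them with the binomial standard error of a 1319-question benchmark. This is a sketch under a simplifying assumption: it treats each question as an independent trial and uses sampling error as a proxy for run-to-run noise, which this card does not quantify directly.

```python
import math

N_QUESTIONS = 1319
BASELINE = 0.9636  # google/gemma-4-12B-it (BF16)
SCORES = {
    "NVFP4": 0.9612,
    "NVFP4 + MTP": 0.9644,
    "FP8": 0.9666,
    "FP8 + MTP": 0.9598,
}


def binomial_std_error(p: float, n: int) -> float:
    """Standard error of an accuracy estimate p measured on n independent items."""
    return math.sqrt(p * (1.0 - p) / n)


se = binomial_std_error(BASELINE, N_QUESTIONS)
print(f"standard error at p={BASELINE}: {se:.4f}")
for name, score in SCORES.items():
    delta = score - BASELINE
    print(f"{name:12s} delta={delta:+.4f} ({abs(delta) / se:.2f} standard errors)")
```

Every quantized configuration lands within one standard error (about half a percentage point) of the BF16 score, which is consistent with the "within noise" reading above.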
Performance (SPEED-Bench)
Latency/throughput measured with NVIDIA AIPerf on the nvidia/SPEED-Bench qualitative split (all 11 domains, 880 prompts each issued once, shuffle / seed 42), greedy, output capped at 512 tokens, OpenAI chat + streaming, one Blackwell GPU, served on the SGLang branch below. Prompts are short (ISL ≈ 145, OSL ≈ 410 tokens). MTP uses the google/gemma-4-12B-it-assistant NEXTN draft.
Concurrency 1 — single-stream latency (the low-latency serving regime):
| Config | TTFT (ms) | ITL (ms) | tok/s/user | accept len |
|---|---|---|---|---|
| gemma-4-12B-it BF16 | 19.4 | 6.47 | 154.6 | n/a |
| NVFP4 | 33.7 | 5.32 | 188.0 | n/a |
| NVFP4 + MTP | 32.5 | 3.10 | 337.1 | 3.50 |
- NVFP4 vs BF16: 1.22× single-stream tokens/s. Decoding is memory-bandwidth-bound, so the 11 GB weight footprint wins, although quantization raises TTFT from 19.4 ms to 33.7 ms.
- MTP on NVFP4: 1.79× tokens/s, ITL 1.72× lower (accept length 3.50 of 6 draft tokens).
- NVFP4 + MTP vs BF16 baseline: ≈ 2.18× single-stream tokens/s.
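The ratios in the bullets follow directly from the concurrency-1 table. A short sketch that recomputes them from the tabulated values (the dictionary layout is just for illustration):

```python
# Concurrency-1 figures copied from the table above.
TOK_S = {"BF16": 154.6, "NVFP4": 188.0, "NVFP4 + MTP": 337.1}
ITL_MS = {"BF16": 6.47, "NVFP4": 5.32, "NVFP4 + MTP": 3.10}

nvfp4_vs_bf16 = TOK_S["NVFP4"] / TOK_S["BF16"]
mtp_vs_nvfp4 = TOK_S["NVFP4 + MTP"] / TOK_S["NVFP4"]
itl_reduction = ITL_MS["NVFP4"] / ITL_MS["NVFP4 + MTP"]
mtp_vs_bf16 = TOK_S["NVFP4 + MTP"] / TOK_S["BF16"]

print(f"NVFP4 vs BF16 tokens/s:       {nvfp4_vs_bf16:.2f}x")
print(f"MTP on NVFP4 tokens/s:        {mtp_vs_nvfp4:.2f}x")
print(f"MTP on NVFP4 ITL reduction:   {itl_reduction:.2f}x")
print(f"NVFP4 + MTP vs BF16 tokens/s: {mtp_vs_bf16:.2f}x")
```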
Concurrency 32 — throughput (saturated / compute-bound):
| Config | agg tok/s | req/s | TTFT (ms) | accept len |
|---|---|---|---|---|
| gemma-4-12B-it BF16 | 3250 | 7.8 | 36 | n/a |
| NVFP4 | 2701 | 7.1 | 60 | n/a |
| NVFP4 + MTP | 3090 | 7.1 | 111 | 3.12 |
At saturation the GPU is compute-bound, so NVFP4's smaller weight footprint no longer helps: aggregate throughput is 0.83× that of BF16 dense GEMM. MTP recovers part of the gap (1.14× over NVFP4 without MTP, about 0.95× of BF16). Takeaway: NVFP4, especially with MTP, pays off most in the low-concurrency / latency-bound regime; at saturation, the quantized configurations land within about 17% of BF16 throughput.
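For the saturated case, the same kind of check can be run on the concurrency-32 aggregate throughput column. This sketch only recomputes ratios from the tabulated numbers:

```python
# Concurrency-32 aggregate throughput (tokens/s) copied from the table above.
AGG_TOK_S = {"BF16": 3250, "NVFP4": 2701, "NVFP4 + MTP": 3090}

nvfp4_vs_bf16 = AGG_TOK_S["NVFP4"] / AGG_TOK_S["BF16"]
mtp_vs_nvfp4 = AGG_TOK_S["NVFP4 + MTP"] / AGG_TOK_S["NVFP4"]
mtp_vs_bf16 = AGG_TOK_S["NVFP4 + MTP"] / AGG_TOK_S["BF16"]

print(f"NVFP4 vs BF16:         {nvfp4_vs_bf16:.2f}x")
print(f"NVFP4 + MTP vs NVFP4:  {mtp_vs_nvfp4:.2f}x")
print(f"NVFP4 + MTP vs BF16:   {mtp_vs_bf16:.2f}x")
```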
SGLang support
Gemma 4 (including the encoder-free unified 12B) is supported on SGLang main. Serving ModelOpt-quantized Gemma-4 additionally needs the branch below, which (1) remaps the embed_vision.* multimodal weight names emitted by a transformers≥5.10 ModelOpt re-export back to SGLang's vision_embedder.* / embed_vision.embedding_projection layout, and (2) adds fp8_pb_wo block-FP8 weight loading (used by the companion FP8 checkpoint). It also fixes speculative decoding (NEXTN/MTP) on quantized targets: SGLang must not load ModelOpt's attention-projection scales (self_attn.{k,v}_proj.{k,v}_scale) as the RadixAttention KV-cache {k,v}_scale — gemma-4 caches K/V post-norm/post-RoPE, so those are the wrong descale factor and corrupt the spec verify forward; the KV scales correctly default to 1.0.
# Editable install of the branch
git clone https://github.com/bzhng-development/sglang.git
cd sglang && git checkout gemma4-modelopt-ptq
pip install -e python
# transformers with Gemma 4 (encoder-free unified) support
pip install 'git+https://github.com/huggingface/transformers.git@1423d22f7a3b62e8c70ad67b58ec25cd9b675897'
Branch: bzhng-development/sglang@gemma4-modelopt-ptq (off sgl-project/sglang main).
Run with Docker (SGLang nightly)
Serving needs the SGLang branch, so base the container on a recent SGLang nightly image (lmsysorg/sglang:nightly-dev-YYYYMMDD-<hash>; cu13 variants exist for CUDA-13 hosts). The nightly already installs SGLang as an editable install rooted at /sgl-workspace/sglang, so the command below simply swaps that directory for the branch checkout, with no reinstall needed. It then pins the matching transformers, fetches the checkpoint, and starts the server, which listens at http://0.0.0.0:30000 (change --port to use a different port):
docker run --gpus all --shm-size=128g --network=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e HF_TOKEN=$HF_TOKEN \
lmsysorg/sglang:nightly-dev-20260604-14ed9b44 \
bash -lc '
cd / && rm -rf /sgl-workspace/sglang &&
git clone https://github.com/bzhng-development/sglang.git /sgl-workspace/sglang &&
cd /sgl-workspace/sglang && git checkout gemma4-modelopt-ptq &&
pip install "git+https://github.com/huggingface/transformers.git@1423d22f7a3b62e8c70ad67b58ec25cd9b675897" &&
python -m sglang.launch_server \
--model-path AxionML/Gemma-4-12B-NVFP4 \
--quantization modelopt_fp4 \
--kv-cache-dtype fp8_e4m3 \
--reasoning-parser gemma4 --tool-call-parser gemma4 \
--mem-fraction-static 0.85 \
--host 0.0.0.0 --port 30000
'
- --network=host publishes the server on the host's port 30000; alternatively drop it and use -p 30000:30000.
- For MTP / NEXTN, append the speculative flags from the Speculative decoding section above to the launch_server line (HF_TOKEN is then required, because the draft google/gemma-4-12B-it-assistant is gated).
- The leading cd / matters: the image's default workdir is /sgl-workspace/sglang, so rm -rf-ing it from inside that directory makes git fail with "Unable to read current working directory."
- Any newer lmsysorg/sglang:nightly-dev-* tag also works; each ships the same editable /sgl-workspace/sglang layout this relies on.
- libnvidia-ml.so: you may or may not need to mount the host NVML library. It is needed only if nvidia-smi inside the container reports a driver/library version mismatch. If so, add a mount matching your host driver (e.g. 580.82.07): -v /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.580.82.07:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:ro
ModelOpt install (editable, from source)
git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer && pip install -e ".[hf]" # commit f9423c0d3
Limitations
The base model was trained on data that may contain toxic language and societal biases. The quantized model inherits these limitations and may generate inaccurate, biased, or offensive content. Quantization can introduce additional deviations from the base model's behavior. Please refer to the original model card for full details.
Base model
google/gemma-4-12B-it is Google DeepMind's dense 11.95B-parameter Gemma 4 "Unified" (encoder-free) multimodal instruction-tuned model: text + image (+ audio) input, 256K context, hybrid sliding-window/global attention, configurable thinking mode, and native function calling. See the upstream card for full architecture, training data, evaluation, and responsible-AI details. This repository changes only numeric precision (NVFP4 weights and activations in the MLP, plus an FP8 KV-cache); all capabilities, the chat template, and the tokenizer are inherited unchanged.