Instructions to use coolthor/gemma-4-12B-it-NVFP4A16 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use coolthor/gemma-4-12B-it-NVFP4A16 with Transformers:
# Load model directly from transformers import AutoProcessor, AutoModelForImageTextToText processor = AutoProcessor.from_pretrained("coolthor/gemma-4-12B-it-NVFP4A16") model = AutoModelForImageTextToText.from_pretrained("coolthor/gemma-4-12B-it-NVFP4A16") - Notebooks
- Google Colab
- Kaggle
Gemma 4 12B-it — NVFP4 weight-only (W4A16)
Self-quantized weight-only NVFP4 of google/gemma-4-12B-it — Google's encoder-free omni model (text + image + audio + video). Quantized and benchmarked on an NVIDIA DGX Spark (GB10, sm_121a).
TL;DR: 7.7 GB on disk (from 23 GB BF16), 24.9 tok/s on a GB10 via vLLM, and all four modalities still work.
Why weight-only (W4A16), not W4A4
The "obvious" full NVFP4 (W4A4 = weight + activation both 4-bit) is worse on every axis for this model:
- It breaks the multimodal capabilities — image/audio/video collapse to empty or garbled output. W4A4 quantizes activations using text-only calibration, so the image/audio embeddings are out-of-distribution and get clipped by the 4-bit activation range.
- It is also slightly slower (23.9 vs 24.9 tok/s) — on a bandwidth-bound dense model, the weight-only dequant-to-BF16 path beats the W4A4 path.
So this build is weight-only NVFP4 (NVFP4A16): 4-bit weights, BF16 activations. Omni intact, faster, same file size.
Benchmark (GB10 / DGX Spark, vLLM 0.22.1 native, single-stream decode, warm)
| Format | Disk | tok/s (EN/ZH) | Omni |
|---|---|---|---|
| BF16 | 23 GB | 7.7 | yes |
| FP8 dynamic | 13 GB | 15.9 | yes |
| NVFP4 W4A4 | 7.7 GB | 23.9 | broken |
| NVFP4 W4A16 (this) | 7.7 GB | 24.9 | yes |
Note: in plain
transformers(HF eager) all quantized formats run slower than BF16 because there is no native FP4/FP8 kernel — the speedups above are real only under vLLM.
Quantization recipe
llmcompressor, scheme NVFP4A16, basic pipeline (the sequential pipeline hits a UserDict tracing error on this brand-new arch). The ignore list must match vLLM's native module quantization or the model will not load:
QuantizationModifier(targets="Linear", scheme="NVFP4A16",
ignore=["lm_head", "re:.*embedding_projection.*"])
i.e. quantize the text tower + the vision patch_dense; keep lm_head and both embedding_projections in BF16.
Serving (vLLM)
Needs vLLM with native Gemma4UnifiedForConditionalGeneration (~0.22.x / main) and the TRITON_ATTN backend — Gemma 4 has heterogeneous head dims (head_dim 256 x 16 heads = 4096 != hidden 3840) that other attention backends mishandle:
VLLM_ATTENTION_BACKEND=TRITON_ATTN \
vllm serve coolthor/gemma-4-12B-it-NVFP4A16 --max-model-len 4096
Text and image serve through vLLM today; vLLM's generic multimodal wrapper is image-only for now, so full audio/video serving is pending upstream. All four modalities work through transformers.
Environment (exact versions — this model is version-sensitive)
This is a brand-new arch on a brand-new GPU, so the toolchain matters more than usual. The versions I actually ran:
| Component | Version | Why it matters |
|---|---|---|
| vLLM | 0.22.1rc1.dev124 (main, post-PR) |
Needs the native Gemma4UnifiedForConditionalGeneration class, which only landed around 0.22.x/main. On an older vLLM it falls back to the generic transformers backend, which mishandles Gemma 4's non-square attention and crashes on o_proj. |
| transformers | 5.10.1 |
First release that knows model_type: gemma4_unified. Older transformers can't even load the config. |
| torch | 2.11.0+cu130 |
The one that bit me. vLLM main pins torch==2.10, but its _C.abi3.so was compiled against 2.11+cu130 — installing the pinned 2.10 (and pip silently pulling the CPU wheel on arm64) gives an undefined symbol import error and a CPU-only build. Force-align: pip install --force-reinstall --no-deps torch==2.11.0 --index-url https://download.pytorch.org/whl/cu130. |
| compressed-tensors | bundled with above llmcompressor | Reads the NVFP4 weight format. |
| GPU / arch | DGX Spark GB10, sm_121a, CUDA 13.x |
The torch-ABI dance above is specific to building vLLM from source for sm_121. |
| Attention backend | VLLM_ATTENTION_BACKEND=TRITON_ATTN |
Required, not optional — see the head-dim note above. |
On a normal CUDA GPU (Hopper/Ada/Blackwell desktop) you don't need the torch-ABI overlay — that pain is specific to building vLLM from source for sm_121. A recent pip install vllm (with the native Gemma4Unified class) plus VLLM_ATTENTION_BACKEND=TRITON_ATTN is enough.
The exact quantization config is in recipe.yaml in this repo (scheme + ignore list), so you can reproduce the build.
Validation (GB10, transformers)
- Text: coherent EN + ZH.
- Image: accurately described a studio-podcast photo (animals, headphones, studio mics, "ON AIR" sign, laptops).
- Audio: transcribed a LibriSpeech clip — "Mr. Quilter is the apostle of the middle classes...".
- Video: correctly described a night-street clip.
Credits
- Base model:
google/gemma-4-12B-it(Apache 2.0, Google DeepMind) - Quantization:
llmcompressor+compressed-tensors - Quantized & benchmarked by coolthor on a DGX Spark (GB10)
Support
One-person effort on a single DGX Spark, no sponsor. If it saved you time, a coffee ☕ is appreciated.
- Downloads last month
- -