gemma-4-A4B-98e-v7-coder — NVFP4A16 (vLLM)
NVFP4A16 quantization of ManniX-ITA/gemma-4-A4B-98e-v7-coder-it, the science-augmented code prune of Gemma 4 26B-A4B (128→98 experts/layer, ~20.8B). 2 shards, ~13.4 GB — fits a single 16 GB GPU with room for KV-cache.
NVFP4A16 = 4-bit NVFP4 weights with FP8 (e4m3) block scales and bf16
activations (A16). Quantized with NVIDIA
TensorRT-Model-Optimizer (modelopt)
main (0.45.0.dev), whose _QuantFusedExperts plugin handles Gemma 4's fused MoE
experts. Every exported weight/scale tensor is finiteness-verified before upload.
Serve with vLLM
python -m vllm.entrypoints.openai.api_server \
--model ManniX-ITA/gemma-4-A4B-98e-v7-coder-NVFP4A16 \
--served-model-name v7-coder \
--port 8000 \
--gpu-memory-utilization 0.92 \
--max-model-len 65536 \
--max-num-batched-tokens 8192 \
--dtype bfloat16 \
--trust-remote-code \
--reasoning-parser gemma4 \
--default-chat-template-kwargs '{"enable_thinking": true}'
Gemma 4 gotchas (all required): --max-num-batched-tokens 8192 (the MM-encoder
budget; the default 2048 < max_tokens_per_mm_item and crashes at boot),
--max-model-len 65536 (templates may request up to ~49k gen tokens), and the
gemma4 reasoning parser for the thinking format. NVFP4 needs a Blackwell/Hopper
or Ada GPU on a recent vLLM.
Benchmarks
NVFP4A16 is a deployment format and is not separately benchmarked (cohort
policy). The table is the cohort's same-host Q6_K · llama.cpp · greedy
reference (temperature 0.0, top_p 1.0, top_k 0), read from summary.json;
it is representative of this model's quality. Row-max in bold. This repo =
v7-coder.
| Benchmark | 128e (unpruned) | v6-coder | v7-coder | v7-coderx |
|---|---|---|---|---|
| GPQA-diamond (198q) | 67.17 | 61.11 | 70.71 | 48.48 |
| AIME (30q) | 73.33 | 56.67 | 76.67 | 70.00 |
| MATH500 (100q) | 92.00 | 89.00 | 92.00 | 89.00 |
| GSM8K (100q) | 89.00 | 88.00 | 93.00 | 91.00 |
| ARC-Challenge (full) | 96.50 | 95.39 | 94.80 | 94.28 |
| IFEval (100q, strict) | 97.00 | 92.00 | 95.00 | 95.00 |
| HumanEval (164) | 97.56 | 98.17 | 98.78 | 95.73 |
| HumanEval+ (164) | 92.07 | 92.68 | 92.68 | 92.68 |
| LCB-medium-55 | 96.36 | 92.73 | 96.36 | 98.18 |
| LCB-medium-100 | 97.00 | 94.00 | 97.00 | 99.00 |
| MultiPL-E (100) | 90.00 | 89.00 | 88.67 | 90.00 |
Metrics: GPQA & GSM8K = exact_match flexible-extract · MATH500 = math_verify ·
ARC & AIME = exact_match · IFEval = prompt_level_strict_acc · HumanEval/+ = pass@1
chat-extract · LCB-55/100 & MultiPL-E = pass@1. 128e uses the lcb_medium_55/100
templates; the prunes use lcb_medium_*_v4 (corrected harness, equivalent task).
F16 — unquantized precision reference
The same 11 benches on the unquantized F16 GGUF (llama.cpp, greedy, same host; LCB caps are intrinsic — a 24k-budget retry moved no cell). F16 (~41 GB) is a precision anchor, not a shipped tier — deploy the quants above. v6-coder has no F16 run. Row-max across the three F16 columns in bold; ↓ = below that model's own Q6_K.
| Benchmark | 128e (F16) | v7-coder (F16) | v7-coderx (F16) |
|---|---|---|---|
| GPQA-diamond (198q) | 71.21 | 73.74 | 51.01 |
| AIME (30q) | 76.67 | 76.67 | 76.67 |
| MATH500 (100q) | 94.00 | 91.00↓ | 94.00 |
| GSM8K (100q) | 91.00 | 91.00↓ | 90.00↓ |
| ARC-Challenge (full) | 96.42↓ | 94.71↓ | 93.60↓ |
| IFEval (100q, strict) | 97.00 | 96.00 | 92.00↓ |
| HumanEval (164) | 96.95↓ | 98.17↓ | 96.34 |
| HumanEval+ (164) | 90.85↓ | 92.68 | 90.24↓ |
| LCB-medium-55 | 90.91↓ | 92.73↓ | 90.91↓ |
| LCB-medium-100 | 92.00↓ | 93.00↓ | 92.00↓ |
| MultiPL-E (100) | 90.33 | 88.67 | 87.33↓ |
F16 = unquantized f16 GGUF, same greedy recipe and same host as the Q6_K table. A ↓ marks an F16 cell that scores below that model's own Q6_K — a small-margin greedy weight-precision divergence in which Q6_K mildly regularizes the decode tail. It is broadest and largest on v7-coderx (the most aggressively code-pruned variant: HE+, MultiPL-E, LCB) and is an intrinsic property, not a build defect — the F16 bins are bit-identical to fresh F16 rebuilt from the canonical bf16. Deploy and rank on the shipped quant (Q6_K and below), where v7-coderx ≥ v7-coder on code as designed; F16 is a reference only.
Recipe (summary)
98e prune from 128e via the fs2440 code recipe (generic_code 3×,
targeted_lcb_medium_55 2×, [24,40] per-layer floor, strategy=max +
breadth_bonus) plus targeted_gpqa 1.5×, then the mandatory shared-FFN α=1.2
upweight, then NVFP4A16 quantization. Full recipe, calibration-class table, and the
Q6_K provenance are on the
bf16 card.
Intended use & limitations
A compact, vLLM-deployable Gemma 4 checkpoint for agentic coding and code
reasoning that also needs solid graduate science — the science-augmented member
of the v7-coder cohort. For maximal raw coding throughput with science at baseline,
use v7-coderx.
A research prune, not an official Google release; generic_multilingual is
de-weighted (0×). GPQA/AIME/GSM8K parity-with-128e is measured on small benches
(30–198q) at greedy and carries run-to-run variance. For llama.cpp/CPU deployment
use the GGUF repo.
Lineage
128e → (v4 → v5 → v6-coder code line) → v7 competence-map rebuild → fs2440 code
floor → +targeted_gpqa 1.5× = v7-coder → NVFP4A16. Built and evaluated on the
omnimergekit toolchain.
- Downloads last month
- 47
Model tree for ManniX-ITA/gemma-4-A4B-98e-v7-coder-NVFP4A16
Base model
google/gemma-4-26B-A4B