gemma-4-A4B-98e-v7-coder — NVFP4A16 (vLLM)

NVFP4A16 quantization of ManniX-ITA/gemma-4-A4B-98e-v7-coder-it, the science-augmented code prune of Gemma 4 26B-A4B (128→98 experts/layer, ~20.8B). 2 shards, ~13.4 GB — fits a single 16 GB GPU with room for KV-cache.

NVFP4A16 = 4-bit NVFP4 weights with FP8 (e4m3) block scales and bf16 activations (A16). Quantized with NVIDIA TensorRT-Model-Optimizer (modelopt) main (0.45.0.dev), whose _QuantFusedExperts plugin handles Gemma 4's fused MoE experts. Every exported weight/scale tensor is finiteness-verified before upload.

Serve with vLLM

python -m vllm.entrypoints.openai.api_server \
    --model ManniX-ITA/gemma-4-A4B-98e-v7-coder-NVFP4A16 \
    --served-model-name v7-coder \
    --port 8000 \
    --gpu-memory-utilization 0.92 \
    --max-model-len 65536 \
    --max-num-batched-tokens 8192 \
    --dtype bfloat16 \
    --trust-remote-code \
    --reasoning-parser gemma4 \
    --default-chat-template-kwargs '{"enable_thinking": true}'

Gemma 4 gotchas (all required): --max-num-batched-tokens 8192 (the MM-encoder budget; the default 2048 < max_tokens_per_mm_item and crashes at boot), --max-model-len 65536 (templates may request up to ~49k gen tokens), and the gemma4 reasoning parser for the thinking format. NVFP4 needs a Blackwell/Hopper or Ada GPU on a recent vLLM.

Benchmarks

NVFP4A16 is a deployment format and is not separately benchmarked (cohort policy). The table is the cohort's same-host Q6_K · llama.cpp · greedy reference (temperature 0.0, top_p 1.0, top_k 0), read from summary.json; it is representative of this model's quality. Row-max in bold. This repo = v7-coder.

Benchmark	128e (unpruned)	v6-coder	v7-coder	v7-coderx
GPQA-diamond (198q)	67.17	61.11	70.71	48.48
AIME (30q)	73.33	56.67	76.67	70.00
MATH500 (100q)	92.00	89.00	92.00	89.00
GSM8K (100q)	89.00	88.00	93.00	91.00
ARC-Challenge (full)	96.50	95.39	94.80	94.28
IFEval (100q, strict)	97.00	92.00	95.00	95.00
HumanEval (164)	97.56	98.17	98.78	95.73
HumanEval+ (164)	92.07	92.68	92.68	92.68
LCB-medium-55	96.36	92.73	96.36	98.18
LCB-medium-100	97.00	94.00	97.00	99.00
MultiPL-E (100)	90.00	89.00	88.67	90.00

_{Metrics: GPQA & GSM8K = exact_match flexible-extract · MATH500 = math_verify ·
ARC & AIME = exact_match · IFEval = prompt_level_strict_acc · HumanEval/+ = pass@1
chat-extract · LCB-55/100 & MultiPL-E = pass@1. 128e uses the lcb_medium_55/100
templates; the prunes use lcb_medium_*_v4 (corrected harness, equivalent task).}

F16 — unquantized precision reference

The same 11 benches on the unquantized F16 GGUF (llama.cpp, greedy, same host; LCB caps are intrinsic — a 24k-budget retry moved no cell). F16 (~41 GB) is a precision anchor, not a shipped tier — deploy the quants above. v6-coder has no F16 run. Row-max across the three F16 columns in bold; ↓ = below that model's own Q6_K.

Benchmark	128e (F16)	v7-coder (F16)	v7-coderx (F16)
GPQA-diamond (198q)	71.21	73.74	51.01
AIME (30q)	76.67	76.67	76.67
MATH500 (100q)	94.00	91.00↓	94.00
GSM8K (100q)	91.00	91.00↓	90.00↓
ARC-Challenge (full)	96.42↓	94.71↓	93.60↓
IFEval (100q, strict)	97.00	96.00	92.00↓
HumanEval (164)	96.95↓	98.17↓	96.34
HumanEval+ (164)	90.85↓	92.68	90.24↓
LCB-medium-55	90.91↓	92.73↓	90.91↓
LCB-medium-100	92.00↓	93.00↓	92.00↓
MultiPL-E (100)	90.33	88.67	87.33↓

_{F16 = unquantized f16 GGUF, same greedy recipe and same host as the Q6_K table. A ↓ marks an F16 cell that scores below that model's own Q6_K — a small-margin greedy weight-precision divergence in which Q6_K mildly regularizes the decode tail. It is broadest and largest on v7-coderx (the most aggressively code-pruned variant: HE+, MultiPL-E, LCB) and is an intrinsic property, not a build defect — the F16 bins are bit-identical to fresh F16 rebuilt from the canonical bf16. Deploy and rank on the shipped quant (Q6_K and below), where v7-coderx ≥ v7-coder on code as designed; F16 is a reference only.}

Recipe (summary)

98e prune from 128e via the fs2440 code recipe (generic_code 3×, targeted_lcb_medium_55 2×, [24,40] per-layer floor, strategy=max + breadth_bonus) plus targeted_gpqa 1.5×, then the mandatory shared-FFN α=1.2 upweight, then NVFP4A16 quantization. Full recipe, calibration-class table, and the Q6_K provenance are on the bf16 card.

Intended use & limitations

A compact, vLLM-deployable Gemma 4 checkpoint for agentic coding and code reasoning that also needs solid graduate science — the science-augmented member of the v7-coder cohort. For maximal raw coding throughput with science at baseline, use v7-coderx. A research prune, not an official Google release; generic_multilingual is de-weighted (0×). GPQA/AIME/GSM8K parity-with-128e is measured on small benches (30–198q) at greedy and carries run-to-run variance. For llama.cpp/CPU deployment use the GGUF repo.

Lineage

128e → (v4 → v5 → v6-coder code line) → v7 competence-map rebuild → fs2440 code floor → +targeted_gpqa 1.5× = v7-coder → NVFP4A16. Built and evaluated on the omnimergekit toolchain.

Downloads last month: 47

Safetensors

Model size

11B params

Tensor type

BF16

F8_E4M3

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ManniX-ITA/gemma-4-A4B-98e-v7-coder-NVFP4A16

Base model

google/gemma-4-26B-A4B

Finetuned

google/gemma-4-26B-A4B-it

Finetuned

ManniX-ITA/gemma-4-A4B-98e-v7-coder-it

Quantized

(2)

this model