gemma-4-A4B-98e-v7-coder — NVFP4A16 (vLLM)

NVFP4A16 quantization of ManniX-ITA/gemma-4-A4B-98e-v7-coder-it, the science-augmented code prune of Gemma 4 26B-A4B (128→98 experts/layer, ~20.8B). 2 shards, ~13.4 GB — fits a single 16 GB GPU with room for KV-cache.

NVFP4A16 = 4-bit NVFP4 weights with FP8 (e4m3) block scales and bf16 activations (A16). Quantized with NVIDIA TensorRT-Model-Optimizer (modelopt) main (0.45.0.dev), whose _QuantFusedExperts plugin handles Gemma 4's fused MoE experts. Every exported weight/scale tensor is finiteness-verified before upload.

Serve with vLLM

python -m vllm.entrypoints.openai.api_server \
    --model ManniX-ITA/gemma-4-A4B-98e-v7-coder-NVFP4A16 \
    --served-model-name v7-coder \
    --port 8000 \
    --gpu-memory-utilization 0.92 \
    --max-model-len 65536 \
    --max-num-batched-tokens 8192 \
    --dtype bfloat16 \
    --trust-remote-code \
    --reasoning-parser gemma4 \
    --default-chat-template-kwargs '{"enable_thinking": true}'

Gemma 4 gotchas (all required): --max-num-batched-tokens 8192 (the MM-encoder budget; the default 2048 < max_tokens_per_mm_item and crashes at boot), --max-model-len 65536 (templates may request up to ~49k gen tokens), and the gemma4 reasoning parser for the thinking format. NVFP4 needs a Blackwell/Hopper or Ada GPU on a recent vLLM.

Benchmarks

NVFP4A16 is a deployment format and is not separately benchmarked (cohort policy). The table is the cohort's same-host Q6_K · llama.cpp · greedy reference (temperature 0.0, top_p 1.0, top_k 0), read from summary.json; it is representative of this model's quality. Row-max in bold. This repo = v7-coder.

Benchmark 128e (unpruned) v6-coder v7-coder v7-coderx
GPQA-diamond (198q) 67.17 61.11 70.71 48.48
AIME (30q) 73.33 56.67 76.67 70.00
MATH500 (100q) 92.00 89.00 92.00 89.00
GSM8K (100q) 89.00 88.00 93.00 91.00
ARC-Challenge (full) 96.50 95.39 94.80 94.28
IFEval (100q, strict) 97.00 92.00 95.00 95.00
HumanEval (164) 97.56 98.17 98.78 95.73
HumanEval+ (164) 92.07 92.68 92.68 92.68
LCB-medium-55 96.36 92.73 96.36 98.18
LCB-medium-100 97.00 94.00 97.00 99.00
MultiPL-E (100) 90.00 89.00 88.67 90.00

Metrics: GPQA & GSM8K = exact_match flexible-extract · MATH500 = math_verify · ARC & AIME = exact_match · IFEval = prompt_level_strict_acc · HumanEval/+ = pass@1 chat-extract · LCB-55/100 & MultiPL-E = pass@1. 128e uses the lcb_medium_55/100 templates; the prunes use lcb_medium_*_v4 (corrected harness, equivalent task).

F16 — unquantized precision reference

The same 11 benches on the unquantized F16 GGUF (llama.cpp, greedy, same host; LCB caps are intrinsic — a 24k-budget retry moved no cell). F16 (~41 GB) is a precision anchor, not a shipped tier — deploy the quants above. v6-coder has no F16 run. Row-max across the three F16 columns in bold; = below that model's own Q6_K.

Benchmark 128e (F16) v7-coder (F16) v7-coderx (F16)
GPQA-diamond (198q) 71.21 73.74 51.01
AIME (30q) 76.67 76.67 76.67
MATH500 (100q) 94.00 91.00↓ 94.00
GSM8K (100q) 91.00 91.00 90.00↓
ARC-Challenge (full) 96.42 94.71↓ 93.60↓
IFEval (100q, strict) 97.00 96.00 92.00↓
HumanEval (164) 96.95↓ 98.17 96.34
HumanEval+ (164) 90.85↓ 92.68 90.24↓
LCB-medium-55 90.91↓ 92.73 90.91↓
LCB-medium-100 92.00↓ 93.00 92.00↓
MultiPL-E (100) 90.33 88.67 87.33↓

F16 = unquantized f16 GGUF, same greedy recipe and same host as the Q6_K table. A marks an F16 cell that scores below that model's own Q6_K — a small-margin greedy weight-precision divergence in which Q6_K mildly regularizes the decode tail. It is broadest and largest on v7-coderx (the most aggressively code-pruned variant: HE+, MultiPL-E, LCB) and is an intrinsic property, not a build defect — the F16 bins are bit-identical to fresh F16 rebuilt from the canonical bf16. Deploy and rank on the shipped quant (Q6_K and below), where v7-coderx ≥ v7-coder on code as designed; F16 is a reference only.

Recipe (summary)

98e prune from 128e via the fs2440 code recipe (generic_code 3×, targeted_lcb_medium_55 2×, [24,40] per-layer floor, strategy=max + breadth_bonus) plus targeted_gpqa 1.5×, then the mandatory shared-FFN α=1.2 upweight, then NVFP4A16 quantization. Full recipe, calibration-class table, and the Q6_K provenance are on the bf16 card.

Intended use & limitations

A compact, vLLM-deployable Gemma 4 checkpoint for agentic coding and code reasoning that also needs solid graduate science — the science-augmented member of the v7-coder cohort. For maximal raw coding throughput with science at baseline, use v7-coderx. A research prune, not an official Google release; generic_multilingual is de-weighted (0×). GPQA/AIME/GSM8K parity-with-128e is measured on small benches (30–198q) at greedy and carries run-to-run variance. For llama.cpp/CPU deployment use the GGUF repo.

Lineage

128e → (v4 → v5 → v6-coder code line) → v7 competence-map rebuild → fs2440 code floor → +targeted_gpqa 1.5× = v7-coder → NVFP4A16. Built and evaluated on the omnimergekit toolchain.

Downloads last month
47
Safetensors
Model size
11B params
Tensor type
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ManniX-ITA/gemma-4-A4B-98e-v7-coder-NVFP4A16

Quantized
(2)
this model