Gemma 4 A4B 98-Expert v7-coderx — code-maximal prune (~20.8B)

Eval complete (Q6_K / llama.cpp, greedy, same host). Every cell in the scoreboard is read from summary.json under the cohort-pinned greedy recipe (temperature 0.0, top_p 1.0, top_k 0). The 128e and v6-coder columns are the matching same-host Q6_K runs. The GGUF and NVFP4A16 formats are deployment targets and are not separately benchmarked (cohort policy) — the Q6_K column is representative.

Headline — the strongest coder in the cohort. v7-coderx spends its whole prune budget on code: LCB-medium-55 98.18% and LCB-medium-100 99.0% — the highest of any Gemma-4 prune to date and +1.8pp / +2.0pp past the unpruned 128e (96.36 / 97.0 on the same Q6_K run) — plus MultiPL-E 90.0%, HumanEval+ 92.68%, IFEval 95%. The trade is graduate science: GPQA-diamond sits at 48.48% (this recipe carries no targeted_gpqa term). If you need the science back without giving up the code profile, use the sibling v7-coder (GPQA 70.71%, LCB-55 96.36%).

A research checkpoint that prunes the unpruned Gemma 4 26B-A4B-it (128 experts/layer, top-8 + shared, 30 layers) down to 98 experts per layer. The fs2440 drop map concentrates protection on generic-code (3×) and LiveCodeBench-medium (2×) on a [24,40] per-layer floor, with no science or multilingual targeting — the code-maximal member of the v7-coder cohort. Same 98e shape, same router, same attention, same norms as the rest of the cohort, plus the mandatory shared-FFN α=1.2 upweight all coder variants carry.

Quantized formats

Format	Repo	Notes
bf16 (this repo)	`…-v7-coderx-it`	9 shards. fs2440 drop map + shared α=1.2.
GGUF (llama.cpp / ollama)	`…-v7-coderx-it-GGUF`	Bartowski tier sweep (imatrix K-quants) + ContribDynamic CD-* per-layer quants + F16 + `imatrix.dat` + `mmproj`.
NVFP4A16 (vLLM)	`…-v7-coderx-NVFP4A16`	Native vLLM 4-bit + FP8 block scales, via NVIDIA `modelopt` main (0.45.0.dev, `_QuantFusedExperts`). ~13 GB. Deployment format — not separately benchmarked.
Ollama	`mannix/gemma4-98e-v7-coderx`	`ollama pull mannix/gemma4-98e-v7-coderx:<tier>` (`:latest` = Q4_K_M; `:vision-<tier>` adds the SigLIP vision tower).

Benchmarks

Q6_K · llama.cpp · greedy (temperature 0.0, top_p 1.0, top_k 0), all four models scored on the same host from summary.json. Row-max in bold. This repo = v7-coderx.

Benchmark	128e (unpruned)	v6-coder	v7-coder	v7-coderx
GPQA-diamond (198q)	67.17	61.11	70.71	48.48
AIME (30q)	73.33	56.67	76.67	70.00
MATH500 (100q)	92.00	89.00	92.00	89.00
GSM8K (100q)	89.00	88.00	93.00	91.00
ARC-Challenge (full)	96.50	95.39	94.80	94.28
IFEval (100q, strict)	97.00	92.00	95.00	95.00
HumanEval (164)	97.56	98.17	98.78	95.73
HumanEval+ (164)	92.07	92.68	92.68	92.68
LCB-medium-55	96.36	92.73	96.36	98.18
LCB-medium-100	97.00	94.00	97.00	99.00
MultiPL-E (100)	90.00	89.00	88.67	90.00

_{Metrics: GPQA & GSM8K = exact_match flexible-extract · MATH500 = math_verify ·
ARC & AIME = exact_match · IFEval = prompt_level_strict_acc · HumanEval/+ = pass@1
chat-extract · LCB-55/100 & MultiPL-E = pass@1. 128e uses the lcb_medium_55/100
templates; the prunes use lcb_medium_*_v4 (corrected harness, equivalent task).}

Every code/instruction axis is at the top of the cohort; the budget is paid almost entirely on graduate science, which carries no protection term in this recipe.

Coder-field comparison — v7-coderx vs Qwen2.5-Coder-14B / 7B + Qwen3.5-9B (Q6_K, llama.cpp, greedy)

The 9 canonical benches + MultiPL-E-100, all on the identical llama.cpp Q6_K / greedy recipe (reasoning models served with --reasoning-format deepseek --reasoning-budget 12288 --parallel 2). Architectures differ — this is a same-harness comparison, not a same-class one:

v7-coderx — Gemma-4 26B-A4B MoE pruned to 98 experts (~20.8B total, ~A4B active), reasoning.
Qwen2.5-Coder-14B / 7B-Instruct — dense, non-reasoning code specialists (bartowski Q6_K).
Qwen3.5-9B — dense reasoning model (bartowski Q6_K).

Bench (n)	v7-coderx Q6_K	Qwen2.5-Coder-14B	Qwen2.5-Coder-7B	Qwen3.5-9B
ARC-Challenge-chat (1172)	94.28%	90.53%	85.58%	96.76%
GPQA Diamond flex (198)	48.48%	34.85%	26.26%	73.74%
GSM8K-100 flex	91.00%	89.00%	80.00%	79.00%
MATH-500-100 math_verify	89.00%	62.00%	66.00%	59.00%
AIME 2024 (30)	70.00%	10.00%	10.00%	56.67%
IFEval-100 (prompt_strict)	95.00%	68.00%	54.00%	93.00%
HumanEval-164 chat	95.73%	90.85%	87.20%	89.02%
HumanEval+-164 chat	92.68%	84.76% †	83.54%	80.49%
LCB-medium-55 v4	98.18%	18.18% †	12.73%	58.18%
MultiPL-E-100 (macro)	90.00%	84.67%	80.67%	80.33%

† Qwen2.5-Coder-14B HumanEval+ / LCB-medium-55 are the same-stack GGUF HE+ sweep numbers (not re-run in this chain). All Qwen cells are the same-host reference runs used on the v6-coder card — Qwen is a fixed reference, so the columns are identical across the cohort; only the Gemma column changes.

Note on Qwen3.5-9B. Qwen3.5-9B is a verbose, slow thinking model: it emits long <think> reasoning chains (often ≥1900 tokens even on a trivial GSM8K question), so it runs several× slower per question than the non-reasoning Qwen2.5-Coder models — well beyond what its 9B size would suggest. Its GSM8K / MATH-500 / GPQA cells were re-run after a harness fix (under batched, reasoning-parsed serving the verbose thinking intermittently left the final answer inside the reasoning block, mis-scored as empty content).

Answer-length analysis (anti-rumination)

The pruned reasoning model thinks with a bounded thinking_token_budget=12288; the question is whether that length is productive (long thinking that PASSes) or rumination (long thinking that fails). Per-problem completion length is measured from omk_eval token_stats (characters from the raw completion; tokens via the 128e tokenizer) on the real-n benches, against 128e and v6-coder on the same problems, same greedy Q6_K / llama.cpp stack.

Per-problem completion length — characters (p50 / p90 / max):

Bench (n)	128e	v6-coder	v7-coderx
GPQA Diamond (198)	2571/16136/27811	2582/16100/25243	2627/19582/40946
AIME 2024 (30)	1963/7748/8680	2141/7469/9433	2095/8987/12815
LCB-medium-55	3734/16430/36462	31015/36260/43278	30193/36297/41168
LCB-medium-100	2056/15467/48569	29384/35389/43633	29429/35973/41168
MultiPL-E-100 (300)	245/566/3353	245/573/2725	246/617/2933
MATH-500 (100)	1083/1873/7899	1089/2025/9236	1113/1953/8548
GSM8K (100)	294/746/25989	283/780/11378	274/779/13867
IFEval (100)	877/3755/8263	855/3489/20908	732/3210/6633
HumanEval (164)	698/1284/5354	711/1438/5954	743/1412/16967
HumanEval+ (164)	714/1461/3289	694/1390/5282	743/1359/3150
ARC-Challenge (1172)	1210/1633/6254	1221/1674/48886	1234/1720/54956

Per-problem completion length — tokens (p50 / p90 / max):

Bench (n)	128e	v6-coder	v7-coderx
GPQA Diamond (198)	843/8189/8189	879/8189/8189	890/8189/8189
AIME 2024 (30)	933/3994/4021	946/3993/4011	954/3997/4021
LCB-medium-55	1005/5622/16022	12818/13318/15976	12820/13163/15667
LCB-medium-100	542/5353/16022	12740/13212/15976	12735/13016/15667
MultiPL-E-100 (300)	84/171/1013	85/184/965	84/188/871
MATH-500 (100)	431/895/3377	424/863/3377	443/929/3337
GSM8K (100)	131/271/8853	129/276/4687	119/266/5128
IFEval (100)	219/850/1561	222/797/3898	177/768/1231
HumanEval (164)	226/431/1611	226/448/2084	236/440/5520
HumanEval+ (164)	226/455/996	224/437/2040	233/443/1332
ARC-Challenge (1172)	258/355/1417	259/365/16266	263/374/16276

Budget-saturation incidence — share of problems whose completion reached ≥12k tokens (at/near the thinking_token_budget=12288 cap). Saturation by itself is not rumination — a saturated output that PASSes is productive use of the budget; the pruned reasoning model saturates on nearly every LCB problem, 128e almost never does.

Bench (n)	128e	v6-coder	v7-coderx
LCB-medium-55	1 / 55 (1.8%)	54 / 55 (98.2%)	54 / 55 (98.2%)
LCB-medium-100	2 / 100 (2.0%)	98 / 100 (98.0%)	96 / 100 (96.0%)

Rumination — long thinking that fails to PASS. The right metric is not median length (128e looks short only because it answers easy problems fast). It is the share of the model's budget-saturated outputs that still fail — tokens burned without a correct answer:

Bench (n)	128e	v6-coder	v7-coderx
LCB-medium-55 — saturated-and-failed	1 / 1 (100.0%)	4 / 54 (7.4%)	1 / 54 (1.9%)
LCB-medium-100 — saturated-and-failed	2 / 2 (100.0%)	6 / 98 (6.1%)	1 / 96 (1.0%)
LCB-100 — mean completion tokens, PASS vs FAIL	1392 vs 13782	12698 vs 15051	12649 vs 13289

Key findings:

128e only thinks long when it is lost. Every 128e output that reaches the budget cap is a failure (1/1 on LCB-55, 2/2 on LCB-100), and its failed problems run several× longer than its passed ones (mean 13782 vs 1392 tok on LCB-100).
v7-coderx's long thinking is overwhelmingly productive. It saturates on ~96% of LCB-100 problems but only 1/96 of those saturated outputs fail (1.0%); its PASS and FAIL completions are nearly the same length (mean 12649 vs 13289 tok), so failures are not driven by extra rumination. On LCB-55 it is 1/54 saturated-and-failed.
At or below v6-coder's rumination rate. v6-coder ran 4/54 (LCB-55) and 6/98 (LCB-100) saturated-and-failed; v7-coderx matches or improves on both.
Non-LCB benches stay tight. On the short-answer benches (GSM8K / MATH-500 / HE / HE+ / MultiPL-E) p50/p90 length tracks 128e and v6-coder within a few tokens — the targeted prune did not trade length for accuracy on the everyday benches.

Methodology. Per-problem lengths come from omk_eval token_stats over each bench's samples_*.jsonl / lcb_result.samples.jsonl; saturation/PASS-FAIL is computed per problem from completion_tokens + passed. MultiPL-E measures code length, not reasoning (its samples store only the final code block, no <think> trace), so it is a code-conciseness reference rather than a thinking-length signal.

At a glance

	128e (base)	v7-coderx	v7-coder (sibling)
Total params	~26B	~20.8B	~20.8B
Active / token	~4B (top-8 + shared)	~4B	~4B
Experts / layer	128	98 (30 dropped)	98 (30 dropped)
Per-layer floor	—	[24, 40]	[24, 40]
Science targeting	—	off	`targeted_gpqa` 1.5×
Shared FFN α	1.0	1.2 (`mlp.down_proj`)	1.2
Built from	—	128e original (fresh prune)	128e original

Recipe

The drop map is produced by generate_drop_map_v5.py (omnimergekit) from per-expert, per-class contribution scores on the rebuilt v7 competence maps (expert_neuron_v7_code_gpqa.json — 10 classes, audited producers, multilingual category included), then applied with expert_drop.py, then the shared expert is upweighted.

1. fs2440 base recipe

target        = 98          # 30 experts/layer dropped
protect_top   = 16          # 16 highest-scoring experts/layer never dropped
alpha         = 2.0         # contribution sharpening exponent
strategy      = max         # per-expert score = MAX over classes (not mean/geomean)
normalize     = rank        # rank-normalize within each (layer, class)
breadth_bonus = 0.5         # reward experts useful across many classes (anti-overfit)
v4_floor_map  = v4_layer_floor_map_v7.json     # per-layer keep floor
v4_floor_data = expert_neuron_base_v7.json
v4_floor_clamp = [24, 40]   # floor bounded into this band per layer
outlier_mode  = median      # clamp bf16 weight-norm artifacts to layer median
outlier_wnorm_thresh = 1e4
baseline      = teacher_force_98e_p16_clean.json   # tie-break anchor

strategy=max + breadth_bonus is the load-bearing pair — it favours experts strongly useful to at least one class and broadly useful across classes, the optimizer-off-manifold lesson encoded as a recipe. The [24,40] floor is the 98e-scaled analogue of the 62e [15,25] band that won the loop-floor study, and beats [20,35] by ~3.6pp LCB-55 for a coder.

2. Calibration class weights — code only

Ten contribution classes are scored; the weights steer which specialists survive. v7-coderx zeroes every non-code targeting term:

Class	v7-coderx	v7-coder
generic_math	1	1
generic_logic	1	1
generic_code	3	3
generic_science	1	1
generic_creative	1	1
generic_multilingual	0	0
targeted_humaneval	0	0
targeted_humanevalplus	0	0
targeted_lcb_medium_55	2	2
targeted_gpqa	0	1.5

HE/HE+ targeting is off because both already sit at/above the un-targeted baseline; the protection budget goes to LiveCodeBench-medium, the bench where pruning hurt most on earlier variants. v7-coderx is exactly v7-coder minus the targeted_gpqa term — the two share most of their keep set, which is why HE+ and IFEval match to the point and only LCB-55 (up) and GPQA (down) move.

3. Mandatory shared-FFN α=1.2 (cohort rule)

After expert drop, router_shared_upweight.py --alpha 1.2 --target mlp.down_proj.weight upweights Gemma 4's always-on shared FFN. Every coder variant carries this; omitting it yields the "weak / ruminating" pre-shared baseline and makes cross-variant comparison unfair. A .shared_applied marker records it.

Intended use

A compact (~12–13 GB at Q4_K_M / NVFP4A16, fits a single 12–16 GB GPU) Gemma 4 checkpoint for maximal coding throughput and instruction-following — the code-extreme (x) member of the v7-coder cohort. If your workload also needs strong graduate science, use v7-coder, which trades ~1.8pp LCB-55 for ~+22pp GPQA.

Inherits Gemma 4's thinking format — serve with the reasoning parser enabled (--reasoning-parser gemma4 on vLLM; --reasoning-format deepseek --reasoning-budget 8192 on llama-server).

Limitations

A research prune, not an official Google release. Expert pruning trades breadth for size: generic_multilingual is de-weighted (0×) and graduate science (GPQA) is the explicit budget axis — at 48.48% it is well below the unpruned 128e (67.17% on the same Q6_K run). Choose v7-coder if science matters. Quality below ~Q3 / 3-bit degrades on the Gemma 4 MoE — prefer Q4_K_M or higher for production. The GGUF and NVFP4A16 formats are provided for deployment but are not separately benchmarked.

Lineage

128e → (v4 → v5 → v6-coder code line) → v7 competence-map rebuild → fs2440 code floor = v7-coderx. Built and evaluated on the omnimergekit toolchain.

Downloads last month: 47

Safetensors

Model size

20B params

Tensor type

BF16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ManniX-ITA/gemma-4-A4B-98e-v7-coderx-it

Base model

google/gemma-4-26B-A4B

Finetuned

google/gemma-4-26B-A4B-it

Finetuned

(104)

this model

Quantizations

2 models