Gemma 4 A4B 98-Expert v7-coderx — code-maximal prune (~20.8B)
Eval complete (Q6_K / llama.cpp, greedy, same host). Every cell in the scoreboard is read from
summary.jsonunder the cohort-pinned greedy recipe (temperature 0.0,top_p 1.0,top_k 0). The 128e and v6-coder columns are the matching same-host Q6_K runs. The GGUF and NVFP4A16 formats are deployment targets and are not separately benchmarked (cohort policy) — the Q6_K column is representative.Headline — the strongest coder in the cohort. v7-coderx spends its whole prune budget on code: LCB-medium-55 98.18% and LCB-medium-100 99.0% — the highest of any Gemma-4 prune to date and +1.8pp / +2.0pp past the unpruned 128e (96.36 / 97.0 on the same Q6_K run) — plus MultiPL-E 90.0%, HumanEval+ 92.68%, IFEval 95%. The trade is graduate science: GPQA-diamond sits at 48.48% (this recipe carries no
targeted_gpqaterm). If you need the science back without giving up the code profile, use the sibling v7-coder (GPQA 70.71%, LCB-55 96.36%).
A research checkpoint that prunes the unpruned
Gemma 4 26B-A4B-it
(128 experts/layer, top-8 + shared, 30 layers) down to 98 experts per layer.
The fs2440 drop map concentrates protection on generic-code (3×) and
LiveCodeBench-medium (2×) on a [24,40] per-layer floor, with no science or
multilingual targeting — the code-maximal member of the v7-coder cohort. Same 98e
shape, same router, same attention, same norms as the rest of the cohort, plus the
mandatory shared-FFN α=1.2 upweight all coder variants carry.
Quantized formats
| Format | Repo | Notes |
|---|---|---|
| bf16 (this repo) | …-v7-coderx-it |
9 shards. fs2440 drop map + shared α=1.2. |
| GGUF (llama.cpp / ollama) | …-v7-coderx-it-GGUF |
Bartowski tier sweep (imatrix K-quants) + ContribDynamic CD-* per-layer quants + F16 + imatrix.dat + mmproj. |
| NVFP4A16 (vLLM) | …-v7-coderx-NVFP4A16 |
Native vLLM 4-bit + FP8 block scales, via NVIDIA modelopt main (0.45.0.dev, _QuantFusedExperts). ~13 GB. Deployment format — not separately benchmarked. |
| Ollama | mannix/gemma4-98e-v7-coderx |
ollama pull mannix/gemma4-98e-v7-coderx:<tier> (:latest = Q4_K_M; :vision-<tier> adds the SigLIP vision tower). |
Benchmarks
Q6_K · llama.cpp · greedy (temperature 0.0, top_p 1.0, top_k 0), all four
models scored on the same host from summary.json. Row-max in bold.
This repo = v7-coderx.
| Benchmark | 128e (unpruned) | v6-coder | v7-coder | v7-coderx |
|---|---|---|---|---|
| GPQA-diamond (198q) | 67.17 | 61.11 | 70.71 | 48.48 |
| AIME (30q) | 73.33 | 56.67 | 76.67 | 70.00 |
| MATH500 (100q) | 92.00 | 89.00 | 92.00 | 89.00 |
| GSM8K (100q) | 89.00 | 88.00 | 93.00 | 91.00 |
| ARC-Challenge (full) | 96.50 | 95.39 | 94.80 | 94.28 |
| IFEval (100q, strict) | 97.00 | 92.00 | 95.00 | 95.00 |
| HumanEval (164) | 97.56 | 98.17 | 98.78 | 95.73 |
| HumanEval+ (164) | 92.07 | 92.68 | 92.68 | 92.68 |
| LCB-medium-55 | 96.36 | 92.73 | 96.36 | 98.18 |
| LCB-medium-100 | 97.00 | 94.00 | 97.00 | 99.00 |
| MultiPL-E (100) | 90.00 | 89.00 | 88.67 | 90.00 |
Metrics: GPQA & GSM8K = exact_match flexible-extract · MATH500 = math_verify ·
ARC & AIME = exact_match · IFEval = prompt_level_strict_acc · HumanEval/+ = pass@1
chat-extract · LCB-55/100 & MultiPL-E = pass@1. 128e uses the lcb_medium_55/100
templates; the prunes use lcb_medium_*_v4 (corrected harness, equivalent task).
Every code/instruction axis is at the top of the cohort; the budget is paid almost entirely on graduate science, which carries no protection term in this recipe.
Coder-field comparison — v7-coderx vs Qwen2.5-Coder-14B / 7B + Qwen3.5-9B (Q6_K, llama.cpp, greedy)
The 9 canonical benches + MultiPL-E-100, all on the identical llama.cpp Q6_K / greedy
recipe (reasoning models served with --reasoning-format deepseek --reasoning-budget 12288 --parallel 2). Architectures differ — this is a same-harness comparison, not a same-class one:
- v7-coderx — Gemma-4 26B-A4B MoE pruned to 98 experts (~20.8B total, ~A4B active), reasoning.
- Qwen2.5-Coder-14B / 7B-Instruct — dense, non-reasoning code specialists (bartowski Q6_K).
- Qwen3.5-9B — dense reasoning model (bartowski Q6_K).
| Bench (n) | v7-coderx Q6_K | Qwen2.5-Coder-14B | Qwen2.5-Coder-7B | Qwen3.5-9B |
|---|---|---|---|---|
| ARC-Challenge-chat (1172) | 94.28% | 90.53% | 85.58% | 96.76% |
| GPQA Diamond flex (198) | 48.48% | 34.85% | 26.26% | 73.74% |
| GSM8K-100 flex | 91.00% | 89.00% | 80.00% | 79.00% |
| MATH-500-100 math_verify | 89.00% | 62.00% | 66.00% | 59.00% |
| AIME 2024 (30) | 70.00% | 10.00% | 10.00% | 56.67% |
| IFEval-100 (prompt_strict) | 95.00% | 68.00% | 54.00% | 93.00% |
| HumanEval-164 chat | 95.73% | 90.85% | 87.20% | 89.02% |
| HumanEval+-164 chat | 92.68% | 84.76% † | 83.54% | 80.49% |
| LCB-medium-55 v4 | 98.18% | 18.18% † | 12.73% | 58.18% |
| MultiPL-E-100 (macro) | 90.00% | 84.67% | 80.67% | 80.33% |
† Qwen2.5-Coder-14B HumanEval+ / LCB-medium-55 are the same-stack GGUF HE+ sweep numbers (not re-run in this chain). All Qwen cells are the same-host reference runs used on the v6-coder card — Qwen is a fixed reference, so the columns are identical across the cohort; only the Gemma column changes.
Note on Qwen3.5-9B. Qwen3.5-9B is a verbose, slow thinking model: it emits long
<think>reasoning chains (often ≥1900 tokens even on a trivial GSM8K question), so it runs several× slower per question than the non-reasoning Qwen2.5-Coder models — well beyond what its 9B size would suggest. Its GSM8K / MATH-500 / GPQA cells were re-run after a harness fix (under batched, reasoning-parsed serving the verbose thinking intermittently left the final answer inside the reasoning block, mis-scored as empty content).
Answer-length analysis (anti-rumination)
The pruned reasoning model thinks with a bounded thinking_token_budget=12288; the
question is whether that length is productive (long thinking that PASSes) or
rumination (long thinking that fails). Per-problem completion length is measured from
omk_eval token_stats (characters from the raw completion; tokens via the 128e tokenizer)
on the real-n benches, against 128e and v6-coder on the same problems, same greedy
Q6_K / llama.cpp stack.
Per-problem completion length — characters (p50 / p90 / max):
| Bench (n) | 128e | v6-coder | v7-coderx |
|---|---|---|---|
| GPQA Diamond (198) | 2571/16136/27811 | 2582/16100/25243 | 2627/19582/40946 |
| AIME 2024 (30) | 1963/7748/8680 | 2141/7469/9433 | 2095/8987/12815 |
| LCB-medium-55 | 3734/16430/36462 | 31015/36260/43278 | 30193/36297/41168 |
| LCB-medium-100 | 2056/15467/48569 | 29384/35389/43633 | 29429/35973/41168 |
| MultiPL-E-100 (300) | 245/566/3353 | 245/573/2725 | 246/617/2933 |
| MATH-500 (100) | 1083/1873/7899 | 1089/2025/9236 | 1113/1953/8548 |
| GSM8K (100) | 294/746/25989 | 283/780/11378 | 274/779/13867 |
| IFEval (100) | 877/3755/8263 | 855/3489/20908 | 732/3210/6633 |
| HumanEval (164) | 698/1284/5354 | 711/1438/5954 | 743/1412/16967 |
| HumanEval+ (164) | 714/1461/3289 | 694/1390/5282 | 743/1359/3150 |
| ARC-Challenge (1172) | 1210/1633/6254 | 1221/1674/48886 | 1234/1720/54956 |
Per-problem completion length — tokens (p50 / p90 / max):
| Bench (n) | 128e | v6-coder | v7-coderx |
|---|---|---|---|
| GPQA Diamond (198) | 843/8189/8189 | 879/8189/8189 | 890/8189/8189 |
| AIME 2024 (30) | 933/3994/4021 | 946/3993/4011 | 954/3997/4021 |
| LCB-medium-55 | 1005/5622/16022 | 12818/13318/15976 | 12820/13163/15667 |
| LCB-medium-100 | 542/5353/16022 | 12740/13212/15976 | 12735/13016/15667 |
| MultiPL-E-100 (300) | 84/171/1013 | 85/184/965 | 84/188/871 |
| MATH-500 (100) | 431/895/3377 | 424/863/3377 | 443/929/3337 |
| GSM8K (100) | 131/271/8853 | 129/276/4687 | 119/266/5128 |
| IFEval (100) | 219/850/1561 | 222/797/3898 | 177/768/1231 |
| HumanEval (164) | 226/431/1611 | 226/448/2084 | 236/440/5520 |
| HumanEval+ (164) | 226/455/996 | 224/437/2040 | 233/443/1332 |
| ARC-Challenge (1172) | 258/355/1417 | 259/365/16266 | 263/374/16276 |
Budget-saturation incidence — share of problems whose completion reached ≥12k tokens
(at/near the thinking_token_budget=12288 cap). Saturation by itself is not rumination —
a saturated output that PASSes is productive use of the budget; the pruned reasoning model
saturates on nearly every LCB problem, 128e almost never does.
| Bench (n) | 128e | v6-coder | v7-coderx |
|---|---|---|---|
| LCB-medium-55 | 1 / 55 (1.8%) | 54 / 55 (98.2%) | 54 / 55 (98.2%) |
| LCB-medium-100 | 2 / 100 (2.0%) | 98 / 100 (98.0%) | 96 / 100 (96.0%) |
Rumination — long thinking that fails to PASS. The right metric is not median length (128e looks short only because it answers easy problems fast). It is the share of the model's budget-saturated outputs that still fail — tokens burned without a correct answer:
| Bench (n) | 128e | v6-coder | v7-coderx |
|---|---|---|---|
| LCB-medium-55 — saturated-and-failed | 1 / 1 (100.0%) | 4 / 54 (7.4%) | 1 / 54 (1.9%) |
| LCB-medium-100 — saturated-and-failed | 2 / 2 (100.0%) | 6 / 98 (6.1%) | 1 / 96 (1.0%) |
| LCB-100 — mean completion tokens, PASS vs FAIL | 1392 vs 13782 | 12698 vs 15051 | 12649 vs 13289 |
Key findings:
- 128e only thinks long when it is lost. Every 128e output that reaches the budget cap is a failure (1/1 on LCB-55, 2/2 on LCB-100), and its failed problems run several× longer than its passed ones (mean 13782 vs 1392 tok on LCB-100).
- v7-coderx's long thinking is overwhelmingly productive. It saturates on ~96% of LCB-100 problems but only 1/96 of those saturated outputs fail (1.0%); its PASS and FAIL completions are nearly the same length (mean 12649 vs 13289 tok), so failures are not driven by extra rumination. On LCB-55 it is 1/54 saturated-and-failed.
- At or below v6-coder's rumination rate. v6-coder ran 4/54 (LCB-55) and 6/98 (LCB-100) saturated-and-failed; v7-coderx matches or improves on both.
- Non-LCB benches stay tight. On the short-answer benches (GSM8K / MATH-500 / HE / HE+ / MultiPL-E) p50/p90 length tracks 128e and v6-coder within a few tokens — the targeted prune did not trade length for accuracy on the everyday benches.
Methodology. Per-problem lengths come from omk_eval
token_statsover each bench'ssamples_*.jsonl/lcb_result.samples.jsonl; saturation/PASS-FAIL is computed per problem fromcompletion_tokens+passed. MultiPL-E measures code length, not reasoning (its samples store only the final code block, no<think>trace), so it is a code-conciseness reference rather than a thinking-length signal.
At a glance
| 128e (base) | v7-coderx | v7-coder (sibling) | |
|---|---|---|---|
| Total params | ~26B | ~20.8B | ~20.8B |
| Active / token | ~4B (top-8 + shared) | ~4B | ~4B |
| Experts / layer | 128 | 98 (30 dropped) | 98 (30 dropped) |
| Per-layer floor | — | [24, 40] | [24, 40] |
| Science targeting | — | off | targeted_gpqa 1.5× |
| Shared FFN α | 1.0 | 1.2 (mlp.down_proj) |
1.2 |
| Built from | — | 128e original (fresh prune) | 128e original |
Recipe
The drop map is produced by generate_drop_map_v5.py (omnimergekit) from
per-expert, per-class contribution scores on the rebuilt v7 competence maps
(expert_neuron_v7_code_gpqa.json — 10 classes, audited producers, multilingual
category included), then applied with expert_drop.py, then the shared expert is
upweighted.
1. fs2440 base recipe
target = 98 # 30 experts/layer dropped
protect_top = 16 # 16 highest-scoring experts/layer never dropped
alpha = 2.0 # contribution sharpening exponent
strategy = max # per-expert score = MAX over classes (not mean/geomean)
normalize = rank # rank-normalize within each (layer, class)
breadth_bonus = 0.5 # reward experts useful across many classes (anti-overfit)
v4_floor_map = v4_layer_floor_map_v7.json # per-layer keep floor
v4_floor_data = expert_neuron_base_v7.json
v4_floor_clamp = [24, 40] # floor bounded into this band per layer
outlier_mode = median # clamp bf16 weight-norm artifacts to layer median
outlier_wnorm_thresh = 1e4
baseline = teacher_force_98e_p16_clean.json # tie-break anchor
strategy=max + breadth_bonus is the load-bearing pair — it favours experts
strongly useful to at least one class and broadly useful across classes, the
optimizer-off-manifold
lesson encoded as a recipe. The [24,40] floor is the 98e-scaled analogue of the
62e [15,25] band that won the loop-floor study, and beats [20,35] by ~3.6pp
LCB-55 for a coder.
2. Calibration class weights — code only
Ten contribution classes are scored; the weights steer which specialists survive. v7-coderx zeroes every non-code targeting term:
| Class | v7-coderx | v7-coder |
|---|---|---|
| generic_math | 1 | 1 |
| generic_logic | 1 | 1 |
| generic_code | 3 | 3 |
| generic_science | 1 | 1 |
| generic_creative | 1 | 1 |
| generic_multilingual | 0 | 0 |
| targeted_humaneval | 0 | 0 |
| targeted_humanevalplus | 0 | 0 |
| targeted_lcb_medium_55 | 2 | 2 |
| targeted_gpqa | 0 | 1.5 |
HE/HE+ targeting is off because both already sit at/above the un-targeted baseline;
the protection budget goes to LiveCodeBench-medium, the bench where pruning hurt
most on earlier variants. v7-coderx is exactly v7-coder minus the targeted_gpqa
term — the two share most of their keep set, which is why HE+ and IFEval match to
the point and only LCB-55 (up) and GPQA (down) move.
3. Mandatory shared-FFN α=1.2 (cohort rule)
After expert drop, router_shared_upweight.py --alpha 1.2 --target mlp.down_proj.weight
upweights Gemma 4's always-on shared FFN. Every coder variant carries this; omitting
it yields the "weak / ruminating" pre-shared baseline and makes cross-variant
comparison unfair. A .shared_applied marker records it.
Intended use
A compact (~12–13 GB at Q4_K_M / NVFP4A16, fits a single 12–16 GB GPU) Gemma 4
checkpoint for maximal coding throughput and instruction-following — the
code-extreme (x) member of the v7-coder cohort. If your workload also needs strong
graduate science, use v7-coder,
which trades ~1.8pp LCB-55 for ~+22pp GPQA.
Inherits Gemma 4's thinking format — serve with the reasoning parser enabled
(--reasoning-parser gemma4 on vLLM; --reasoning-format deepseek --reasoning-budget 8192
on llama-server).
Limitations
A research prune, not an official Google release. Expert pruning trades breadth for
size: generic_multilingual is de-weighted (0×) and graduate science (GPQA) is
the explicit budget axis — at 48.48% it is well below the unpruned 128e (67.17% on
the same Q6_K run). Choose v7-coder if science matters. Quality below ~Q3 / 3-bit
degrades on the Gemma 4 MoE — prefer Q4_K_M or higher for production. The GGUF and
NVFP4A16 formats are provided for deployment but are not separately benchmarked.
Lineage
128e → (v4 → v5 → v6-coder code line) → v7 competence-map rebuild → fs2440 code floor = v7-coderx. Built and evaluated on the omnimergekit toolchain.
- Downloads last month
- 47