Gemma 4 A4B 98-Expert v6-coder (C6v3lcb) — LCB-targeted code prune (~20.8B)

Eval complete (Q6_K / llama.cpp). The full canonical 9-bench suite plus the extended LCB-medium-100 and MultiPL-E-100 code benches are filled below, every cell read from summary.json (greedy, cohort-pinned recipe). The 128e and 98e-v5-coder anchor columns are the matching Q6_K reference runs. NVFP4A16 is a deployment format and is not separately benchmarked (cohort policy).

Headline — the LCB targeting worked. LCB-medium-55 96.36% (+10.91pp vs v5-coder, and +1.81pp past the unpruned 128e), closing the −9.10pp hole that motivated the recipe; MultiPL-E macro 88.0% (+7 vs v5-coder, ≈128e); AIME recovers +10pp (53.33 → 63.33). The budget is paid on the non-code axes the LCB-only class weights deprioritized (MATH −3, IFEval −2 vs v5-coder).

A research checkpoint that prunes Gemma 4 26B-A4B-it (128 experts, top-8 + shared, 30 layers) down to 98 experts per layer. Its drop map is the most code-faithful member of the v6-coder family: v5-coder's gentle C6 layer-relevance-weighted v4-floor recipe, re-derived on the corrected v3 code-pass calibration data, then steered specifically at LiveCodeBench-medium, the one code bench where expert pruning hurt most (−9.10pp vs 128e on v5-coder). It keeps the same 98e shape, router, attention, and norms as the rest of the cohort, plus the mandatory shared-FFN α=1.2 upweight all coder variants carry.

This is the head of the C6 → C6v3 → C6v3lcb line: each step holds the gentle C6 recipe fixed and changes exactly one variable (data, then target weighting), so the deltas are attributable.

Quantized formats

| Format | Repo | Notes |
|---|---|---|
| bf16 (this repo) | ManniX-ITA/gemma-4-A4B-98e-v6-coder-it | 9 shards, ~40.9 GB. The C6v3lcb drop map + shared α=1.2. |
| GGUF (llama.cpp / ollama) | ManniX-ITA/gemma-4-A4B-98e-v6-coder-it-GGUF | Bartowski tier sweep (imatrix K-quants; Q4_K_M is plain) + ContribDynamic CD-* per-layer quants + F16 baseline. Tier sweep complete: every tier HE+-scored; see the GGUF card for the per-tier table. |
| NVFP4A16 (vLLM) | ManniX-ITA/gemma-4-A4B-98e-v6-coder-NVFP4A16 (planned) | ~13 GB, native vLLM, via modelopt==0.43.0. Deployment format, not separately benchmarked. |
| Ollama | mannix/gemma4-98e-v6-coder | GGUF tier sweep, ollama pull mannix/gemma4-98e-v6-coder:<tier> (:latest = Q4_K_M). All tiers pushed. |

Eval is llama.cpp / Q6_K only for this cohort. NVFP4A16 is published as a vLLM-deployable format but is not benchmarked separately — the Q6_K scoreboard below is representative of the model's quality.

At a glance

| | 128e (base) | 98e v5-coder | 98e v6-coder (C6v3lcb) |
|---|---|---|---|
| Total params | ~26B | ~20.8B | ~20.8B |
| Active params / token | ~4B (top-8 + shared) | ~4B | ~4B |
| Experts per layer | 128 | 98 (30 dropped) | 98 (30 dropped) |
| Layers | 30 | 30 | 30 |
| Drop map | | C6 layer-relevance v4-floor, breadth=50, _fixed data | C6 v4-floor, breadth=50, v3 data + outlier-fix, LCB-targeted |
| Calibration classes | | weighted code + HE + HE+ + LCB | code (3×) + LCB-medium (2×), HE/HE+ targeting OFF |
| Shared FFN α | 1.0 | 1.0 (none) | 1.2 (mlp.down_proj) |
| Built from | | 98e v4 (re-mapped) | 128e original (fresh prune) |

Recipe

The drop map is produced by generate_drop_map_multiclass.py (omnimergekit) from per-expert, per-class contribution scores, then applied with expert_drop.py, then the shared expert is upweighted. Four design choices define C6v3lcb; each is held constant from v5-coder except where noted.

1. C6 gentle base recipe (unchanged from v5-coder)

The aggregation that ranks experts for dropping:

strategy      = max              # per-expert score = MAX over classes (not mean/geomean)
normalize     = rank             # rank-normalize within each (layer, class) before aggregating
protect_top   = 16               # the 16 highest-scoring experts/layer are never dropped
alpha         = 2.0              # contribution sharpening exponent
breadth_bonus = 0.5              # reward experts active across many classes (anti-overfit)
v4_floor_map  = v4_layer_floor_map.json   # per-layer floor: never drop below v4's keep on protected layers
baseline      = teacher_force_98e_p16_clean.json   # tie-break anchor

strategy=max + breadth_bonus is the load-bearing pair: it favors experts that are strongly useful to at least one class and broadly useful across classes, rather than experts with a high average that no single task depends on. This is the optimizer-off-manifold lesson encoded as a recipe — max/percentile beats mean/geomean for importance.
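The aggregation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in generate_drop_map_multiclass.py: the exact formula (weighted max of rank**alpha plus breadth_bonus times the mean rank**alpha), the function names, and the toy scores are all hypothetical, and the v4 floor map and baseline tie-break are left out.

```python
from typing import Dict, List


def rank_normalize(scores: List[float]) -> List[float]:
    """Map raw scores to ranks in [0, 1] (lowest score -> 0, highest -> 1)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    denom = max(len(scores) - 1, 1)
    ranks = [0.0] * len(scores)
    for pos, idx in enumerate(order):
        ranks[idx] = pos / denom
    return ranks


def keep_scores(per_class: Dict[str, List[float]], weights: Dict[str, float],
                alpha: float = 2.0, breadth_bonus: float = 0.5) -> List[float]:
    """Aggregate one layer's per-class expert scores into a keep score.

    Assumed formula: weighted MAX over classes of rank**alpha, plus
    breadth_bonus * mean(rank**alpha) over the classes with weight > 0.
    """
    active = [c for c in per_class if weights.get(c, 0) > 0]
    sharp = {c: [r ** alpha for r in rank_normalize(per_class[c])] for c in active}
    n_experts = len(per_class[active[0]])
    scores = []
    for e in range(n_experts):
        strongest = max(weights[c] * sharp[c][e] for c in active)
        breadth = sum(sharp[c][e] for c in active) / len(active)
        scores.append(strongest + breadth_bonus * breadth)
    return scores


def choose_drops(scores: List[float], n_drop: int, protect_top: int = 16) -> List[int]:
    """Drop the n_drop lowest-scoring experts, never touching the top protect_top."""
    ascending = sorted(range(len(scores)), key=lambda i: scores[i])
    protected = set(ascending[-protect_top:]) if protect_top > 0 else set()
    return [i for i in ascending if i not in protected][:n_drop]


# Toy layer: 4 experts, 2 classes (hypothetical numbers).
toy = {"generic_code": [0.1, 0.9, 0.5, 0.3], "generic_math": [0.8, 0.2, 0.4, 0.6]}
toy_scores = keep_scores(toy, {"generic_code": 3, "generic_math": 1})
toy_drops = choose_drops(toy_scores, n_drop=1, protect_top=1)
```

In the toy layer, expert 1 is the strongest code expert and is protected, while expert 3 is weak on both classes and is the one dropped. Taking the max rather than the mean means an expert that only one class depends on is still kept.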

2. v3 calibration data + outlier fix (changed from v5-coder)

v5-coder ranked experts using the _fixed code-pass traces, which turned out to be ~86% NaN in the deep layers — accidentally NaN-blind where the model has real signal. v6-coder uses the corrected expert_neuron_v5_code_v3.json (real deep-layer signal restored, T73.0 fp32-hot-path patch), and scrubs the residual bf16 weight-norm artifacts with a median-based outlier clamp:

data                = expert_neuron_v5_code_v3.json
outlier_mode        = median
outlier_wnorm_thresh = 1e4       # clamp expert weight-norms above 1e4 to the layer median
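
A rough sketch of what a median-based clamp with this threshold could look like. The function name is hypothetical, and taking the median over in-range values only is an assumption; the card states only that norms above 1e4 are clamped to the layer median.

```python
from statistics import median
from typing import List


def clamp_wnorm_outliers(wnorms: List[float], thresh: float = 1e4) -> List[float]:
    """Replace per-expert weight-norms above `thresh` with the layer median."""
    inliers = [w for w in wnorms if w <= thresh]
    # Assumption: the median is taken over in-range values so that the
    # outliers being removed cannot skew their own replacement value.
    layer_median = median(inliers) if inliers else thresh
    return [layer_median if w > thresh else w for w in wnorms]


# One layer with a single bf16-style artifact (toy numbers).
cleaned = clamp_wnorm_outliers([1.0, 2.0, 3.0, 5e6])
```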

Calibration corpus (v5_code_pass_traces.json, 360 traces): Tier-A 128-token comprehension prompts (342 traces, 1×) + Tier-B 2048-token windowed pass-traces (18 traces, 3×) — the long-window tier captures sustained code reasoning, which is where pruned variants ruminate.

3. LCB-only targeting — the "lcb" in C6v3lcb (changed)

Eight contribution classes are scored; the class weights steer which specialists are protected. v6-coder zeroes the HumanEval targeting and concentrates the targeted budget on LiveCodeBench-medium:

| Class | C6v3 weight | C6v3lcb weight |
|---|---|---|
| generic_math | 1 | 1 |
| generic_logic | 1 | 1 |
| generic_code | 3 | 3 |
| generic_science | 1 | 1 |
| generic_creative | 1 | 1 |
| targeted_humaneval | 2 | 0 |
| targeted_humanevalplus | 2 | 0 |
| targeted_lcb_medium_55 | 2 | 2 |

Rationale: HE/HE+ were already above 128e on v5-coder (+3.05 / +2.44 on the Q6_K scoreboard), so spending protection budget on them is wasted; LCB-medium was the −9.10pp hole. Removing the HE targeting frees the floor to protect LCB-relevant experts harder. The resulting map is 95% identical to C6v3 (generic_code 3× dominates), so HE+/IFEval are expected at ≈C6v3; the bench that tests the hypothesis is LCB-medium.
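
The two weight vectors can be put side by side to show exactly what changes between C6v3 and C6v3lcb. This is an illustrative sketch: the helper names are hypothetical, and it assumes the table's row order is the positional order of the --class-weights values.

```python
CLASSES = [
    "generic_math", "generic_logic", "generic_code", "generic_science",
    "generic_creative", "targeted_humaneval", "targeted_humanevalplus",
    "targeted_lcb_medium_55",
]
C6V3 = [1, 1, 3, 1, 1, 2, 2, 2]
C6V3LCB = [1, 1, 3, 1, 1, 0, 0, 2]


def changed_classes(old, new):
    """Names of the classes whose weight differs between two recipes."""
    return [c for c, a, b in zip(CLASSES, old, new) if a != b]


def cli_arg(weights):
    """Render a weight vector as the positional CLI argument."""
    return "--class-weights " + " ".join(str(w) for w in weights)


diff = changed_classes(C6V3, C6V3LCB)
arg = cli_arg(C6V3LCB)
```

Only the two HumanEval classes change; every other weight, including the LCB-medium one, is identical in both recipes.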

4. Mandatory shared-FFN α=1.2 (cohort rule)

After expert drop, router_shared_upweight.py --alpha 1.2 --target mlp.down_proj.weight scales the always-on shared expert's down-projection by 1.2×. Without it the pruned model is the "weak/ruminating" pre-T18 baseline; every coder variant carries it or the v5-coder comparison is unfair. Verified by the .shared_applied marker in this repo.
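
Because the down-projection is a linear map, scaling its weight by α scales the shared expert's output by the same factor. The toy sketch below shows this with plain Python lists; it is an illustration only, not the logic of router_shared_upweight.py (which operates on the checkpoint tensors), and the helper names and matrix values are hypothetical.

```python
def scale_down_proj(weight, alpha=1.2):
    """Return a copy of a 2-D weight matrix with every entry scaled by alpha."""
    return [[alpha * w for w in row] for row in weight]


def matvec(weight, x):
    """Plain matrix-vector product, standing in for the down-projection."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


W = [[1.0, 2.0], [0.5, -1.0]]   # toy down_proj weight
h = [2.0, 3.0]                  # toy hidden activation entering down_proj

before = matvec(W, h)
after = matvec(scale_down_proj(W, alpha=1.2), h)
```

Every component of `after` is 1.2× the matching component of `before`, so the shared expert's contribution to the residual stream grows by that factor while the routed experts are untouched.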


Expert mapping

  • Uniform 30 experts dropped per layer, all 30 layers → 98 kept/layer (128 → 98).
  • Mean overlap with the teacher-force baseline: 17.8 / 30 dropped per layer (min 11, max 24). The recipe re-ranks ~12 of 30 dropped slots/layer away from the naive teacher-force drop — that re-ranking is where the code-faithfulness lives.
  • Aggregated keep scores (rank-normalized, α=2.0, breadth-bonus) span ~0.0–3.5 per layer with mean ~1.85; the protected top-16 sit at the high end, the dropped 30 at the low end, with boundary_ties_within_1pct 2–39/layer (highest in the shallow layers 0/10 — those layers have the flattest expert-importance profile).

Full per-layer keep lists are in expert_drop_metadata.json (per_layer_keep) in this repo; the ranking provenance (per-layer agg min/max/mean, ties, overlap-vs-baseline, v5-only non-overlap experts) is in scripts/v6coder_C6v3lcb_drop_map.json.summary.json.
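
The overlap-vs-baseline figures quoted above (mean / min / max dropped experts shared per layer) are a set intersection per layer. A minimal sketch, assuming both drop maps are available as a mapping from layer index to a list of dropped expert ids; that layout and the function name are assumptions, not the documented schema of the JSON files.

```python
from typing import Dict, List


def overlap_stats(drop_map: Dict[int, List[int]],
                  baseline: Dict[int, List[int]]) -> Dict[str, float]:
    """Per-layer count of experts dropped by BOTH maps, summarized."""
    overlaps = [len(set(drop_map[layer]) & set(baseline[layer]))
                for layer in sorted(drop_map)]
    return {
        "mean": sum(overlaps) / len(overlaps),
        "min": min(overlaps),
        "max": max(overlaps),
    }


# Toy two-layer example (hypothetical expert ids).
stats = overlap_stats({0: [1, 2, 3], 1: [4, 5, 6]},
                      {0: [2, 3, 9], 1: [7, 8, 6]})
```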


Problems solved

  1. Rumination on hard code/reasoning problems. Pruned Gemma 4 variants fall into <think> loops on the hardest LCB / GPQA / AIME problems, blowing the token budget without converging. C6v3lcb attacks this two ways: (a) the LCB-targeted drop map keeps the experts that the hard-problem pass-traces actually use; (b) at eval time the lcb_medium_*_v4 template applies a thinking_token_budget=12288 forcing function (parser=gemma4 + enable_thinking) that caps the rumination loop. T109 rumination-signal gate (GPQA-48 / AIME probes) confirmed the variant generates and converges rather than looping.
  2. Data-vs-recipe disentanglement (T102). v6-coder isolates whether v3's earlier regressions came from the data or the recipe by holding v5-coder's gentle C6 recipe fixed and swapping in only the v3 data and the outlier fix. C6v3 ≈ v5-coder in smoke tests, so the v3 data is fine and C12's aggressive recipe was the rumination cause; C6v3lcb then re-steers that clean baseline at LCB.
  3. Deep-layer calibration corruption (T73.0). The _fixed traces were ~86% NaN in deep layers; v6-coder uses real-signal v3 data with a median outlier clamp so deep-layer experts are ranked on genuine contribution, not NaN-blindness.
  4. Unfair-comparison trap. The shared α=1.2 step is mandatory and marker-verified, so v6-coder vs v5-coder is apples-to-apples.

Scoreboard — Q6_K GGUF, llama.cpp, greedy

Full 9-bench llama.cpp Q6_K run, llama-server --reasoning-format deepseek --reasoning-budget 12288 --parallel 2, greedy (T=0, top_p=1, top_k=0, do_sample=false). The 128e and v5-coder columns are the bartowski-Q6_K / v5-coder-Q6_K reference runs under the identical recipe (apples-to-apples within the llama.cpp/Q6_K backend).

| Bench (n) | 128e Q6_K | v5-coder Q6_K | v6-coder Q6_K | Δ (v6 − v5c) | Δ (v6 − 128e) |
|---|---|---|---|---|---|
| ARC-Challenge-chat (1172) | 97.10% | 95.73% | 95.82% | +0.09 | −1.28 |
| GPQA Diamond flex (198) | 72.73% | 65.15% | 67.17% | +2.02 | −5.56 |
| GSM8K-100 flex | 92.00% | 87.00% | 91.00% | +4.00 | −1.00 |
| MATH-500-100 math_verify | 94.00% | 94.00% | 91.00% | −3.00 | −3.00 |
| AIME 2024 (30) | 83.33% | 53.33% | 63.33% | +10.00 | −20.00 |
| IFEval-100 (prompt_strict) | 97.00% | 94.00% | 92.00% | −2.00 | −5.00 |
| HumanEval-164 chat | 96.34% | 99.39% | 98.78% | −0.61 | +2.44 |
| HumanEval+-164 chat | 90.85% | 93.29% | 93.29% | 0.00 | +2.44 |
| LCB-medium-55 v4 | 94.55% | 85.45% | 96.36% | +10.91 | +1.81 |

Read: v6-coder lands on/above v5-coder on 6 of 9 benches (below on MATH, IFEval, and HumanEval) and beats the unpruned 128e on every code bench (HE +2.44, HE+ +2.44, LCB-55 +1.81). The headline is LCB-medium-55 +10.91pp vs v5-coder: the targeted hole is not just closed but pushed past the base model. AIME recovers +10pp (53.33 → 63.33). The cost lands mainly on MATH (−3) and IFEval (−2) vs v5-coder, the non-code generalist axes the LCB-only class weights (1 1 3 1 1 0 0 2) deliberately deprioritized; HumanEval slips by one problem (−0.61).
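
The delta columns can be recomputed directly from the three score columns of the scoreboard. The sketch below copies those scores into a dict and derives both deltas; the dict layout and function name are just for illustration.

```python
# (128e, v5-coder, v6-coder) accuracy in percent, copied from the scoreboard.
SCORES = {
    "ARC-Challenge-chat": (97.10, 95.73, 95.82),
    "GPQA Diamond flex": (72.73, 65.15, 67.17),
    "GSM8K-100 flex": (92.00, 87.00, 91.00),
    "MATH-500-100": (94.00, 94.00, 91.00),
    "AIME 2024": (83.33, 53.33, 63.33),
    "IFEval-100": (97.00, 94.00, 92.00),
    "HumanEval-164": (96.34, 99.39, 98.78),
    "HumanEval+-164": (90.85, 93.29, 93.29),
    "LCB-medium-55 v4": (94.55, 85.45, 96.36),
}


def deltas(scores):
    """Per bench: (v6 - v5-coder, v6 - 128e) in percentage points."""
    return {bench: (round(v6 - v5c, 2), round(v6 - base, 2))
            for bench, (base, v5c, v6) in scores.items()}


table = deltas(SCORES)
for bench, (d_v5c, d_base) in table.items():
    print(f"{bench:22s} {d_v5c:+7.2f} {d_base:+7.2f}")
```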

Extended code benches — LCB-medium-100 + MultiPL-E-100 (Q6_K, llama.cpp, greedy)

Run on solidpc (T112), same Q6_K / greedy recipe; MultiPL-E scored via the nuprl/multipl-e-evaluation Docker image. 128e and v5-coder columns from the v5-coder card.

LCB-medium-100 (100-problem superset of LCB-medium-55 v4):

| Bench (n) | 128e Q6_K | v5-coder Q6_K | v6-coder Q6_K | Δ (v6 − v5c) |
|---|---|---|---|---|
| LCB-medium-100 | 95.00% | 91.00% | 96.00% | +5.00 |

(v6-coder also clears 128e on LCB-100 by +1.00pp — the LCB-targeting win holds on the 100-problem superset, not just the 55-problem v4 set.)

MultiPL-E-100 (HumanEval → Rust / Java / JS, 100/lang, chat-mode + code extraction):

| Language (n=100) | 128e Q6_K | v5-coder Q6_K | v6-coder Q6_K | Δ (v6 − v5c) |
|---|---|---|---|---|
| Rust | 83.00% | 76.00% | 82.00% | +6.00 |
| Java | 91.00% | 81.00% | 89.00% | +8.00 |
| JavaScript | 95.00% | 86.00% | 93.00% | +7.00 |
| Macro mean | 89.67% | 81.00% | 88.00% | +7.00 |

v6-coder near-fully recovers MultiPL-E to the 128e level (macro −1.67pp) from v5-coder's −8.67pp gap — code generalization in non-Python languages tracks the LCB-targeting win, not just the in-distribution LCB benches. (264/300 passed; micro = macro = 88.0%.)
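
The macro and micro means coincide here because every language has the same n. A short check, with pass counts read off the v6-coder percentages at n=100 per language (the variable names are illustrative):

```python
PASSED = {"Rust": 82, "Java": 89, "JavaScript": 93}  # v6-coder passes out of 100
N_PER_LANG = 100

# Macro: average the per-language pass rates.
macro = sum(p / N_PER_LANG for p in PASSED.values()) / len(PASSED)
# Micro: pool all problems, then take one pass rate.
total_passed = sum(PASSED.values())
micro = total_passed / (N_PER_LANG * len(PASSED))
```

With unequal per-language sample sizes the two would diverge, and the macro mean would be the one comparable to the table row.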


Answer-length analysis (anti-rumination)

v6-coder thinks longer on average than 128e on the code benches — it spends the full bounded thinking_token_budget=12288 where 128e often answers in a few hundred tokens. The question this section answers is whether that extra length is rumination (long thinking that fails to reach an answer — the failure mode the LCB targeting was built to fix) or productive reasoning that lands on a PASS. The tables below compare per-problem completion length against 128e and v5-coder on the same problems, from omk_eval token_stats (characters from the raw completion text; tokens via the 128e tokenizer), on the real-n benches. 128e and v5-coder are the matching Q6_K runs. The short answer: by the rumination-as-wasted-thinking definition, v6-coder ruminates less than 128e, not more — 128e's longest LCB outputs (up to 51k chars) are failures, while v6-coder's long outputs overwhelmingly pass.

Per-problem completion length — characters (p50 / p90 / max):

| Bench (n) | 128e | v5-coder | v6-coder |
|---|---|---|---|
| GPQA Diamond (198) | 2558 / 17661 / 26572 | 2602 / 16768 / 26518 | 2584 / 17162 / 25218 |
| AIME 2024 (30) | 2091 / 6192 / 7198 | 2405 / 9239 / 10974 | 2190 / 8404 / 9689 |
| LCB-medium-55 | 4369 / 14894 / 51568 | 31222 / 38456 / 44868 | 30685 / 36487 / 41301 |
| LCB-medium-100 | 1899 / 14894 / 51568 | 30035 / 36846 / 44868 | 29514 / 36221 / 43845 |
| MultiPL-E-100 (300) | 244 / 592 / 2944 | 257 / 764 / 3083 | 246 / 594 / 2169 |
| MATH-500 (100) | 1054 / 1970 / 7520 | 1145 / 1925 / 7189 | 1087 / 1961 / 8420 |
| GSM8K (100) | 254 / 699 / 3438 | 264 / 698 / 2386 | 272 / 682 / 1332 |
| IFEval (100) | 781 / 3595 / 7425 | 862 / 4150 / 17539 | 850 / 3803 / 7200 |
| HumanEval (164) | 704 / 1303 / 8578 | 696 / 1408 / 3923 | 721 / 1296 / 4033 |
| HumanEval+ (164) | 684 / 1238 / 5704 | 709 / 1419 / 3578 | 715 / 1451 / 4498 |
| ARC-Challenge (1172) | 1203 / 1662 / 6407 | 1201 / 1655 / 43570 | 1217 / 1663 / 43927 |

Per-problem completion length — tokens (p50 / p90 / max):

| Bench (n) | 128e | v5-coder | v6-coder |
|---|---|---|---|
| GPQA Diamond (198) | 843 / 8189 / 8189 | 855 / 8189 / 8189 | 876 / 8189 / 8189 |
| AIME 2024 (30) | 960 / 3993 / 4021 | 1156 / 4012 / 4022 | 939 / 4011 / 4021 |
| LCB-medium-55 | 1213 / 4644 / 15585 | 12804 / 13295 / 15724 | 12799 / 13184 / 15886 |
| LCB-medium-100 | 565 / 4644 / 15947 | 12726 / 13156 / 15945 | 12735 / 13103 / 15724 |
| MultiPL-E-100 (300) | 85 / 184 / 1012 | 89 / 247 / 1017 | 86 / 184 / 1013 |
| MATH-500 (100) | 423 / 925 / 3035 | 445 / 802 / 3319 | 404 / 872 / 3318 |
| GSM8K (100) | 123 / 253 / 1107 | 114 / 221 / 858 | 120 / 254 / 396 |
| IFEval (100) | 202 / 826 / 1427 | 217 / 912 / 4058 | 234 / 850 / 4060 |
| HumanEval (164) | 226 / 431 / 2890 | 226 / 413 / 1363 | 226 / 443 / 1305 |
| HumanEval+ (164) | 223 / 430 / 1744 | 229 / 420 / 1209 | 226 / 495 / 1251 |
| ARC-Challenge (1172) | 255 / 359 / 1453 | 258 / 360 / 16266 | 262 / 361 / 16266 |
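
For readers who want to recompute a p50 / p90 / max triple from their own per-problem lengths, a minimal sketch follows. The percentile convention used by omk_eval token_stats is not stated in this card, so the nearest-rank method below is an assumption and may differ from the published numbers by a rank.

```python
from typing import List, Tuple


def length_summary(lengths: List[int]) -> Tuple[int, int, int]:
    """Return (p50, p90, max) using the nearest-rank percentile method."""
    ordered = sorted(lengths)
    n = len(ordered)

    def pct(q: int) -> int:
        # Nearest rank: smallest k with k/n >= q/100 (integer ceil division).
        k = max(1, (q * n + 99) // 100)
        return ordered[k - 1]

    return pct(50), pct(90), ordered[-1]


summary = length_summary([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
```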

Budget-saturation incidence — share of problems whose completion reached ≥12k tokens (at/near the thinking_token_budget=12288 cap). Saturation by itself is not rumination: a saturated output that reaches a correct PASS is productive use of the budget. The pruned variants saturate on nearly everything; 128e almost never does — but see the next table.

| Bench (n) | 128e | v5-coder | v6-coder |
|---|---|---|---|
| LCB-medium-55 | 1 / 55 (1.8%) | 54 / 55 (98.2%) | 53 / 55 (96.4%) |
| LCB-medium-100 | 2 / 100 (2.0%) | 99 / 100 (99.0%) | 97 / 100 (97.0%) |

Rumination — long thinking that fails to PASS. The right metric is not median length (128e looks short only because it answers easy problems fast). It is the share of the model's budget-saturated outputs that still fail — i.e. tokens burned without a correct answer:

| Bench (n) | 128e | v5-coder | v6-coder |
|---|---|---|---|
| LCB-medium-55, saturated-and-failed | 1 / 1 (100%) | 8 / 54 (15%) | 2 / 53 (3.8%) |
| LCB-medium-100, saturated-and-failed | 2 / 2 (100%) | 9 / 99 (9%) | 4 / 97 (4.1%) |
| LCB-100, mean completion tokens, PASS vs FAIL | 1468 vs 8368 | 12687 vs 14241 | 12616 vs 14302 |

Every 128e output that reaches the budget cap is a failure (100%), and 128e's failed problems run 5.7× longer than its passed ones (mean 8368 vs 1468 tok): 128e only thinks long when it is lost. v6-coder thinks long on ~97% of problems but 96% of those long outputs PASS — its long thinking is overwhelmingly productive, not rumination.
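
The saturation and saturated-and-failed metrics above reduce to two filters over per-problem records. A sketch under assumptions: the `Sample` record and its field names are hypothetical stand-ins for whatever the samples archive stores, and the 12,000-token threshold follows the "≥12k tokens" definition used in this section.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    doc_id: str
    completion_tokens: int
    passed: bool


def rumination_stats(samples: List[Sample], sat_threshold: int = 12_000) -> Dict[str, float]:
    """Count budget-saturated outputs and how many of those still fail."""
    saturated = [s for s in samples if s.completion_tokens >= sat_threshold]
    sat_failed = [s for s in saturated if not s.passed]
    return {
        "n": len(samples),
        "saturated": len(saturated),
        "saturated_and_failed": len(sat_failed),
        # Share of long outputs that burned the budget without a PASS.
        "wasted_rate": len(sat_failed) / len(saturated) if saturated else 0.0,
    }


# Toy run (hypothetical records): three long outputs, one of which fails.
toy_stats = rumination_stats([
    Sample("a", 12_800, True),
    Sample("b", 13_000, False),
    Sample("c", 600, True),
    Sample("d", 12_100, True),
])
```

A model that saturates often but rarely fails when it does gets a low `wasted_rate`, which is the distinction this section draws between productive long thinking and rumination.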

Key findings:

  • LCB — v6-coder ruminates less than 128e, under the only definition that matters (long thinking that fails to reach a PASS). The naive read of the median says the opposite — 128e's median LCB completion is ~0.6–1.2k tokens vs v6-coder's budget-capped ~12.7k — but that short median is an artifact of mixing easy and hard problems. 128e only thinks long when it is lost, and then it fails: every 128e output that saturates the budget is a failure (2/2 on LCB-100, 1/1 on LCB-55), and the single longest output across all three models — 51,568 chars / 15,585 tokens on lcb/leetcode/3659 — is a 128e failure, one that both v6-coder (PASS, 12,691 tok / 32,676 chars) and v5-coder solve in two-thirds the length. v6-coder, by contrast, saturates the budget on ~97% of problems but 96% of those long outputs PASS; its longest LCB-55 output is a correct answer and its max length (41–43k chars) sits below 128e's 51k. Wasted long-thinking rate: 128e 100%, v6-coder ~4%. And v6-coder beats 128e on score on both axes (96.36 / 96.00 vs 94.55 / 95.00) — more correct and a lower worst-case tail.
  • 128e's routing is erratic in both directions, and the targeted map repairs both. 128e fails LCB two ways: by over-thinking (3659: 51k chars, no answer) and by under-thinking (lcb/leetcode/3000: a premature wrong answer in 579 tokens; 3793: 3101 tok) — on the very same problems v6-coder spends the full ~13k-token budget and passes. This bidirectional instability is the fingerprint of an imperfect base router; the 51k outliers and the lower 128e score are the symptom. v6-coder's LCB-targeted expert map corrects 3 of the 5 problems 128e fails on LCB-100 (2 of 3 on LCB-55). The thinking_token_budget=12288 forcing-function only bounds the loop; the targeted map is what makes the bounded thinking land on a PASS instead of a 51k-char dead end. (Versus v5-coder, v6-coder is statistically the same length — median 12799 vs 12804 tok on LCB-55 — yet passes far more: 53/55 vs 47/55, 96/100 vs 91/100.)
  • AIME — genuinely less rumination, +10pp. Median completion 939 tok vs v5-coder's 1156, p90 char tail 8404 < 9239, scoring 63.33% vs 53.33% — shorter chains and more correct.
  • GSM8K — tightest of the three (max 396 tok; 128e 1107, v5-coder 858); no rumination tail.
  • IFEval — no ruminating outlier (max 7200 chars vs v5-coder's 17539; ≈128e 7425).
  • GPQA — a base-model ceiling, not a prune artifact. p90 completion tokens peg at ~8189 on all three models; the hard-question rumination is inherited from Gemma 4 and is identical across the cohort, not widened by pruning.
  • ARC — one shared single-sample outlier (~16266 tok / ~44k chars) on both pruned variants, absent on 128e; p50/p90 normal and identical, so not systemic length growth.
  • MultiPL-E: most compact code of the three at the tail. On the bench where the answer is the code (extracted final block, no reasoning trace), v6-coder has the shortest maximum (2169 chars vs 128e 2944 / v5-coder 3083) and a p90 on par with 128e (594 chars / 184 tok vs 592 / 184), at ≈128e pass-rate (macro 88.0%).
  • Net: v6-coder is more accurate than 128e on both LCB benches (96.36 / 96.00 vs 94.55 / 95.00) and within 1.67pp of it on MultiPL-E (macro 88.0 vs 89.67), at a higher average inference cost (it thinks the full bounded budget) but a lower worst case (max 41–43k chars vs 128e's 51k) and a far lower wasted-thinking rate (~4% vs 100%). On the short-reasoning benches (AIME / GSM8K / IFEval) it additionally reduces length vs v5-coder. The targeted map never traded length for accuracy: it traded 128e's erratic, sometimes-catastrophic tail for a higher but bounded and productive average.

Methodology / data integrity. Per-problem lengths come from omk_eval token_stats over each bench's per-problem samples archive — lm-eval samples_*.jsonl for the 9-bench, and the native runners' lcb_result.samples.jsonl / mpe_result.samples.jsonl (mirrored into the durable sqlite resume cache) for LCB/MPE. MultiPL-E is tabulated, but its rows measure extracted-code length, not reasoning: its samples store only the final code block (no <think> trace, finish_reasons all terminal, thinking_tokens_est=0), so MPE is a code-conciseness reference — not a rumination metric — on which v6-coder is the most compact. The reasoning-bearing benches (LCB / GPQA / AIME / MATH) carry the rumination signal; MPE tokens there are tokenizer-estimated (the code-extraction path returns no server token count). thinking_tokens_est is 0 across all benches under llama.cpp --reasoning-format deepseek (reasoning is emitted inline, not as a separable block), so the completion-token columns are the length signal. The LCB/MPE token_stats are computed over per-problem rows keyed on doc_id (the native runners now emit it, fixed 2026-05-25) — earlier card revisions briefly collapsed these to n=1; all numbers here are at full n (LCB-55 n=55, LCB-100 n=100, MultiPL-E n=300).
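
The n=1 collapse mentioned above is the kind of problem a keyed load with a row-count check catches early. A hedged sketch: the JSONL layout and every field other than doc_id are assumptions about the samples files, and the function name is hypothetical.

```python
import json
from typing import Dict, Optional


def rows_by_doc_id(jsonl_text: str, expected_n: Optional[int] = None) -> Dict[str, dict]:
    """Parse JSONL sample rows into {doc_id: row}; the last row per doc_id wins."""
    rows: Dict[str, dict] = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        rows[str(row["doc_id"])] = row
    if expected_n is not None and len(rows) != expected_n:
        raise ValueError(f"expected {expected_n} problems, found {len(rows)}")
    return rows


# Toy archive with one duplicated doc_id (hypothetical fields).
toy_jsonl = "\n".join([
    '{"doc_id": "p1", "completion_tokens": 500}',
    '{"doc_id": "p2", "completion_tokens": 12700}',
    '{"doc_id": "p1", "completion_tokens": 520}',
])
toy_rows = rows_by_doc_id(toy_jsonl, expected_n=2)
```

Passing expected_n (55, 100, or 300 for the benches here) turns a silent collapse into an immediate error.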


Reproduce

# 1. drop map (omnimergekit)
python scripts/generate_drop_map_multiclass.py \
    --data scripts/expert_neuron_v5_code_v3.json --target 98 \
    --protect-top 16 --alpha 2.0 --strategy max --normalize rank \
    --breadth-bonus 0.5 --v4-floor-map scripts/v4_layer_floor_map.json \
    --baseline-drop-map scripts/teacher_force_98e_p16_clean.json \
    --outlier-mode median --outlier-wnorm-thresh 1e4 \
    --class-weights 1 1 3 1 1 0 0 2 \
    --output scripts/v6coder_C6v3lcb_drop_map.json

# 2. prune + mandatory shared upweight (see scripts/build_v6coder_C6v3lcb.sh)
python scripts/expert_drop.py --source-dir google/gemma-4-26B-A4B-it \
    --drop-map scripts/v6coder_C6v3lcb_drop_map.json --suffix=-v6-coder-C6v3lcb-it
python scripts/router_shared_upweight.py \
    --model-dir google/gemma-4-A4B-98e-v6-coder-C6v3lcb-it \
    --alpha 1.2 --target mlp.down_proj.weight
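
After the build, a quick sanity check on the result can catch a mis-applied drop map. This is a sketch under assumptions: per_layer_keep is read from expert_drop_metadata.json as described in this card, but its exact shape (assumed here to be a mapping from layer to a list of kept expert ids) is not documented, and the function names are hypothetical.

```python
from pathlib import Path
from typing import Dict, List


def check_keep_map(per_layer_keep: Dict[str, List[int]], n_layers: int = 30,
                   n_keep: int = 98, n_experts: int = 128) -> List[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    if len(per_layer_keep) != n_layers:
        problems.append(f"expected {n_layers} layers, got {len(per_layer_keep)}")
    for layer, kept in per_layer_keep.items():
        if len(set(kept)) != n_keep:
            problems.append(f"layer {layer}: {len(set(kept))} unique experts kept")
        if any(not 0 <= e < n_experts for e in kept):
            problems.append(f"layer {layer}: expert id out of range")
    return problems


def has_shared_marker(model_dir: str) -> bool:
    """True if the .shared_applied marker file is present in the model dir."""
    return (Path(model_dir) / ".shared_applied").exists()


# Toy check: a well-formed 30-layer map that keeps experts 0..97 in each layer.
ok_map = {str(layer): list(range(98)) for layer in range(30)}
ok_problems = check_keep_map(ok_map)
```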

Lineage & provenance

  • Base: google/gemma-4-26B-A4B-it (128e, fresh prune — not derived from v4/v5 weights).
  • Cohort: v5-coder (C6 _fixed) → C6v3 (C6 v3-data) → C6v3lcb (this, LCB-targeted).
  • Tooling: omnimergekit (expert_drop.py, generate_drop_map_multiclass.py, router_shared_upweight.py); eval via omk_eval.py + EVAL_PROTOCOL.
  • Eval sampler is frozen GREEDY across the whole cohort for apples-to-apples comparison.

License: Gemma Terms of Use. This is a derivative of Gemma 4 and inherits those terms.
