Gemma 4 A4B 98-Expert v6-coder (C6v3lcb) — LCB-targeted code prune (~20.8B)
Eval complete (Q6_K / llama.cpp). The full canonical 9-bench suite plus the extended LCB-medium-100 and MultiPL-E-100 code benches are filled below, every cell read from summary.json (greedy, cohort-pinned recipe). The 128e and 98e-v5-coder anchor columns are the matching Q6_K reference runs. NVFP4A16 is a deployment format and is not separately benchmarked (cohort policy).
Headline: the LCB targeting worked. LCB-medium-55 96.36% (+10.91pp vs v5-coder, and +1.81pp past the unpruned 128e), closing the −9.10pp hole that motivated the recipe; MultiPL-E macro 88.0% (+7 vs v5-coder, ≈128e); AIME recovers +10pp (53.33 → 63.33). The budget is paid on the non-code axes the LCB-only class weights deprioritized (MATH −3, IFEval −2 vs v5-coder).
A research checkpoint that prunes the unpruned Gemma 4 26B-A4B-it (128 experts, top-8 + shared, 30 layers) down to 98 experts per layer using a drop map that is the most code-faithful member of the v6-coder family: v5-coder's gentle C6 layer-relevance-weighted v4-floor recipe, re-derived on the corrected v3 code-pass calibration data, then steered specifically at LiveCodeBench-medium — the one code bench where expert pruning hurt most (−9.10pp vs 128e on v5-coder). Same 98e shape, same router, same attention, same norms as the rest of the cohort, plus the mandatory shared-FFN α=1.2 upweight all coder variants carry.
This is the head of the C6 → C6v3 → C6v3lcb line: each step holds the gentle C6 recipe fixed and changes exactly one variable (data, then target weighting), so the deltas are attributable.
Quantized formats
| Format | Repo | Notes |
|---|---|---|
| bf16 (this repo) | ManniX-ITA/gemma-4-A4B-98e-v6-coder-it | 9 shards, ~40.9 GB. The C6v3lcb drop map + shared α=1.2. |
| GGUF (llama.cpp / ollama) | ManniX-ITA/gemma-4-A4B-98e-v6-coder-it-GGUF | Bartowski tier sweep (imatrix K-quants; Q4_K_M is plain) + ContribDynamic CD-* per-layer quants + F16 baseline. Tier sweep complete: every tier HE+-scored; see the GGUF card for the per-tier table. |
| NVFP4A16 (vLLM) | ManniX-ITA/gemma-4-A4B-98e-v6-coder-NVFP4A16 (planned) | ~13 GB, native vLLM, via modelopt==0.43.0. Deployment format, not separately benchmarked. |
| Ollama | mannix/gemma4-98e-v6-coder | GGUF tier sweep, ollama pull mannix/gemma4-98e-v6-coder:<tier> (:latest = Q4_K_M). All tiers pushed. |
Eval is llama.cpp / Q6_K only for this cohort. NVFP4A16 is published as a vLLM-deployable format but is not benchmarked separately — the Q6_K scoreboard below is representative of the model's quality.
At a glance
| | 128e (base) | 98e v5-coder | 98e v6-coder (C6v3lcb) |
|---|---|---|---|
| Total params | ~26B | ~20.8B | ~20.8B |
| Active params / token | ~4B (top-8 + shared) | ~4B | ~4B |
| Experts per layer | 128 | 98 (30 dropped) | 98 (30 dropped) |
| Layers | 30 | 30 | 30 |
| Drop map | — | C6 layer-relevance v4-floor, breadth=50, _fixed data | C6 v4-floor, breadth=50, v3 data + outlier-fix, LCB-targeted |
| Calibration classes weighted | — | code + HE + HE+ + LCB | code (3×) + LCB-medium (2×), HE/HE+ targeting OFF |
| Shared FFN α | 1.0 | 1.0 (none) | 1.2 (mlp.down_proj) |
| Built from | — | 98e v4 (re-mapped) | 128e original (fresh prune) |
Recipe
The drop map is produced by generate_drop_map_multiclass.py (omnimergekit) from
per-expert, per-class contribution scores, then applied with expert_drop.py,
then the shared expert is upweighted. Four design choices define C6v3lcb; each is
held constant from v5-coder except where noted.
1. C6 gentle base recipe (unchanged from v5-coder)
The aggregation that ranks experts for dropping:
strategy = max # per-expert score = MAX over classes (not mean/geomean)
normalize = rank # rank-normalize within each (layer, class) before aggregating
protect_top = 16 # the 16 highest-scoring experts/layer are never dropped
alpha = 2.0 # contribution sharpening exponent
breadth_bonus = 0.5 # reward experts active across many classes (anti-overfit)
v4_floor_map = v4_layer_floor_map.json # per-layer floor: never drop below v4's keep on protected layers
baseline = teacher_force_98e_p16_clean.json # tie-break anchor
strategy=max + breadth_bonus is the load-bearing pair: it favors experts that
are strongly useful to at least one class and broadly useful across classes,
rather than experts with a high average that no single task depends on. This is
the optimizer-off-manifold
lesson encoded as a recipe — max/percentile beats mean/geomean for importance.
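As an illustration of how strategy=max, rank normalization, alpha, breadth_bonus and protect_top could interact, here is a minimal sketch for a single layer. The exact formula inside generate_drop_map_multiclass.py is not given in this card, so the way the class weight multiplies the sharpened rank, the `breadth_cut` threshold, and every function name below are assumptions. The v4 floor map and the baseline tie-break are left out.

```python
from typing import Dict, List


def rank_normalize(scores: List[float]) -> List[float]:
    """Map raw scores to ranks in [0, 1]; the lowest score gets 0, the highest 1."""
    n = len(scores)
    if n == 1:
        return [1.0]
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    for position, idx in enumerate(order):
        ranks[idx] = position / (n - 1)
    return ranks


def keep_scores(per_class: Dict[str, List[float]],
                class_weights: Dict[str, float],
                alpha: float = 2.0,
                breadth_bonus: float = 0.5,
                breadth_cut: float = 0.5) -> List[float]:
    """One keep score per expert for a single layer (assumed formula)."""
    active = [c for c in per_class if class_weights.get(c, 0) > 0]
    normed = {c: rank_normalize(per_class[c]) for c in active}
    n_experts = len(next(iter(per_class.values())))
    scores = []
    for e in range(n_experts):
        # strategy = max: best weighted, alpha-sharpened rank over all classes
        best = max(class_weights[c] * normed[c][e] ** alpha for c in active)
        # breadth: share of classes where the expert ranks in the upper part
        # (breadth_cut is a hypothetical threshold, not taken from the card)
        breadth = sum(normed[c][e] >= breadth_cut for c in active) / len(active)
        scores.append(best + breadth_bonus * breadth)
    return scores


def choose_drops(scores: List[float], n_drop: int, protect_top: int = 16) -> List[int]:
    """Drop the n_drop lowest-scoring experts, never touching the protected top."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    protected = set(order[-protect_top:]) if protect_top > 0 else set()
    candidates = [i for i in order if i not in protected]
    return sorted(candidates[:n_drop])


# Toy layer: 4 experts, 2 classes (values are made up for illustration).
toy_layer = {
    "generic_code": [0.9, 0.1, 0.5, 0.3],
    "generic_math": [0.2, 0.8, 0.4, 0.1],
}
toy_weights = {"generic_code": 3, "generic_math": 1}
toy_scores = keep_scores(toy_layer, toy_weights)
toy_drops = choose_drops(toy_scores, n_drop=1, protect_top=2)
print(toy_scores, toy_drops)
```

In the toy layer, expert 1 is the weakest code expert but the strongest math expert, so the max aggregation keeps it above expert 3, which is weak on both classes. A mean over classes would rank those two much closer together.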
2. v3 calibration data + outlier fix (changed from v5-coder)
v5-coder ranked experts using the _fixed code-pass traces, which turned out to be
~86% NaN in the deep layers — accidentally NaN-blind where the model has real
signal. v6-coder uses the corrected expert_neuron_v5_code_v3.json (real deep-layer
signal restored, T73.0 fp32-hot-path patch), and scrubs the residual bf16 weight-norm
artifacts with a median-based outlier clamp:
data = expert_neuron_v5_code_v3.json
outlier_mode = median
outlier_wnorm_thresh = 1e4 # clamp expert weight-norms above 1e4 to the layer median
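The median clamp can be pictured with the short sketch below. It assumes the median is taken over the in-range norms of the same layer; the card only says "layer median", so that detail and the function name are hypothetical.

```python
from statistics import median
from typing import List


def clamp_weight_norm_outliers(wnorms: List[float], thresh: float = 1e4) -> List[float]:
    """Replace weight norms above thresh with the median of the in-range norms."""
    inliers = [w for w in wnorms if w <= thresh]
    if not inliers:
        # Nothing trustworthy to take a median from; leave the layer untouched.
        return list(wnorms)
    layer_median = median(inliers)
    return [w if w <= thresh else layer_median for w in wnorms]


# Toy layer: one expert carries a bf16 artifact far above the threshold.
clamped = clamp_weight_norm_outliers([1.0, 2.0, 3.0, 5e6])
print(clamped)
```

Replacing the outlier with the median rather than with the threshold keeps the artifact from looking like the most important expert in the layer after rank normalization.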
Calibration corpus (v5_code_pass_traces.json, 360 traces): Tier-A 128-token
comprehension prompts (342 traces, 1×) + Tier-B 2048-token windowed pass-traces
(18 traces, 3×) — the long-window tier captures sustained code reasoning, which is
where pruned variants ruminate.
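A quick way to see how much the 18 long-window traces matter is to compare their share by trace count and by weighted tokens. This is a back-of-the-envelope sketch: it assumes the 1× / 3× multiplier applies per trace and that a trace's contribution grows with its window length, neither of which the card states.

```python
# Tier sizes, window lengths and multipliers as listed for the calibration corpus.
TIERS = {
    "A": {"traces": 342, "window_tokens": 128, "weight": 1},
    "B": {"traces": 18, "window_tokens": 2048, "weight": 3},
}

total_traces = sum(t["traces"] for t in TIERS.values())

# Share of Tier-B when every trace counts once per unit of weight.
weighted_traces = {k: t["traces"] * t["weight"] for k, t in TIERS.items()}
tier_b_trace_share = weighted_traces["B"] / sum(weighted_traces.values())

# Share of Tier-B when contribution also scales with window length (assumption).
weighted_tokens = {k: t["traces"] * t["window_tokens"] * t["weight"] for k, t in TIERS.items()}
tier_b_token_share = weighted_tokens["B"] / sum(weighted_tokens.values())

print(total_traces, round(tier_b_trace_share, 3), round(tier_b_token_share, 3))
```

Under these assumptions Tier-B is 5% of the traces but most of the weighted tokens, which is consistent with the card's point that the long-window tier carries the sustained code-reasoning signal.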
3. LCB-only targeting — the "lcb" in C6v3lcb (changed)
Eight contribution classes are scored; the class weights steer which specialists are protected. v6-coder zeroes the HumanEval targeting and concentrates the targeted budget on LiveCodeBench-medium:
| Class | C6v3 weight | C6v3lcb weight |
|---|---|---|
| generic_math | 1 | 1 |
| generic_logic | 1 | 1 |
| generic_code | 3 | 3 |
| generic_science | 1 | 1 |
| generic_creative | 1 | 1 |
| targeted_humaneval | 2 | 0 |
| targeted_humanevalplus | 2 | 0 |
| targeted_lcb_medium_55 | 2 | 2 |
Rationale: HE/HE+ were already at/above 128e on v5-coder (+1.22 / +1.22), so spending protection budget on them is wasted; LCB-medium was the −9.10pp hole. Removing the HE targeting frees the floor to protect LCB-relevant experts harder. The resulting map is 95% identical to C6v3 (generic_code 3× dominates), so HE+/IFEval are expected at ≈C6v3 — the bench that tests the hypothesis is LCB-medium.
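The weight table maps directly onto the positional --class-weights argument used in the Reproduce section. The sketch below assumes the CLI order is the same as the table's row order (the card's `1 1 3 1 1 0 0 2` matches that reading); the helper names are hypothetical.

```python
CLASS_ORDER = [
    "generic_math",
    "generic_logic",
    "generic_code",
    "generic_science",
    "generic_creative",
    "targeted_humaneval",
    "targeted_humanevalplus",
    "targeted_lcb_medium_55",
]

C6V3 = dict(zip(CLASS_ORDER, [1, 1, 3, 1, 1, 2, 2, 2]))
C6V3LCB = dict(zip(CLASS_ORDER, [1, 1, 3, 1, 1, 0, 0, 2]))


def cli_class_weights(weights):
    """Render the weights in table order as the space-separated CLI value."""
    return " ".join(str(weights[name]) for name in CLASS_ORDER)


def changed_classes(old, new):
    """Classes whose weight differs between two recipes, as (old, new) pairs."""
    return {n: (old[n], new[n]) for n in CLASS_ORDER if old[n] != new[n]}


print(cli_class_weights(C6V3LCB))
print(changed_classes(C6V3, C6V3LCB))
```

Only the two HumanEval classes change between C6v3 and C6v3lcb, which is the single-variable step the lineage relies on.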
4. Mandatory shared-FFN α=1.2 (cohort rule)
After expert drop, router_shared_upweight.py --alpha 1.2 --target mlp.down_proj.weight
scales the always-on shared expert's down-projection by 1.2×. Without it the pruned
model is the "weak/ruminating" pre-T18 baseline; every coder variant carries it or
the v5-coder comparison is unfair. Verified by the .shared_applied marker in this repo.
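Why scaling only mlp.down_proj is enough: assuming the usual gated-MLP layout, where down_proj is the last linear map of the shared FFN, multiplying its weight by α multiplies the shared expert's output by α for every token. The sketch below shows that identity on a toy matrix in plain Python; it is not the code of router_shared_upweight.py, and the function names are hypothetical.

```python
def scale_down_proj(weight, alpha=1.2):
    """Return a copy of the down-projection matrix with every entry scaled by alpha."""
    return [[alpha * w for w in row] for row in weight]


def matvec(weight, hidden):
    """Apply a (rows x cols) matrix to a vector of length cols."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


# Toy down_proj (2x2) and a toy intermediate activation.
down_proj = [[1.0, 2.0], [3.0, 4.0]]
hidden = [1.0, 1.0]

before = matvec(down_proj, hidden)
after = matvec(scale_down_proj(down_proj, alpha=1.2), hidden)
print(before, after)  # each entry of `after` is 1.2x the matching entry of `before`
```

Because the scaling is linear, it changes the balance between the always-on shared path and the routed experts without touching the router or the remaining expert weights.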
Expert mapping
- Uniform 30 experts dropped per layer, all 30 layers → 98 kept/layer (128 → 98).
- Mean overlap with the teacher-force baseline: 17.8 / 30 dropped per layer (min 11, max 24). The recipe re-ranks ~12 of 30 dropped slots/layer away from the naive teacher-force drop — that re-ranking is where the code-faithfulness lives.
- Aggregated keep scores (rank-normalized, α=2.0, breadth-bonus) span ~0.0–3.5
per layer with mean ~1.85; the protected top-16 sit at the high end, the dropped
30 at the low end, with boundary_ties_within_1pct of 2–39/layer (highest in the shallow layers 0/10; those layers have the flattest expert-importance profile).
Full per-layer keep lists are in expert_drop_metadata.json (per_layer_keep) in
this repo; the ranking provenance (per-layer agg min/max/mean, ties, overlap-vs-baseline,
v5-only non-overlap experts) is in scripts/v6coder_C6v3lcb_drop_map.json.summary.json.
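The overlap-vs-baseline figures (mean 17.8 of 30, min 11, max 24) are per-layer set intersections between two drop maps. A minimal sketch of that computation follows, assuming each map is a dict from layer index to a list of dropped expert ids; the real JSON layout may differ, and the toy data is made up.

```python
def overlap_per_layer(drop_map, baseline):
    """Number of dropped experts shared with the baseline, for each layer."""
    return {
        layer: len(set(drop_map[layer]) & set(baseline[layer]))
        for layer in drop_map
    }


def summarize(overlaps):
    """Mean / min / max of the per-layer overlap counts."""
    values = list(overlaps.values())
    return {"mean": sum(values) / len(values), "min": min(values), "max": max(values)}


# Toy example: 2 layers, 4 experts dropped per layer.
toy_map = {0: [1, 2, 3, 4], 1: [0, 5, 6, 7]}
toy_baseline = {0: [2, 3, 8, 9], 1: [5, 6, 7, 9]}

toy_overlaps = overlap_per_layer(toy_map, toy_baseline)
toy_summary = summarize(toy_overlaps)
print(toy_overlaps, toy_summary)
```

With 30 drops per layer and a mean overlap of 17.8, about 12 slots per layer differ from the baseline, which is the re-ranked share quoted above.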
Problems solved
- Rumination on hard code/reasoning problems. Pruned Gemma 4 variants fall into <think> loops on the hardest LCB / GPQA / AIME problems, blowing the token budget without converging. C6v3lcb attacks this two ways: (a) the LCB-targeted drop map keeps the experts that the hard-problem pass-traces actually use; (b) at eval time the lcb_medium_*_v4 template applies a thinking_token_budget=12288 forcing function (parser=gemma4 + enable_thinking) that caps the rumination loop. T109 rumination-signal gate (GPQA-48 / AIME probes) confirmed the variant generates and converges rather than looping.
- Data-vs-recipe disentanglement (T102). v6-coder isolates whether v3's earlier regressions were the data or the recipe: holding v5-coder's gentle C6 recipe fixed and only swapping in v3 data + the outlier fix. C6v3 ≈ v5-coder in smoke → the v3 data is fine and C12's aggressive recipe was the rumination cause; C6v3lcb then re-steers that clean baseline at LCB.
- Deep-layer calibration corruption (T73.0). The _fixed traces were ~86% NaN in deep layers; v6-coder uses real-signal v3 data with a median outlier clamp so deep-layer experts are ranked on genuine contribution, not NaN-blindness.
- Unfair-comparison trap. The shared α=1.2 step is mandatory and marker-verified, so v6-coder vs v5-coder is apples-to-apples.
Scoreboard — Q6_K GGUF, llama.cpp, greedy
Full 9-bench llama.cpp Q6_K run, llama-server --reasoning-format deepseek --reasoning-budget 12288 --parallel 2, greedy (T=0, top_p=1, top_k=0, do_sample=false).
The 128e and v5-coder columns are the bartowski-Q6_K / v5-coder-Q6_K reference runs under
the identical recipe (apples-to-apples within the llama.cpp/Q6_K backend).
| Bench (n) | 128e Q6_K | v5-coder Q6_K | v6-coder Q6_K | Δ (v6 − v5c) | Δ (v6 − 128e) |
|---|---|---|---|---|---|
| ARC-Challenge-chat (1172) | 97.10% | 95.73% | 95.82% | +0.09 | −1.28 |
| GPQA Diamond flex (198) | 72.73% | 65.15% | 67.17% | +2.02 | −5.56 |
| GSM8K-100 flex | 92.00% | 87.00% | 91.00% | +4.00 | −1.00 |
| MATH-500-100 math_verify | 94.00% | 94.00% | 91.00% | −3.00 | −3.00 |
| AIME 2024 (30) | 83.33% | 53.33% | 63.33% | +10.00 | −20.00 |
| IFEval-100 (prompt_strict) | 97.00% | 94.00% | 92.00% | −2.00 | −5.00 |
| HumanEval-164 chat | 96.34% | 99.39% | 98.78% | −0.61 | +2.44 |
| HumanEval+-164 chat | 90.85% | 93.29% | 93.29% | 0.00 | +2.44 |
| LCB-medium-55 v4 | 94.55% | 85.45% | 96.36% | +10.91 | +1.81 |
Read: v6-coder lands on/above v5-coder on 6 of 9 benches (it trails on MATH, IFEval and HumanEval) and beats the unpruned 128e on every code bench (HE +2.44, HE+ +2.44, LCB-55 +1.81). The headline is LCB-medium-55 +10.91pp vs v5-coder: the targeted hole is not just closed but pushed past the base model. AIME recovers +10pp (53.33 → 63.33). The cost lands mainly on MATH (−3) and IFEval (−2) vs v5-coder, the non-code generalist axes the LCB-only class weights (1 1 3 1 1 0 0 2) deliberately deprioritized, with a further −0.61pp slip on HumanEval.
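The Δ columns can be recomputed from the three score columns. The sketch below copies the scoreboard percentages verbatim and derives both deltas plus the list of benches where v6-coder sits below v5-coder; the variable names are illustrative only.

```python
# (128e, v5-coder, v6-coder) Q6_K scores in percent, copied from the scoreboard.
SCORES = {
    "ARC-Challenge-chat": (97.10, 95.73, 95.82),
    "GPQA Diamond flex": (72.73, 65.15, 67.17),
    "GSM8K-100 flex": (92.00, 87.00, 91.00),
    "MATH-500-100": (94.00, 94.00, 91.00),
    "AIME 2024": (83.33, 53.33, 63.33),
    "IFEval-100": (97.00, 94.00, 92.00),
    "HumanEval-164": (96.34, 99.39, 98.78),
    "HumanEval+-164": (90.85, 93.29, 93.29),
    "LCB-medium-55 v4": (94.55, 85.45, 96.36),
}

# Percentage-point deltas: (v6 - v5-coder, v6 - 128e), rounded to 2 decimals.
DELTAS = {
    bench: (round(v6 - v5c, 2), round(v6 - base, 2))
    for bench, (base, v5c, v6) in SCORES.items()
}

below_v5c = [bench for bench, (d_v5c, _) in DELTAS.items() if d_v5c < 0]
above_128e = [bench for bench, (_, d_base) in DELTAS.items() if d_base > 0]

for bench, (d_v5c, d_base) in DELTAS.items():
    print(f"{bench:22s} {d_v5c:+6.2f} {d_base:+6.2f}")
print("below v5-coder:", below_v5c)
print("above 128e:", above_128e)
```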
Extended code benches — LCB-medium-100 + MultiPL-E-100 (Q6_K, llama.cpp, greedy)
Run on solidpc (T112), same Q6_K / greedy recipe; MultiPL-E scored via the
nuprl/multipl-e-evaluation Docker image. 128e and v5-coder columns from the v5-coder card.
LCB-medium-100 (100-problem superset of LCB-medium-55 v4):
| Bench (n) | 128e Q6_K | v5-coder Q6_K | v6-coder Q6_K | Δ (v6 − v5c) |
|---|---|---|---|---|
| LCB-medium-100 | 95.00% | 91.00% | 96.00% | +5.00 |
(v6-coder also clears 128e on LCB-100 by +1.00pp — the LCB-targeting win holds on the 100-problem superset, not just the 55-problem v4 set.)
MultiPL-E-100 (HumanEval → Rust / Java / JS, 100/lang, chat-mode + code extraction):
| Language (n=100) | 128e Q6_K | v5-coder Q6_K | v6-coder Q6_K | Δ (v6 − v5c) |
|---|---|---|---|---|
| Rust | 83.00% | 76.00% | 82.00% | +6.00 |
| Java | 91.00% | 81.00% | 89.00% | +8.00 |
| JavaScript | 95.00% | 86.00% | 93.00% | +7.00 |
| Macro mean | 89.67% | 81.00% | 88.00% | +7.00 |
v6-coder near-fully recovers MultiPL-E to the 128e level (macro −1.67pp) from v5-coder's −8.67pp gap — code generalization in non-Python languages tracks the LCB-targeting win, not just the in-distribution LCB benches. (264/300 passed; micro = macro = 88.0%.)
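The macro and micro means quoted above can be checked from the per-language pass counts. A small sketch using the v6-coder row values (82, 89 and 93 passes out of 100 each):

```python
# v6-coder MultiPL-E-100 pass counts per language, n = 100 problems each.
PASSED = {"Rust": 82, "Java": 89, "JavaScript": 93}
N_PER_LANG = 100

# Macro mean: average of the per-language pass rates.
macro = sum(p / N_PER_LANG for p in PASSED.values()) / len(PASSED) * 100

# Micro mean: total passes over total problems.
total_passed = sum(PASSED.values())
total_problems = N_PER_LANG * len(PASSED)
micro = total_passed / total_problems * 100

print(total_passed, total_problems, round(macro, 2), round(micro, 2))
```

The two means coincide here only because every language has the same number of problems; with unequal n they would diverge.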
Answer-length analysis (anti-rumination)
v6-coder thinks longer on average than 128e on the code benches — it spends the full
bounded thinking_token_budget=12288 where 128e often answers in a few hundred tokens. The
question this section answers is whether that extra length is rumination (long thinking
that fails to reach an answer — the failure mode the LCB targeting was built to fix) or
productive reasoning that lands on a PASS. The tables below compare per-problem completion
length against 128e and v5-coder on the same problems, from omk_eval token_stats (characters
from the raw completion text; tokens via the 128e tokenizer), on the real-n benches. 128e and
v5-coder are the matching Q6_K runs. The short answer: by the rumination-as-wasted-thinking
definition, v6-coder ruminates less than 128e, not more — 128e's longest LCB outputs (up
to 51k chars) are failures, while v6-coder's long outputs overwhelmingly pass.
Per-problem completion length — characters (p50 / p90 / max):
| Bench (n) | 128e | v5-coder | v6-coder |
|---|---|---|---|
| GPQA Diamond (198) | 2558 / 17661 / 26572 | 2602 / 16768 / 26518 | 2584 / 17162 / 25218 |
| AIME 2024 (30) | 2091 / 6192 / 7198 | 2405 / 9239 / 10974 | 2190 / 8404 / 9689 |
| LCB-medium-55 | 4369 / 14894 / 51568 | 31222 / 38456 / 44868 | 30685 / 36487 / 41301 |
| LCB-medium-100 | 1899 / 14894 / 51568 | 30035 / 36846 / 44868 | 29514 / 36221 / 43845 |
| MultiPL-E-100 (300) | 244 / 592 / 2944 | 257 / 764 / 3083 | 246 / 594 / 2169 |
| MATH-500 (100) | 1054 / 1970 / 7520 | 1145 / 1925 / 7189 | 1087 / 1961 / 8420 |
| GSM8K (100) | 254 / 699 / 3438 | 264 / 698 / 2386 | 272 / 682 / 1332 |
| IFEval (100) | 781 / 3595 / 7425 | 862 / 4150 / 17539 | 850 / 3803 / 7200 |
| HumanEval (164) | 704 / 1303 / 8578 | 696 / 1408 / 3923 | 721 / 1296 / 4033 |
| HumanEval+ (164) | 684 / 1238 / 5704 | 709 / 1419 / 3578 | 715 / 1451 / 4498 |
| ARC-Challenge (1172) | 1203 / 1662 / 6407 | 1201 / 1655 / 43570 | 1217 / 1663 / 43927 |
Per-problem completion length — tokens (p50 / p90 / max):
| Bench (n) | 128e | v5-coder | v6-coder |
|---|---|---|---|
| GPQA Diamond (198) | 843 / 8189 / 8189 | 855 / 8189 / 8189 | 876 / 8189 / 8189 |
| AIME 2024 (30) | 960 / 3993 / 4021 | 1156 / 4012 / 4022 | 939 / 4011 / 4021 |
| LCB-medium-55 | 1213 / 4644 / 15585 | 12804 / 13295 / 15724 | 12799 / 13184 / 15886 |
| LCB-medium-100 | 565 / 4644 / 15947 | 12726 / 13156 / 15945 | 12735 / 13103 / 15724 |
| MultiPL-E-100 (300) | 85 / 184 / 1012 | 89 / 247 / 1017 | 86 / 184 / 1013 |
| MATH-500 (100) | 423 / 925 / 3035 | 445 / 802 / 3319 | 404 / 872 / 3318 |
| GSM8K (100) | 123 / 253 / 1107 | 114 / 221 / 858 | 120 / 254 / 396 |
| IFEval (100) | 202 / 826 / 1427 | 217 / 912 / 4058 | 234 / 850 / 4060 |
| HumanEval (164) | 226 / 431 / 2890 | 226 / 413 / 1363 | 226 / 443 / 1305 |
| HumanEval+ (164) | 223 / 430 / 1744 | 229 / 420 / 1209 | 226 / 495 / 1251 |
| ARC-Challenge (1172) | 255 / 359 / 1453 | 258 / 360 / 16266 | 262 / 361 / 16266 |
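Each p50 / p90 / max cell in the two tables above is a summary over per-problem lengths. The sketch below uses the nearest-rank percentile definition; which percentile method omk_eval token_stats actually uses is not stated in the card, so small differences from the tabulated values would be expected. The toy lengths are made up.

```python
def nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def length_summary(lengths):
    """Return (p50, p90, max) for a list of per-problem completion lengths."""
    return nearest_rank(lengths, 50), nearest_rank(lengths, 90), max(lengths)


# Toy bench: nine short completions and one budget-saturated outlier.
toy_lengths = [100, 200, 300, 400, 500, 600, 700, 800, 900, 12000]
toy_summary = length_summary(toy_lengths)
print(toy_summary)
```

The toy case shows why the card reports all three numbers: a single saturated output leaves p50 and p90 unchanged and only shows up in the max column, as with the ARC outlier.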
Budget-saturation incidence — share of problems whose completion reached ≥12k tokens
(at/near the thinking_token_budget=12288 cap). Saturation by itself is not rumination:
a saturated output that reaches a correct PASS is productive use of the budget. The pruned
variants saturate on nearly everything; 128e almost never does — but see the next table.
| Bench (n) | 128e | v5-coder | v6-coder |
|---|---|---|---|
| LCB-medium-55 | 1 / 55 (1.8%) | 54 / 55 (98.2%) | 53 / 55 (96.4%) |
| LCB-medium-100 | 2 / 100 (2.0%) | 99 / 100 (99.0%) | 97 / 100 (97.0%) |
Rumination — long thinking that fails to PASS. The right metric is not median length (128e looks short only because it answers easy problems fast). It is the share of the model's budget-saturated outputs that still fail — i.e. tokens burned without a correct answer:
| Bench (n) | 128e | v5-coder | v6-coder |
|---|---|---|---|
| LCB-medium-55 — saturated-and-failed | 1 / 1 (100%) | 8 / 54 (15%) | 2 / 53 (3.8%) |
| LCB-medium-100 — saturated-and-failed | 2 / 2 (100%) | 9 / 99 (9%) | 4 / 97 (4.1%) |
| LCB-100 — mean completion tokens, PASS vs FAIL | 1468 vs 8368 | 12687 vs 14241 | 12616 vs 14302 |
Every 128e output that reaches the budget cap is a failure (100%), and 128e's failed problems run 5.7× longer than its passed ones (mean 8368 vs 1468 tok): 128e only thinks long when it is lost. v6-coder thinks long on ~97% of problems but 96% of those long outputs PASS — its long thinking is overwhelmingly productive, not rumination.
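The two rates used in this section, budget saturation and saturated-and-failed, can be sketched as below. The `Sample` record and its field names are hypothetical stand-ins for the per-problem rows; the 12,000-token cut follows the "≥12k tokens" definition above, and the toy samples are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    doc_id: str
    completion_tokens: int
    passed: bool


def saturation_report(samples: List[Sample], cap: int = 12_000) -> dict:
    """Count outputs at or above the cap, and how many of those still failed."""
    saturated = [s for s in samples if s.completion_tokens >= cap]
    failed = [s for s in saturated if not s.passed]
    wasted: Optional[float] = len(failed) / len(saturated) if saturated else None
    return {
        "n": len(samples),
        "saturated": len(saturated),
        "saturated_failed": len(failed),
        "saturation_rate": len(saturated) / len(samples),
        "wasted_rate": wasted,  # share of long outputs that did not PASS
    }


toy_samples = [
    Sample("a", 12_500, True),
    Sample("b", 13_000, False),
    Sample("c", 800, True),
    Sample("d", 12_100, True),
]
toy_report = saturation_report(toy_samples)
print(toy_report)
```

The saturation rate alone says how often the budget is spent; the wasted rate says how often spending it bought nothing, which is the number the rumination argument rests on.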
Key findings:
- LCB: v6-coder ruminates less than 128e, under the only definition that matters (long thinking that fails to reach a PASS). The naive read of the median says the opposite (128e's median LCB completion is ~0.6–1.2k tokens vs v6-coder's budget-capped ~12.7k), but that short median is an artifact of mixing easy and hard problems. 128e only thinks long when it is lost, and then it fails: every 128e output that saturates the budget is a failure (2/2 on LCB-100, 1/1 on LCB-55), and the single longest output across all three models (51,568 chars / 15,585 tokens on lcb/leetcode/3659) is a 128e failure, one that both v6-coder (PASS, 12,691 tok / 32,676 chars) and v5-coder solve in two-thirds the length. v6-coder, by contrast, saturates the budget on ~97% of problems but 96% of those long outputs PASS; its longest LCB-55 output is a correct answer and its max length (41–43k chars) sits below 128e's 51k. Wasted long-thinking rate: 128e 100%, v6-coder ~4%. And v6-coder beats 128e on score on both axes (96.36 / 96.00 vs 94.55 / 95.00): more correct and a lower worst-case tail.
- 128e's routing is erratic in both directions, and the targeted map repairs both. 128e fails LCB two ways: by over-thinking (3659: 51k chars, no answer) and by under-thinking (lcb/leetcode/3000: a premature wrong answer in 579 tokens; 3793: 3101 tok), on the very same problems where v6-coder spends the full ~13k-token budget and passes. This bidirectional instability is the fingerprint of an imperfect base router; the 51k outliers and the lower 128e score are the symptom. v6-coder's LCB-targeted expert map corrects 3 of the 5 problems 128e fails on LCB-100 (2 of 3 on LCB-55). The thinking_token_budget=12288 forcing-function only bounds the loop; the targeted map is what makes the bounded thinking land on a PASS instead of a 51k-char dead end. (Versus v5-coder, v6-coder is statistically the same length, median 12799 vs 12804 tok on LCB-55, yet passes far more: 53/55 vs 47/55, 96/100 vs 91/100.)
- AIME: genuinely less rumination, +10pp. Median completion 939 tok vs v5-coder's 1156, p90 char tail 8404 < 9239, scoring 63.33% vs 53.33%: shorter chains and more correct.
- GSM8K — tightest of the three (max 396 tok; 128e 1107, v5-coder 858); no rumination tail.
- IFEval — no ruminating outlier (max 7200 chars vs v5-coder's 17539; ≈128e 7425).
- GPQA — a base-model ceiling, not a prune artifact. p90 completion tokens peg at ~8189 on all three models; the hard-question rumination is inherited from Gemma 4 and is identical across the cohort, not widened by pruning.
- ARC — one shared single-sample outlier (~16266 tok / ~44k chars) on both pruned variants, absent on 128e; p50/p90 normal and identical, so not systemic length growth.
- MultiPL-E: most compact code of the three. On the bench where the answer is the code (extracted final block, no reasoning trace), v6-coder has the tightest tail (max 2169 chars vs 128e 2944 / v5-coder 3083; p90 594 chars / 184 tok) at ≈128e pass-rate (macro 88.0%).
- Net: v6-coder is more accurate than 128e on both LCB axes (96.36 / 96.00) and within 1.67pp of it on MultiPL-E (macro 88.0 vs 89.67), at a higher average inference cost (it thinks the full bounded budget) but a lower worst case (max 41–43k chars vs 128e's 51k) and a far lower wasted-thinking rate (~4% vs 100%). On the short-reasoning benches (AIME / GSM8K / IFEval) it additionally reduces length vs v5-coder. The targeted map never traded length for accuracy; it traded 128e's erratic, sometimes-catastrophic tail for a higher but bounded and productive average.
Methodology / data integrity. Per-problem lengths come from omk_eval token_stats over each bench's per-problem samples archive: lm-eval samples_*.jsonl for the 9-bench, and the native runners' lcb_result.samples.jsonl / mpe_result.samples.jsonl (mirrored into the durable sqlite resume cache) for LCB/MPE. MultiPL-E is tabulated, but its rows measure extracted-code length, not reasoning: its samples store only the final code block (no <think> trace, finish_reasons all terminal, thinking_tokens_est=0), so MPE is a code-conciseness reference, not a rumination metric, on which v6-coder is the most compact. The reasoning-bearing benches (LCB / GPQA / AIME / MATH) carry the rumination signal; MPE tokens there are tokenizer-estimated (the code-extraction path returns no server token count). thinking_tokens_est is 0 across all benches under llama.cpp --reasoning-format deepseek (reasoning is emitted inline, not as a separable block), so the completion-token columns are the length signal. The LCB/MPE token_stats are computed over per-problem rows keyed on doc_id (the native runners now emit it, fixed 2026-05-25); earlier card revisions briefly collapsed these to n=1. All numbers here are at full n (LCB-55 n=55, LCB-100 n=100, MultiPL-E n=300).
Reproduce
# 1. drop map (omnimergekit)
python scripts/generate_drop_map_multiclass.py \
--data scripts/expert_neuron_v5_code_v3.json --target 98 \
--protect-top 16 --alpha 2.0 --strategy max --normalize rank \
--breadth-bonus 0.5 --v4-floor-map scripts/v4_layer_floor_map.json \
--baseline-drop-map scripts/teacher_force_98e_p16_clean.json \
--outlier-mode median --outlier-wnorm-thresh 1e4 \
--class-weights 1 1 3 1 1 0 0 2 \
--output scripts/v6coder_C6v3lcb_drop_map.json
# 2. prune + mandatory shared upweight (see scripts/build_v6coder_C6v3lcb.sh)
python scripts/expert_drop.py --source-dir google/gemma-4-26B-A4B-it \
--drop-map scripts/v6coder_C6v3lcb_drop_map.json --suffix=-v6-coder-C6v3lcb-it
python scripts/router_shared_upweight.py \
--model-dir google/gemma-4-A4B-98e-v6-coder-C6v3lcb-it \
--alpha 1.2 --target mlp.down_proj.weight
Lineage & provenance
- Base: google/gemma-4-26B-A4B-it (128e, fresh prune, not derived from v4/v5 weights).
- Cohort: v5-coder (C6 _fixed) → C6v3 (C6 v3-data) → C6v3lcb (this, LCB-targeted).
- Tooling: omnimergekit (expert_drop.py, generate_drop_map_multiclass.py, router_shared_upweight.py); eval via omk_eval.py + EVAL_PROTOCOL.
- Eval sampler is frozen GREEDY across the whole cohort for apples-to-apples comparison.
License: Gemma Terms of Use. This is a derivative of Gemma 4 and inherits those terms.