Pruning and Distilling Mixture-of-Experts into Dense Language Models
Paper: arXiv:2605.28207
First end-to-end validation of the dense reconstruction-pretraining loop for densifying
poolside/Laguna-XS.2 (33B/3B-active MoE) into a ~3.3B dense student
(cm2435-new/laguna-xs2-dense-k8-copied-shell). Training is teacher-forced with a per-layer
MSE + cosine loss, and only the routed_dense modules are trainable. Hardware: 1× H100 80 GB.

| Setting | Value |
|---|---|
| Layers trained | 8 (a subset, due to H100 memory; the full 39 run on GB300 or via Adafactor) |
| Steps | 20 · seq 1024 · batch 1 |
| Optimizer | AdamW @ 2e-4 |
| Loss | mean_l( MSE + 0.05·(1−cos) ), attention-masked |
| Data | nvidia/OpenCodeInstruct (streamed) |

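
Written out, the loss row in the table above can be read as the following formula. This is a sketch under assumptions, not the exact normalization used in the run: $S$ is the set of trained layers, $m_t \in \{0, 1\}$ is the attention mask at token $t$, $\hat{y}_{l,t}$ and $y_{l,t}$ are the student and teacher outputs of layer $l$, and $d$ is the hidden size. The card itself only states the mean over layers, the 0.05 cosine weight, and the masking.

$$
\mathcal{L} = \frac{1}{|S|} \sum_{l \in S} \Big[ \mathrm{MSE}_l + 0.05\,(1 - \cos_l) \Big]
$$

$$
\mathrm{MSE}_l = \frac{\sum_t m_t \, \tfrac{1}{d} \lVert \hat{y}_{l,t} - y_{l,t} \rVert_2^2}{\sum_t m_t},
\qquad
\cos_l = \frac{1}{\sum_t m_t} \sum_t m_t \, \frac{\langle \hat{y}_{l,t},\, y_{l,t} \rangle}{\lVert \hat{y}_{l,t} \rVert_2 \, \lVert y_{l,t} \rVert_2}
$$

With a weight of 0.05, a cosine loss near 0.9 contributes about 0.045 to the total, so the cosine term dominates an MSE on the order of 1e-3.
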
| metric | step 1 | step 20 | Δ |
|---|---|---|---|
| total loss | 0.0486 | 0.0332 | −32 % |
| cosine-loss (1−cos) | 0.949 | 0.575 (low 0.575) | −33 % |
| mean MSE | ~9e-4 | ~1.4e-3 | noisy (batch 1) |

Figure: loss curves over steps 1 to 20. Left panel: total loss (MSE + 0.05·cos), y-axis from 0.0486 down to 0.0311. Right panel: cosine loss (1 − cos to teacher), y-axis from 0.949 down to 0.575.

Mechanism: for each trained layer l, the teacher provides an input/output pair (x_l, y_l); the student's routed_dense_l(x_l) predicts y_l, and the masked MSE + cosine loss backpropagates into routed_dense only.

Result: routed_dense rotates toward the teacher's output direction within 20 steps.

Artifacts: loss_curve.png, metrics.jsonl (this repo). Full report and recipe comparison in companion gists.

Recipe: RADLADS step-1 (arXiv:2505.03005) / KRAFTON MoE→Dense feature-reconstruction (arXiv:2605.28207).
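
The teacher-forced, per-layer reconstruction step described in this card can be sketched in PyTorch as below. This is a minimal illustration under stated assumptions, not the training script from this repo: `ToyStudentLayer`, its `shared` submodule, `masked_recon_loss`, `freeze_all_but_routed_dense`, and the random tensors standing in for the teacher's (x_l, y_l) captures are all hypothetical. Only the name `routed_dense`, the 0.05 cosine weight, AdamW at 2e-4, and the 20 steps come from the card.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def masked_recon_loss(pred, target, mask, cos_weight=0.05):
    """MSE + cos_weight * (1 - cos), averaged over unmasked tokens.

    pred, target: (batch, seq, hidden); mask: (batch, seq), 1 for real tokens.
    """
    m = mask.to(pred.dtype)
    n = m.sum().clamp(min=1.0)  # avoid division by zero on an all-padding batch
    mse = ((pred - target).pow(2).mean(dim=-1) * m).sum() / n
    cos = F.cosine_similarity(pred, target, dim=-1)
    cos_loss = ((1.0 - cos) * m).sum() / n
    return mse + cos_weight * cos_loss


class ToyStudentLayer(nn.Module):
    """Hypothetical stand-in for one student layer (not the real architecture)."""

    def __init__(self, hidden=16, inner=32):
        super().__init__()
        self.shared = nn.Linear(hidden, hidden)  # stands in for copied, frozen weights
        self.routed_dense = nn.Sequential(       # the only trainable part
            nn.Linear(hidden, inner), nn.SiLU(), nn.Linear(inner, hidden)
        )


def freeze_all_but_routed_dense(layers):
    """Set requires_grad so that only routed_dense parameters are trainable."""
    trainable = []
    for layer in layers:
        for name, p in layer.named_parameters():
            p.requires_grad = name.startswith("routed_dense")
            if p.requires_grad:
                trainable.append(p)
    return trainable


torch.manual_seed(0)
layers = nn.ModuleList([ToyStudentLayer() for _ in range(2)])
opt = torch.optim.AdamW(freeze_all_but_routed_dense(layers), lr=2e-4)

# Random placeholders for the teacher captures: one (x_l, y_l) pair per layer.
B, T, H = 1, 8, 16
xs = [torch.randn(B, T, H) for _ in layers]
ys = [torch.randn(B, T, H) for _ in layers]
mask = torch.ones(B, T)
mask[:, -2:] = 0  # pretend the last two positions are padding

frozen_before = [layer.shared.weight.clone() for layer in layers]
losses = []
for step in range(20):
    opt.zero_grad()
    # Teacher forcing: each layer is fed the teacher's x_l,
    # never the student's own upstream output.
    per_layer = [
        masked_recon_loss(layer.routed_dense(x), y, mask)
        for layer, x, y in zip(layers, xs, ys)
    ]
    loss = torch.stack(per_layer).mean()  # mean over trained layers
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

Because every layer receives the teacher's input rather than the previous student layer's output, the per-layer losses are independent of each other, which is what allows training only a subset of layers when memory is tight.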