Laguna-XS.2 → Dense — Smoke-Test Report (reconstruction pretraining)

First end-to-end validation of the dense reconstruction-pretraining loop for densifying poolside/Laguna-XS.2 (33B/3B-active MoE) into a ~3.3B dense student (cm2435-new/laguna-xs2-dense-k8-copied-shell). Training is teacher-forced with a per-layer MSE + cosine loss, and only the routed_dense modules are trainable. Hardware: 1× H100 80 GB.

Config (smoke)


Layers trained	8 of 39 (subset, limited by H100 memory; the full 39 run on GB300 or with Adafactor)
Steps	20 · seq 1024 · batch 1
Optimizer	AdamW, lr 2e-4
Loss	`mean_l( MSE + 0.05·(1−cos) )`, attention-masked
Data	nvidia/OpenCodeInstruct (streamed)
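
The loss row above can be sketched in PyTorch as follows. This is a minimal sketch, not the run's actual code: the function name `masked_recon_loss`, the `[batch, seq, hidden]` tensor layout, and averaging over unmasked tokens before averaging over layers are all assumptions; only the formula and the 0.05 weight come from the config.

```python
import torch
import torch.nn.functional as F


def masked_recon_loss(preds, targets, attention_mask, cos_weight=0.05):
    """mean over layers of ( MSE + cos_weight * (1 - cos) ), ignoring padded tokens.

    preds, targets: lists of [batch, seq, hidden] tensors, one pair per layer.
    attention_mask: [batch, seq], 1 for real tokens and 0 for padding.
    """
    mask = attention_mask.to(preds[0].dtype)
    n_tokens = mask.sum().clamp(min=1.0)  # avoid division by zero on empty masks
    per_layer = []
    for y_hat, y in zip(preds, targets):
        mse_tok = (y_hat - y).pow(2).mean(dim=-1)               # [batch, seq]
        cos_tok = 1.0 - F.cosine_similarity(y_hat, y, dim=-1)   # [batch, seq]
        mse = (mse_tok * mask).sum() / n_tokens
        cos = (cos_tok * mask).sum() / n_tokens
        per_layer.append(mse + cos_weight * cos)
    return torch.stack(per_layer).mean()


# Toy check: orthogonal unit vectors give MSE = 1 and (1 - cos) = 1.
pred = torch.tensor([[[1.0, 0.0]]])
target = torch.tensor([[[0.0, 1.0]]])
loss = masked_recon_loss([pred], [target], torch.ones(1, 1))
print(round(loss.item(), 4))  # 1.05
```

With a 0.05 weight, a cosine-loss near 0.95 contributes roughly 0.047 to the total, so at these MSE magnitudes (~1e-3) the cosine term dominates the reported total loss.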

Result

metric	step 1	step 20	Δ
total loss	0.0486	0.0332	−32 %
cosine-loss (1−cos)	0.949	0.575 (low 0.575)	−33 %
mean MSE	~9e-4	~1.4e-3	noisy (batch 1)

TOTAL LOSS (MSE + 0.05·(1−cos))        COSINE-LOSS (1 − cos to teacher)
0.0486 ●                               0.949 ●
0.0451  ●                              0.874  ●
0.0404    ●                            0.775    ●
0.0369     ●●                          0.725     ●●●
0.0335        ●●        ●              0.655        ●●       ●
0.0311           ● ●    ●              0.575           ●
       step 1 ──────────► 20                  step 1 ──────────► 20

Read

The loop works end-to-end on real teacher and student weights. A teacher forward pass with hooks captures each MoE block's input/output pair (x_l, y_l), the student's routed_dense_l(x_l) predicts y_l, and the masked MSE+cosine loss backpropagates into routed_dense only.
The cosine-loss drop from 0.95 to 0.58 is the signal: the randomly initialized routed_dense rotates toward the teacher's output direction within 20 steps.
A starting cosine-loss of ~0.95 (cosine similarity ≈ 0.05) confirms that random init is near-orthogonal to the teacher, which motivates DO-ACP warm-start (concatenate selected experts) as the next lever.
MSE is noisy and non-monotone at batch size 1; deeper layers carry larger magnitudes (seen at scale).
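
The capture-and-reconstruct loop described above can be sketched with PyTorch forward hooks. This is a toy sketch under stated assumptions: the small MLPs stand in for the teacher's MoE blocks and the student's routed_dense modules, and `attach_capture_hooks`, `make_mlp`, and `train_step` are hypothetical names, not the run's code. Real MoE blocks may return a tuple or take keyword-only inputs, which the hook would need to handle.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def attach_capture_hooks(blocks):
    """Record the (input, output) pair of every block during a forward pass."""
    cache, handles = {}, []
    for idx, block in enumerate(blocks):
        def hook(module, inputs, output, idx=idx):
            # Assumption: if a block returns a tuple, hidden states come first.
            out = output[0] if isinstance(output, tuple) else output
            cache[idx] = (inputs[0].detach(), out.detach())
        handles.append(block.register_forward_hook(hook))
    return cache, handles


def make_mlp(hidden, inner):
    return nn.Sequential(nn.Linear(hidden, inner), nn.SiLU(), nn.Linear(inner, hidden))


torch.manual_seed(0)
hidden, n_layers = 16, 2
teacher_blocks = nn.ModuleList([make_mlp(hidden, 64) for _ in range(n_layers)])  # toy MoE stand-ins
student_blocks = nn.ModuleList([make_mlp(hidden, 32) for _ in range(n_layers)])  # toy routed_dense stand-ins
teacher_blocks.requires_grad_(False)  # teacher stays frozen

cache, handles = attach_capture_hooks(teacher_blocks)
optimizer = torch.optim.AdamW(student_blocks.parameters(), lr=2e-4)


def train_step(x, mask):
    # Teacher forward: the hooks fill `cache` with each block's (x_l, y_l).
    with torch.no_grad():
        h = x
        for block in teacher_blocks:
            h = h + block(h)  # residual stream, as in a transformer
    n_tok = mask.sum().clamp(min=1.0)
    losses = []
    for idx, student in enumerate(student_blocks):
        x_l, y_l = cache[idx]
        y_hat = student(x_l)  # teacher-forced: the student sees the teacher's input
        mse = ((y_hat - y_l).pow(2).mean(-1) * mask).sum() / n_tok
        cos = ((1.0 - F.cosine_similarity(y_hat, y_l, dim=-1)) * mask).sum() / n_tok
        losses.append(mse + 0.05 * cos)
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()  # gradients reach the student blocks only
    optimizer.step()
    return loss.item()


x = torch.randn(1, 8, hidden)   # one fixed toy batch
mask = torch.ones(1, 8)
history = [train_step(x, mask) for _ in range(20)]
for handle in handles:
    handle.remove()
```

Because every student block is fed the teacher's own x_l, errors do not compound across layers, and each layer's loss can be optimized independently of the others.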

Status

✅ Smoke validated → scaled run launched: all 39 layers, Adafactor (fits in 80 GB, using 77 GB), effective batch 2, 2000 steps (~8.2M tokens), checkpoints every 250 steps.
Artifacts: loss_curve.png, metrics.jsonl (this repo). Full report + recipe comparison in companion gists.
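
As a quick sanity check on the scaled run's token budget, tokens seen are steps × effective batch × sequence length. The helper name `tokens_seen` is hypothetical, and the scaled run's sequence length is not stated in this report; 2048 is an inference from the ~8.2M figure, since the smoke test's 1024 would give about half that.

```python
def tokens_seen(steps: int, effective_batch: int, seq_len: int) -> int:
    """Upper bound on tokens processed, assuming every sequence is full length."""
    return steps * effective_batch * seq_len


print(tokens_seen(2000, 2, 2048))  # 8192000, i.e. ~8.2M (assumed seq 2048)
print(tokens_seen(2000, 2, 1024))  # 4096000, i.e. ~4.1M (smoke-test seq 1024)
```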

Recipe: RADLADS step-1 (arXiv:2505.03005) / KRAFTON MoE→Dense feature-reconstruction (arXiv:2605.28207).

Papers for EvanOLeary/laguna-xs2-densify-smoke

Pruning and Distilling Mixture-of-Experts into Dense Language Models. arXiv:2605.28207.

RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale. arXiv:2505.03005, published May 5, 2025.