Laguna-XS.2 → Dense: Smoke-Test Report (reconstruction pretraining)

First end-to-end validation of the dense reconstruction-pretraining loop for densifying poolside/Laguna-XS.2 (33B/3B-active MoE) into a ~3.3B dense student (cm2435-new/laguna-xs2-dense-k8-copied-shell). Training is teacher-forced with a per-layer MSE + cosine loss, and only the routed_dense modules are trainable. Hardware: 1× H100 80 GB.

Figure: loss curve (loss_curve.png)

Config (smoke)

Layers trained   8 (subset, limited by H100 memory; the full 39 run on GB300 or with Adafactor)
Steps            20 · seq 1024 · batch 1
Optimizer        AdamW @ 2e-4
Loss             mean_l( MSE + 0.05·(1−cos) ), attention-masked
Data             nvidia/OpenCodeInstruct (streamed)
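
The loss line above can be read as follows. This is a minimal NumPy sketch, not the training code itself: it assumes the cosine is taken per token along the hidden dimension, that both terms are averaged over unmasked tokens only, and that the result is then averaged over layers. The function names `layer_loss` and `total_loss` are illustrative.

```python
import numpy as np

def layer_loss(pred, target, mask, cos_weight=0.05, eps=1e-8):
    """Masked MSE + cos_weight * (1 - cos) for a single layer.

    pred, target: arrays of shape (batch, seq, hidden)
    mask:         array of shape (batch, seq), 1 for real tokens, 0 for padding
    (Assumed reduction: per-token values averaged over unmasked tokens.)
    """
    mask = mask.astype(pred.dtype)
    n_tokens = max(mask.sum(), 1.0)

    # Per-token MSE, averaged over the hidden dimension.
    mse_tok = ((pred - target) ** 2).mean(axis=-1)

    # Per-token cosine similarity along the hidden dimension.
    dot = (pred * target).sum(axis=-1)
    norms = np.linalg.norm(pred, axis=-1) * np.linalg.norm(target, axis=-1)
    cos_tok = dot / (norms + eps)

    mse = (mse_tok * mask).sum() / n_tokens
    cos_loss = ((1.0 - cos_tok) * mask).sum() / n_tokens
    return mse + cos_weight * cos_loss, mse, cos_loss

def total_loss(preds, targets, mask, cos_weight=0.05):
    """Mean over layers of the per-layer loss (the mean_l in the config)."""
    per_layer = [layer_loss(p, t, mask, cos_weight)[0] for p, t in zip(preds, targets)]
    return float(np.mean(per_layer))
```

With a weight of 0.05 on a cosine term that starts near 1, the cosine term dominates the total early on, which matches the total loss tracking the cosine-loss in this smoke run.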

Result

metric                 step 1    step 20              Δ
total loss             0.0486    0.0332               −32 %
cosine-loss (1−cos)    0.949     0.575 (low 0.575)    −33 %
mean MSE               ~9e-4     ~1.4e-3              noisy (batch 1)

TOTAL LOSS (MSE + 0.05·(1−cos))        COSINE-LOSS (1 − cos to teacher)
0.0486 ●                               0.949 ●
0.0451  ●                              0.874  ●
0.0404    ●                            0.775    ●
0.0369     ●●                          0.725     ●●●
0.0335        ●●        ●              0.655        ●●       ●
0.0311           ● ●    ●              0.575           ●
       step 1 ──────────► 20                  step 1 ──────────► 20

Read

  • The loop works end-to-end on real teacher+student weights: teacher forward + hooks capture each MoE block's (x_l, y_l), student routed_dense_l(x_l) predicts, masked MSE+cosine backprops into routed_dense only.
  • Cosine-loss 0.95 → 0.58 is the signal: the randomly initialized routed_dense rotates toward the teacher's output direction within 20 steps.
  • A cosine-loss starting at ~0.95 confirms that random init is near-orthogonal to the teacher → motivates DO-ACP warm-start (concatenate selected experts) as the next lever.
  • MSE is noisy/non-monotone at batch 1; deeper layers carry larger magnitudes (seen at scale).
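
To make the "concatenate selected experts" idea concrete, here is a hedged NumPy sketch. It assumes each expert is a gated MLP of the form `(silu(x @ w_gate) * (x @ w_up)) @ w_down`, which is an assumption about the architecture, and it is not the DO-ACP procedure itself: how experts are selected and weighted is not covered here. All shapes and the `selected` indices are hypothetical.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def gated_mlp(x, w_gate, w_up, w_down):
    """Gated MLP for x of shape (tokens, d_model)."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def concat_experts(experts, selected):
    """Stack the selected experts along the intermediate (d_ff) dimension.

    experts:  list of (w_gate, w_up, w_down) with shapes
              (d_model, d_ff), (d_model, d_ff), (d_ff, d_model)
    selected: indices of the experts to copy into the dense MLP
    """
    w_gate = np.concatenate([experts[i][0] for i in selected], axis=1)
    w_up = np.concatenate([experts[i][1] for i in selected], axis=1)
    w_down = np.concatenate([experts[i][2] for i in selected], axis=0)
    return w_gate, w_up, w_down

# Hypothetical toy sizes, chosen only for the demonstration.
rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 16, 32, 6
experts = [
    (rng.standard_normal((d_model, d_ff)),
     rng.standard_normal((d_model, d_ff)),
     rng.standard_normal((d_ff, d_model)))
    for _ in range(n_experts)
]
selected = [0, 2, 5]
x = rng.standard_normal((4, d_model))

y_dense = gated_mlp(x, *concat_experts(experts, selected))
y_sum = sum(gated_mlp(x, *experts[i]) for i in selected)
print(np.allclose(y_dense, y_sum))
```

The concatenated MLP computes the unweighted sum of the selected experts' outputs. The teacher's MoE block applies router weights and picks experts per token, so a warm start like this only approximates the teacher, but it should begin far closer to the teacher's output direction than a random init.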

Status

  • ✅ Smoke validated → scaled run launched: all 39 layers, Adafactor (fits 80 GB at 77 GB), eff-batch 2, 2000 steps (~8.2M tokens), checkpoints every 250.
  • Artifacts: loss_curve.png, metrics.jsonl (this repo). Full report + recipe comparison in companion gists.
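
One plausible reason Adafactor fits where AdamW did not is optimizer-state size. The sketch below is a rough back-of-the-envelope estimate, not a measurement from this run: it assumes fp32 optimizer state, AdamW keeping two moment tensors per parameter, and Adafactor in its factored form without first-moment momentum, keeping one row vector and one column vector per weight matrix. The matrix shapes are hypothetical placeholders, not the real Laguna dimensions.

```python
def adamw_state_bytes(shapes, bytes_per_value=4):
    """AdamW: two moment tensors (first and second moment) per parameter."""
    return sum(2 * rows * cols * bytes_per_value for rows, cols in shapes)

def adafactor_state_bytes(shapes, bytes_per_value=4):
    """Factored Adafactor: one row vector and one column vector per matrix."""
    return sum((rows + cols) * bytes_per_value for rows, cols in shapes)

# Hypothetical: 39 layers, three trainable matrices each, placeholder shapes.
shapes = [(2048, 8192)] * 3 * 39

gib = 1024 ** 3
print(f"AdamW state:     {adamw_state_bytes(shapes) / gib:.2f} GiB")
print(f"Adafactor state: {adafactor_state_bytes(shapes) / gib:.4f} GiB")
```

Under these assumptions AdamW state scales with the full parameter count while factored Adafactor state scales with the sum of matrix dimensions, so the gap widens as more layers become trainable.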

Recipe: RADLADS step-1 (arXiv:2505.03005) / KRAFTON MoE→Dense feature-reconstruction (arXiv:2605.28207).
