Laguna-XS.2-dense (Stage 1)

This is the Stage-1 init for a dense distillation of poolside/Laguna-XS.2 (33B MoE, ≈3B active) into a ≈3B dense model. Each of the 39 sparse MoE blocks is replaced by a single dense SwiGLU FFN (intermediate size 4608). The FFNs are trained per layer, in parallel, each to match the output of its teacher MoE block (RADLADS-style). Every layer receives teacher-fed inputs, so no cross-layer error compounds during training. Training used ≈90M tokens.
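
The per-layer objective described above can be sketched roughly as follows. This is a toy illustration, not the project's training code: the names `DenseSwiGLU`, `distill_step`, and `teacher_block` are hypothetical, the MSE loss and AdamW settings are assumptions (the card does not state them), and a second small FFN stands in for a frozen teacher MoE block. The point it shows is that the student layer is always fed the teacher's input and regresses onto the teacher block's output, so each layer can be trained independently of the others.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseSwiGLU(nn.Module):
    """Dense SwiGLU FFN: down(silu(gate(x)) * up(x))."""

    def __init__(self, hidden: int, intermediate: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, intermediate, bias=False)
        self.up_proj = nn.Linear(hidden, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, hidden, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


def distill_step(student, optimizer, teacher_in, teacher_out):
    """One update for one layer, using teacher-fed inputs and targets."""
    # Assumption: plain MSE between student and teacher block outputs.
    loss = F.mse_loss(student(teacher_in), teacher_out)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
hidden, intermediate = 16, 36            # toy sizes; the card's FFN uses intermediate 4608
teacher_block = DenseSwiGLU(hidden, 64)  # hypothetical stand-in for a frozen MoE block
student = DenseSwiGLU(hidden, intermediate)
opt = torch.optim.AdamW(student.parameters(), lr=1e-2)

losses = []
for _ in range(200):
    x = torch.randn(32, hidden)          # stand-in for teacher hidden states entering the block
    with torch.no_grad():
        y = teacher_block(x)             # teacher block output is the regression target
    losses.append(distill_step(student, opt, x, y))
```

Because no layer ever sees another student layer's output during this stage, the 39 layers could in principle be trained in any order or all at once.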

⚠️ Intermediate research artifact. This is the rough init: because each layer was trained in isolation on teacher inputs, the errors that compound once the student layers are stacked are deliberately left uncorrected here (that is Stage 2's job). Held-out perplexity is ≈25 (teacher ≈4.4) and HumanEval pass@1 is 0.0%. Use laguna-xs2-dense-stage2 (KD-recovered) as the more capable checkpoint.

Loading

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "poolside-laguna-hackathon/laguna-xs2-dense-stage1"
# trust_remote_code=True lets transformers run the modeling code referenced by the repo.
m = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True,
        torch_dtype=torch.bfloat16, device_map="cuda")
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)

So that the stock modeling_laguna.py can load the checkpoint, the dense FFNs (intermediate 4608) are zero-padded to 8192; the padded model is numerically identical to the unpadded one. As a result, the exported checkpoint reports ≈3.8B parameters, while the true model has ≈3.0B. last.pt (the raw Stage-1 FFN weights) is also in this repo. Footprint: ≈6 GB in bf16 vs ≈67 GB for the 33B MoE (≈11× less weight VRAM).
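
To see why zero-padding leaves the outputs unchanged, here is a small pure-Python sketch on toy sizes. The helper names (`swiglu_ffn`, `zero_pad_ffn`) and the row-major weight layout are illustrative assumptions, not the repo's export code. A padded unit has zero gate and up weights, so its activation is silu(0) · 0 = 0, and its zero column in the down projection contributes nothing to the output.

```python
import math
import random


def silu(v):
    return v / (1.0 + math.exp(-v))


def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def swiglu_ffn(x, W_gate, W_up, W_down):
    # W_gate, W_up: [intermediate][hidden]; W_down: [hidden][intermediate]
    gate = matvec(W_gate, x)
    up = matvec(W_up, x)
    act = [silu(g) * u for g, u in zip(gate, up)]
    return matvec(W_down, act)


def zero_pad_ffn(W_gate, W_up, W_down, new_inter):
    # Append zero rows to gate/up and zero columns to down.
    hidden = len(W_gate[0])
    extra = new_inter - len(W_gate)
    zero_rows = [[0.0] * hidden for _ in range(extra)]
    return (W_gate + zero_rows,
            W_up + [row[:] for row in zero_rows],
            [row + [0.0] * extra for row in W_down])


random.seed(0)
hidden, inter, padded_inter = 4, 6, 10   # toy stand-ins for the card's 4608 -> 8192
rand = lambda r, c: [[random.uniform(-1, 1) for _ in range(c)] for _ in range(r)]
W_gate, W_up, W_down = rand(inter, hidden), rand(inter, hidden), rand(hidden, inter)
Wg_p, Wu_p, Wd_p = zero_pad_ffn(W_gate, W_up, W_down, padded_inter)

x = [random.uniform(-1, 1) for _ in range(hidden)]
out_small = swiglu_ffn(x, W_gate, W_up, W_down)
out_padded = swiglu_ffn(x, Wg_p, Wu_p, Wd_p)
print(out_small == out_padded)
```

The padding only adds parameters that are exactly zero, which is consistent with the exported checkpoint reporting more parameters than the true model has.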

See the Stage-2 card for the full method, results, and next steps. Code: https://github.com/postscarcity-inc/laguna-xs.2-dense
