Laguna-XS.2-dense

A ≈3B dense model distilled from poolside/Laguna-XS.2, a 33B Mixture-of-Experts coding model with a ≈3B active path. We replace each MoE feed-forward layer with a single dense FFN of the same size as its active path (8 routed + 1 shared expert), turning the ≈3B of active compute into a genuine ≈3B dense model that keeps XS.2's attention.

⚠️ Research / hackathon artifact, heavily under-trained. This checkpoint was produced in a time-boxed hackathon with a tiny distillation budget (≈22M assistant tokens) and is not production-ready. After switching to chat-format KD, however, it produces coherent, runnable code and scores 6.7% on HumanEval (up from 0%); see Results. It demonstrates the method and is a starting point for longer distillation.

Method

Two stages, both distilling from the frozen FP8 XS.2 teacher:

  1. Stage 1: per-layer MoE→dense init (RADLADS-style). Each of the 39 sparse MoE blocks is replaced by a dense SwiGLU FFN (intermediate size 4608). The 39 FFNs are trained independently and in parallel, each to match its teacher MoE block's output (NMSE on the residual contribution) while being fed the teacher's own hidden states, so errors do not compound during training. ≈90M tokens. Result: a dense init with held-out perplexity ≈25 (vs. teacher ≈4.4). It is functional but rough, because the error that accumulates across layers once the FFNs are stitched together is left uncorrected by design.
  2. Stage 2: synchronous logit-KD (this model). The stitched ≈3B dense student is trained end-to-end against the FP8 teacher's full-vocab logits (forward KL). Two variants:
    • Raw-text KD (initial): 50/50 code + general raw text, packed. ≈14M tokens, KL 2.5 → 1.40. This destroyed the instruct behavior (see Diagnosis): 0.6% on HumanEval as raw completion, 0.0% with the chat template.
    • Chat-format KD (the fix, separate repo): coding instruction→response examples rendered through the model's native chat template (special tokens, EOS-terminated turn), with the KL loss masked to the assistant tokens so the student learns to answer and stop. ≈22M assistant tokens on Magicoder-Evol-Instruct, one H100, KL → 0.87. Recovered coherent code and 6.7% HumanEval.

Data: raw-text KD used 50% DCLM + 50% StarCoder2/the-stack-v2-train; chat-format KD used Magicoder-Evol-Instruct-110K.

Stage-1 per-layer NMSE (all 39 dense FFNs converging against their MoE-block targets):

[Figure: Stage-1 per-layer NMSE curves]
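The card does not spell out how the Stage-1 NMSE is normalized. The sketch below assumes one common definition: squared error divided by the energy of the teacher target. The function name `nmse` and the toy vectors are hypothetical, not taken from the training code.

```python
def nmse(student_out, teacher_out, eps=1e-8):
    """Normalized MSE between two equally shaped batches of vectors.

    Assumed definition: sum((s - t)^2) / (sum(t^2) + eps).
    """
    num = 0.0
    den = 0.0
    for s_vec, t_vec in zip(student_out, teacher_out):
        for s, t in zip(s_vec, t_vec):
            num += (s - t) ** 2
            den += t ** 2
    return num / (den + eps)


# Toy "MoE block outputs" for two tokens with hidden size 3.
teacher = [[1.0, -2.0, 0.5], [0.0, 1.5, -1.0]]
perfect = [row[:] for row in teacher]
zeros = [[0.0] * 3 for _ in teacher]

print(nmse(perfect, teacher))  # 0.0: the dense FFN reproduces the block exactly
print(nmse(zeros, teacher))    # close to 1.0: the dense FFN predicts nothing
```

Under this definition, a value near 1 means the dense FFN explains almost none of the block's residual contribution, and a value near 0 means it matches the block closely, regardless of how large that layer's activations are.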

Stage-2 KD loss (forward-KL teacher‖student over training, 2.5 → 1.40):

[Figure: Stage-2 KD KL loss]
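For readers unfamiliar with the quantity being plotted, here is a minimal pure-Python sketch of a per-token forward KL, KL(teacher‖student), over full-vocab logits. It is illustrative only: the real training loop presumably runs on GPU tensors, and the function names and optional `mask` argument are assumptions made for this example.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def forward_kl(teacher_logits, student_logits, mask=None):
    """Mean KL(teacher || student) over the positions where mask == 1.

    teacher_logits / student_logits: one list of logits per token position.
    mask: optional 0/1 list; None means every position counts.
    """
    if mask is None:
        mask = [1] * len(teacher_logits)
    total, count = 0.0, 0
    for t_row, s_row, keep in zip(teacher_logits, student_logits, mask):
        if not keep:
            continue
        t_logp = log_softmax(t_row)
        s_logp = log_softmax(s_row)
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
        count += 1
    return total / max(count, 1)


teacher = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
student = [[2.0, 0.0, -1.0], [3.0, 0.0, 0.0]]

print(forward_kl(teacher, teacher))          # 0.0 when the student matches
print(forward_kl(teacher, student))          # > 0: position 1 disagrees
print(forward_kl(teacher, student, [1, 0]))  # 0.0: the mismatch is masked out
```

The `mask` argument is what lets the same loss serve both variants: packed raw text would count every position, while chat-format data would count only assistant positions.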

Architecture / loading note

The dense FFNs have intermediate size 4608 (layers 1–39) and 8192 (layer 0). To get a uniform config that loads with the stock modeling_laguna.py, the 4608-wide FFNs are zero-padded to 8192, which is numerically identical because silu(0)·0 = 0. The exported checkpoint therefore reports ≈3.8B params (padded); the true model is ≈3.0B.
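The padding argument can be checked on a toy example. The sketch below assumes a bias-free SwiGLU layout with gate, up and down projections. The weight names, the helper `zero_pad`, and the tiny sizes (3 padded to 5, standing in for 4608 padded to 8192) are illustrative and not taken from modeling_laguna.py.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """down( silu(gate(x)) * up(x) ), without biases."""
    gate = matvec(w_gate, x)
    up = matvec(w_up, x)
    hidden = [silu(g) * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


def zero_pad(w_gate, w_up, w_down, new_inter):
    """Grow the intermediate width to new_inter with all-zero units."""
    extra = new_inter - len(w_gate)
    d_model = len(w_gate[0])
    pad_rows = [[0.0] * d_model for _ in range(extra)]
    return (
        w_gate + pad_rows,
        w_up + [row[:] for row in pad_rows],
        [row + [0.0] * extra for row in w_down],
    )


# Toy sizes: d_model = 2, intermediate 3 padded to 5.
w_gate = [[0.3, -0.1], [0.2, 0.4], [-0.5, 0.1]]
w_up = [[0.1, 0.2], [-0.3, 0.5], [0.7, -0.2]]
w_down = [[0.4, -0.6, 0.2], [0.1, 0.3, -0.5]]
x = [1.5, -2.0]

small = swiglu_ffn(x, w_gate, w_up, w_down)
padded = swiglu_ffn(x, *zero_pad(w_gate, w_up, w_down, 5))
print(small)
print(padded)  # same values: each padded unit contributes silu(0) * 0 = 0
```

Each padded unit is zero three times over: its gate pre-activation is 0, its up projection is 0, and its column in the down projection is 0. The output is therefore unchanged, and only the parameter count grows.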

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "poolside-laguna-hackathon/laguna-xs2-dense-stage2"

# trust_remote_code=True lets transformers load the custom Laguna modeling code.
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="cuda")
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)

# Raw-completion prompt (no chat template); decode up to 64 new tokens.
inputs = tok("def add(a, b):\n    ", return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(output_ids[0], skip_special_tokens=True))

(student_last.pt, the raw training state_dict, is also in this repo.)

Results

HumanEval pass@1 (greedy, evalplus), base / plus:

| Model | Params (resident) | PPL | HumanEval (raw completion) | HumanEval (chat template) |
| --- | --- | --- | --- | --- |
| Teacher (Laguna XS.2, fp8) | 33B (3B active) | ≈4.4 | - | 88.4% / 84.8% |
| Stage-1 dense | ≈3B | ≈25 | 0.0% | 0.0% |
| Stage-2 dense, raw-text KD | ≈3B | - | 0.6% | 0.0% |
| Stage-2 dense, chat-format KD | ≈3B | - | - | 6.7% / 6.1% |
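To put the percentages in perspective, they can be converted back to problem counts. The sketch below assumes every run used the full 164-problem HumanEval / HumanEval+ set, which the card does not state explicitly. The helper `solved` is hypothetical.

```python
HUMANEVAL_PROBLEMS = 164  # size of the standard HumanEval / HumanEval+ set


def solved(pass_at_1_percent, n=HUMANEVAL_PROBLEMS):
    """Nearest whole number of solved problems for a reported pass@1."""
    return round(pass_at_1_percent / 100 * n)


reported = {
    "teacher, base": 88.4,
    "teacher, plus": 84.8,
    "raw-text KD, raw completion": 0.6,
    "chat-format KD, base": 6.7,
    "chat-format KD, plus": 6.1,
}

for name, pct in reported.items():
    k = solved(pct)
    back = 100 * k / HUMANEVAL_PROBLEMS
    print(f"{name}: {pct}% ~ {k}/{HUMANEVAL_PROBLEMS} = {back:.1f}%")
```

Under that assumption, 6.7% / 6.1% corresponds to 11 and 10 solved problems, and the raw-text run's 0.6% to a single problem. Differences of a problem or two are therefore within noise at this scale.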

Headline: switching the Stage-2 distillation from raw text to the model's native chat format took the dense model from 0% to 6.7% pass@1, its first non-trivial coding ability, on ≈22M assistant tokens. The chat-format checkpoint lives in its own repo.

Sample generation (chat-format KD)

Prompt: "Write a Python function is_prime(n) that returns True if n is prime." The chat-KD model returns a correct, documented implementation (and stops):

def is_prime(n):
    """Return True if n is a prime number, False otherwise."""
    if n < 2:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True

The raw-text KD model, by contrast, emitted control-token spam (</think>…</assistant>) and never produced runnable code — hence its 0%.

Diagnosis & what fixed it

Symptom: the raw-text-KD dense model scored essentially 0% on HumanEval in both eval formats (0.6% as raw completion, 0.0% with the chat template), while the teacher scores a normal 88.4% with its chat template. The harness was therefore sound; the model was genuinely broken.

Root cause: XS.2 is an instruct/agentic model (chat template, special tokens, EOS-terminated turns), but we first distilled it on raw, concatenated pretraining text (DCLM + code, packed). That pushed the student off its instruct distribution, and it never learned to stop, so generations degenerated (Stage-1: "return n;" repeated over and over; raw-KD Stage-2 in chat mode: control-token spam </think>…</assistant>).

Fix (applied, and it worked): distill in the model's native chat format — coding instruction→response conversations rendered through chat_template.jinja, EOS-terminated, with the KL loss masked to the assistant tokens. Same KD loss/loop; only the data + tokenization changed. Result: coherent code and 6.7% HumanEval, up from 0%. The lesson: distilling an instruct model requires chat-format, EOS-terminated data — more raw tokens would not have fixed it.
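The masking idea can be sketched without the real tokenizer. In the example below, role headers such as `<user>` and the `<eos>` string are placeholders, not the actual special tokens from chat_template.jinja. `build_example` is a hypothetical helper that shows which positions would receive loss.

```python
def build_example(turns, eos="<eos>"):
    """Flatten (role, tokens) turns into tokens plus a 0/1 loss mask.

    Only assistant tokens and the EOS that ends the assistant turn get
    mask 1. Role headers like "<user>" are placeholders, not the real
    special tokens from chat_template.jinja.
    """
    tokens, mask = [], []
    for role, content in turns:
        header = [f"<{role}>"]
        tokens += header
        mask += [0] * len(header)
        tokens += content
        is_assistant = role == "assistant"
        mask += [int(is_assistant)] * len(content)
        if is_assistant:
            tokens.append(eos)  # supervise stopping as well as answering
            mask.append(1)
    return tokens, mask


turns = [
    ("user", ["Write", "add", "(a,b)"]),
    ("assistant", ["def", "add", "(a,b):", "return", "a+b"]),
]
tokens, mask = build_example(turns)
for tok, m in zip(tokens, mask):
    print(m, tok)
```

Including the EOS position in the mask is the detail that teaches the student to stop. In a real next-token setup, the mask would additionally be shifted by one position to line up with the prediction targets.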

What this release demonstrates: (1) the MoE→dense architecture works (the per-layer init converges; see the NMSE curves); (2) weight VRAM shrinks ≈11× at matched active compute (see VRAM / footprint); and (3) chat-format KD recovers usable instruct behavior. Closing the remaining gap to the teacher is a matter of more chat-format KD tokens (we used ≈22M; recovery budgets are typically 250M–4B).

VRAM / footprint

The MoE keeps all 33B params resident even though only ≈3B are active per token; the dense model keeps only the ≈3B.

| Model | Params resident | bf16 weights | fp8 weights |
| --- | --- | --- | --- |
| XS.2 (33B MoE) | 33.4 B | ≈67 GB | ≈34 GB |
| XS.2-dense (stage_2) | ≈3.0 B | ≈6 GB | ≈3 GB |

≈11× smaller weight footprint (≈61 GB saved, bf16). Caveats: attention is unchanged, so KV-cache memory is identical to the teacher (all savings are in the weights); per-token FLOPs are ≈unchanged (the MoE was already ≈3B-active) — the win is memory/deployability, not speed. XS.2 needs an 80 GB-class GPU (or fp8 on 48 GB); the dense model fits a 16 GB consumer GPU. (The exported checkpoint here is zero-padded to ≈3.8B / 7.7 GB for stock-modeling compat; the true model is 3.0B / ≈6 GB.)
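The footprint figures follow from parameter count times bytes per parameter. The sketch below reproduces them, assuming 2 bytes per bf16 weight, 1 byte per fp8 weight, and GB meaning 10^9 bytes. It counts weights only and ignores KV cache, activations and any quantization scales.

```python
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_gb(params_billion, dtype):
    """Weight memory in GB (1e9 bytes), ignoring KV cache and activations."""
    return params_billion * BYTES_PER_PARAM[dtype]


models = {
    "XS.2 (33B MoE)": 33.4,
    "XS.2-dense (true)": 3.0,
    "XS.2-dense (padded export)": 3.8,
}

for name, params in models.items():
    print(f"{name}: bf16 {weight_gb(params, 'bf16'):.1f} GB, "
          f"fp8 {weight_gb(params, 'fp8'):.1f} GB")

ratio = models["XS.2 (33B MoE)"] / models["XS.2-dense (true)"]
saved = weight_gb(33.4, "bf16") - weight_gb(3.0, "bf16")
print(f"ratio {ratio:.1f}x, saved {saved:.1f} GB in bf16")
```

With the rounded ≈3.8B figure, the padded export comes out at 7.6 GB. The 7.7 GB quoted above presumably reflects the unrounded parameter count.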

Limitations & next steps

  • Severely under-trained. ≈22M chat-KD assistant tokens is 10–200× below typical recovery budgets (RADLADS used 250–700M; MoE→dense work ≈4B). 6.7% HumanEval is a proof-of-life, not a usable coder yet.
  • Next: extend chat-format Stage 2 substantially — more Magicoder/teacher-generated conversations, ideally with cached teacher top-K logits to remove the teacher forward from the loop (3–5× throughput), reaching 100M+ assistant tokens. Then run the paper-faithful agentic evals (SWE-bench / Terminal-Bench via Harbor).
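The card only names the idea of caching teacher top-K logits. One possible realization is sketched below, under stated assumptions: the teacher distribution is renormalized over its cached top-K entries, while the student keeps its full-vocab normalization. The function names and this choice of approximation are hypothetical, not the project's implementation.

```python
import math


def top_k_cache(teacher_logits, k):
    """Keep the k largest teacher logits and their vocab indices for one position."""
    order = sorted(range(len(teacher_logits)),
                   key=lambda i: teacher_logits[i], reverse=True)
    idx = order[:k]
    return idx, [teacher_logits[i] for i in idx]


def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def top_k_kl(cached, student_logits):
    """Approximate KL(teacher || student) on the cached top-k support.

    The teacher distribution is renormalized over its top-k entries;
    the student keeps its full-vocab normalization.
    """
    idx, t_logits = cached
    t_logp = log_softmax(t_logits)
    s_logp = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - s_logp[i]) for tl, i in zip(t_logp, idx))


teacher_logits = [4.0, 3.5, -2.0, -3.0, -5.0]
cached = top_k_cache(teacher_logits, k=2)  # done once, offline
print(cached)                              # ([0, 1], [4.0, 3.5])
print(top_k_kl(cached, teacher_logits))    # small but nonzero: tail mass was dropped
print(top_k_kl(cached, [0.0] * 5))         # larger: a uniform student is far off
```

Storing K indices and logits per position instead of a full-vocab row is what would make an offline cache practical. The cost is that the loss no longer sees the teacher's tail probabilities.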

Code: https://github.com/postscarcity-inc/laguna-xs.2-dense · Stage-1 model
