Initial release: Shard-40m-v1 (54.5M dense transformer, anneal final)
- README.md +131 -0
- code/config.py +109 -0
- code/model.py +373 -0
- code/muon.py +198 -0
- code/tokenizer.py +109 -0
- models/model.pt +3 -0
- models/pretrain.pt +3 -0
- models/tokenizer.json +0 -0
README.md
ADDED
@@ -0,0 +1,131 @@
---
license: apache-2.0
language:
- en
tags:
- small-lm
- gemma4-attention
- muon
- swiglu
- experimental
library_name: pytorch
---

# Shard-40m-v1

A 54.5M parameter dense transformer trained on consumer-grade compute (Thunder Compute pretrain + Colab anneal). Released as a research artifact and pipeline-validation reference. Not a deployable model.

This is the first checkpoint in the Shard series of small experimental transformers.

## Architecture

```
Total params: 54,538,752 (~54.5M)
Hidden dim: 512
Layers: 12
Attention heads: 8 (MHA, no GQA)
Head dim: 64
MLP intermediate: 2048 (SwiGLU)
Vocab size: 8192
Max sequence: 8192
Attention pattern: Gemma 4 alternating sliding window (window=1024) + global, last layer global
Norm: RMSNorm, pre-norm
Position encoding: RoPE on Q and K
Embeddings: tied input/output
Activation: SwiGLU
MoE: none
Engram: none
```
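
The listed shapes reproduce the parameter total exactly under tied embeddings and no Engram; a quick arithmetic check (added here for reference, not part of the original card):

```python
# Recompute the 54,538,752 total from the shapes above.
vocab, dim, layers, mlp = 8192, 512, 12, 2048

embedding       = vocab * dim    # 4,194,304 (shared with the output head, counted once)
attn_per_layer  = 4 * dim * dim  # W_q, W_k, W_v, W_o      -> 1,048,576
mlp_per_layer   = 3 * dim * mlp  # SwiGLU gate / up / down -> 3,145,728
norms_per_layer = 2 * dim        # two RMSNorms per block  ->     1,024
final_norm      = dim

total = embedding + layers * (attn_per_layer + mlp_per_layer + norms_per_layer) + final_norm
print(f"{total:,}")  # 54,538,752
```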

## Training

```
Phase 1 (pretrain):
  Compute: Thunder Compute single GPU
  Steps: 48,220 of a 100,000 step target (paused early)
  Throughput: 86,800 tokens per second
  Optimizer: Muon for hidden 2D weights, AdamW for embeddings and norms
  LR schedule: WSD (warmup-stable-decay)
  Stabilizers: lm_head logit cap 30, z-loss coefficient 1e-4

Phase 2 (anneal):
  Compute: Colab A100
  Steps: 20,000 (full anneal complete)
  Final cross-entropy: 3.27
  Mix: OpenWebMath, FineWeb-Edu carryover, NuminaMath, MetaMathQA, ArXiv, Cosmopedia, AI2 ARC
```
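
The exact warmup length and decay shape are not documented in this release; a minimal sketch of a WSD learning-rate multiplier, assuming linear warmup and a linear decay over the final 20% of steps (`wsd_decay_frac = 0.2` in `code/config.py`), looks like this:

```python
def wsd_multiplier(step: int, total_steps: int, warmup_steps: int, decay_frac: float = 0.2) -> float:
    """Warmup-stable-decay LR multiplier in [0, 1]. Sketch only: shapes assumed linear."""
    decay_start = int(total_steps * (1.0 - decay_frac))
    if step < warmup_steps:
        return step / max(1, warmup_steps)   # warmup ramp
    if step < decay_start:
        return 1.0                           # stable plateau
    return max(0.0, (total_steps - step) / max(1, total_steps - decay_start))  # final decay
```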

## Files

- `models/model.pt` — anneal final checkpoint (model state only, 105 MB bf16)
- `models/pretrain.pt` — pretrain step 47,500 (with optimizer state, 217 MB)
- `models/tokenizer.json` — custom 8192-vocab BPE
- `code/` — minimal loading code (model.py, config.py, tokenizer.py, muon.py)

## How to load

```python
import sys, torch
sys.path.insert(0, 'code')
from config import Config
from model import ToyLM
from tokenizer import load_tokenizer

ck = torch.load('models/model.pt', map_location='cpu', weights_only=False)
cfg = Config(**ck['cfg']) if isinstance(ck['cfg'], dict) else ck['cfg']
model = ToyLM(cfg).cuda().to(torch.bfloat16)
model.load_state_dict(ck['model'])
model.eval()

tok = load_tokenizer('models/tokenizer.json')
ids = torch.tensor([tok.encode('The capital of France is').ids], device='cuda')
with torch.no_grad():
    for _ in range(40):
        logits, _ = model(ids)
        nxt = logits[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, nxt], 1)
print(tok.decode(ids[0].tolist()))
```
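
`code/model.py` also ships a `ToyLM.generate` helper with top-p sampling and a repetition penalty. The benchmark table below was sampled at temperature 0.7 and top_p 0.9; a call along these lines (other knobs left at the helper's defaults, which may not match the exact benchmark settings) reproduces that setup:

```python
prompt = 'Once upon a time, in a small village,'
ids = torch.tensor([tok.encode(prompt).ids], device='cuda')
out = model.generate(ids, max_new_tokens=60, temperature=0.7, top_p=0.9)
print(tok.decode(out[0].tolist()))
```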

## Benchmark

Greedy decode runs at 47 tokens per second on a single CUDA GPU. Model footprint is 109 MB in bf16, with 16 MB peak inference memory.

Sampled outputs at temperature 0.7, top_p 0.9:

| Prompt | Output |
|---|---|
| `The capital of France is` | `"covered by the Crown" (for example, the Great Seal of France...)` |
| `To compute 12 plus 7, we can` | `now use the first 6 as a reversible input...` |
| `Question: What is 23 + 19? Answer:` | `The answer is 23. Answer: 23. Answer: 23` (loops) |
| `def fibonacci(n):` | `// Appendix A. - S. B. V. Shanker. - S. M. P. Gerber...` |
| `Once upon a time, in a small village,` | `a woman is a gentleman in a village with an infinite wealth...` |
| `Solve: 17 * 23 = ?` | `?????\n*****` (breakdown) |

## What this artifact proves

The training pipeline runs end to end on consumer-grade hardware. The Muon + AdamW dual optimizer, the WSD schedule, Gemma 4 alternating attention, and an anneal phase mixing math, code, and prose all stayed stable. Loss decreases monotonically through pretrain. No NaN events, no divergence, no rank loss flagged by the Muon min-singular-value sentinel.

## What this artifact cannot do

Math (broken, hallucinates digits or loops). Code generation (gibberish). Factual grounding (hallucinates with grammatical confidence). Long-context retrieval (max sequence 8192 with sliding window 1024 means effective context is much shorter for non-global layers).

## Why release it

To document a reproducible recipe at this scale. The next iteration in this line moves to a 412M MoE with 3 routed experts, vocabulary 262144, distillation pretraining from frontier teachers, and a token budget that crosses the Chinchilla line. This artifact is the baseline against which that next model will be measured.

## License

Apache 2.0. Use freely. Attribution appreciated but not required.

## Citation

```
@misc{shard40mv1,
  author = {Shane (Crownelius)},
  title = {Shard-40m-v1: a 54.5M dense transformer trained on consumer compute},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/CompactAI-O/Shard-40m-v1}
}
```
code/config.py
ADDED
@@ -0,0 +1,109 @@
"""Config dataclass for the toy 50M LM.

Scaled up from the toy_1m_gemma4_dsv4 baseline. Architectural levers stay the
same (alternating SLIDE/GLOBAL Gemma 4 attention, optional Muon, optional
512-slot Engram, full v2 stabilisation), only the shape numbers change.

Two architectural variants are flag-gated:

attention_pattern:
  "all_global" -- every layer is full causal attention (baseline).
  "gemma4"     -- alternating SLIDE/GLOBAL across layers; last layer is GLOBAL.

optimizer:
  "adamw" -- AdamW for everything (baseline).
  "muon"  -- Muon for params with .dim() >= 2; AdamW for embeddings + 1D.

engram_enabled: optional 512-slot external memory bank with zero-init gate.

When attention_pattern == "all_global" and optimizer == "adamw" and engram_enabled
is False, training math is bit-identical to a plain causal transformer baseline.

Defaults
--------
* vocab=8192 (up from 4096): fresh BPE on a larger FineWeb-edu sample.
* dim=512, n_layers=12, n_heads=8, head_dim=64.
* mlp_hidden=2048 (4x dim, SwiGLU).
* max_seq_len=8192 (up from 4096).
* sliding_window=1024 ("larger model" Gemma 4 tier; 1M used 512).
* All v2 stabilisers ON: lm_head_logit_cap=30.0, z_loss_weight=1e-4, lr_schedule="wsd".
"""
from __future__ import annotations

from dataclasses import dataclass
from typing import Literal


AttentionPattern = Literal["all_global", "gemma4"]
OptimizerName = Literal["adamw", "muon"]
LRSchedule = Literal["cosine", "wsd"]


@dataclass
class Config:
    # ---------- model shape ----------
    vocab_size: int = 8192
    dim: int = 512
    n_layers: int = 12
    n_heads: int = 8
    head_dim: int = 64  # n_heads * head_dim must equal dim
    mlp_hidden: int = 2048
    max_seq_len: int = 8192

    # ---------- gemma4 SWA ----------
    attention_pattern: AttentionPattern = "gemma4"
    sliding_window: int = 1024

    # ---------- engram (off by default) ----------
    engram_enabled: bool = False
    engram_slots: int = 512
    engram_inject_layer: int = 6  # mid-stack for the 12-layer build

    # ---------- training ----------
    optimizer: OptimizerName = "muon"
    rope_base: float = 10000.0
    norm_eps: float = 1e-5
    dropout: float = 0.0
    tie_embeddings: bool = True

    # ---------- CE stabilisation (Gemma-2 logit cap + PaLM z-loss) ----------
    # ON by default at 50M scale -- the 1M project added these as a v2 bolt-on
    # but at 50M with bf16 they're standard practice (DeepSeek V2/3, Gemma 2/3,
    # PaLM). Bit-identical to the un-stabilised path when both knobs are 0/None.
    lm_head_logit_cap: float | None = 30.0
    z_loss_weight: float = 1e-4

    # ---------- LR schedule ----------
    # WSD by default at 50M (per Apr 2026 small-LM research; lets the head
    # decay over the last 20 % of post-warmup, much smoother than cosine).
    lr_schedule: LRSchedule = "wsd"
    wsd_decay_frac: float = 0.2

    # ---------- bookkeeping ----------
    init_std: float = 0.02

    def __post_init__(self) -> None:
        assert self.n_heads * self.head_dim == self.dim, (
            f"n_heads*head_dim={self.n_heads * self.head_dim} != dim={self.dim}"
        )
        assert self.attention_pattern in ("all_global", "gemma4")
        assert self.optimizer in ("adamw", "muon")
        assert self.lr_schedule in ("cosine", "wsd")
        assert 0.0 <= self.wsd_decay_frac <= 1.0
        assert self.z_loss_weight >= 0.0
        assert self.lm_head_logit_cap is None or self.lm_head_logit_cap > 0
        # Last layer must be GLOBAL when using gemma4 (canonical invariant).
        # Concretely: layer i is GLOBAL iff (i % 2 == 1) for i in [0, n_layers).
        # n_layers must be even, last index n_layers-1 must be odd.
        if self.attention_pattern == "gemma4":
            assert self.n_layers % 2 == 0 and self.n_layers >= 2, (
                "gemma4 pattern requires even n_layers >= 2 so the last layer is GLOBAL"
            )

    def attention_kind(self, layer_idx: int) -> Literal["slide", "global"]:
        """Return whether `layer_idx` is a sliding-window or global-attention layer."""
        if self.attention_pattern == "all_global":
            return "global"
        # gemma4: even idx = SLIDE, odd idx = GLOBAL. Last layer (n_layers-1) is odd
        # for any even n_layers, so it is GLOBAL.
        return "global" if (layer_idx % 2 == 1) else "slide"
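
# Example (added for illustration, not part of the uploaded file): the default
# 12-layer "gemma4" config alternates sliding-window and global layers, ending global.
from config import Config

cfg = Config()
print([cfg.attention_kind(i) for i in range(cfg.n_layers)])
# ['slide', 'global', 'slide', 'global', 'slide', 'global',
#  'slide', 'global', 'slide', 'global', 'slide', 'global']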
code/model.py
ADDED
@@ -0,0 +1,373 @@
"""Toy 50M-param transformer with Gemma 4 alternating SWA + optional engram memory.

Design notes
------------
* RMSNorm pre-norm, SwiGLU MLP, tied embedding/output (standard Llama-ish base).
* Causal mask is precomputed; sliding-window layers use the same code path with
  an additional window-restricted mask (purely a mask difference -- no kernel split).
* RoPE is applied to Q/K only (standard, no Gemma 4 dual-RoPE).
* Engram is an optional 512-slot static memory bank attended-to from one layer's
  output; injected via a multiplicative gate that is zero-initialised so it's a no-op
  at training start. Bit-identical to no-engram when `cfg.engram_enabled=False`.
"""
from __future__ import annotations

import math
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F

from config import Config


# ---------------------------------------------------------------------------
# RMSNorm
# ---------------------------------------------------------------------------
class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compute in float32 for stability; cast back to input dtype.
        dtype = x.dtype
        xf = x.float()
        rms = xf.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return (xf * rms).to(dtype) * self.weight


# ---------------------------------------------------------------------------
# RoPE
# ---------------------------------------------------------------------------
def _build_rope_cache(seq_len: int, head_dim: int, base: float, device, dtype) -> tuple[torch.Tensor, torch.Tensor]:
    assert head_dim % 2 == 0, "head_dim must be even for RoPE"
    half = head_dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(0, half, device=device, dtype=torch.float32) / half))
    t = torch.arange(seq_len, device=device, dtype=torch.float32)
    freqs = torch.einsum("i,j->ij", t, inv_freq)  # (T, half)
    cos = freqs.cos().to(dtype)
    sin = freqs.sin().to(dtype)
    return cos, sin  # each (T, half)


def _apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: (B, n_h, T, head_dim). cos/sin: (T, head_dim/2).
    x1, x2 = x.chunk(2, dim=-1)
    cos_b = cos[None, None, :, :]
    sin_b = sin[None, None, :, :]
    rotated_x1 = x1 * cos_b - x2 * sin_b
    rotated_x2 = x1 * sin_b + x2 * cos_b
    return torch.cat([rotated_x1, rotated_x2], dim=-1)


# ---------------------------------------------------------------------------
# Attention
# ---------------------------------------------------------------------------
class Attention(nn.Module):
    """MHA with RoPE and configurable causal-or-sliding mask.

    `kind == 'global'`: full causal attention.
    `kind == 'slide'` : causal attention restricted to the last `window` tokens.

    Both code paths use F.scaled_dot_product_attention for speed; the only
    difference is the additive mask. When kind=='global' we pass `is_causal=True`
    and skip building an explicit mask. When kind=='slide' we build a banded
    mask that is bit-identical to the global path with appropriate -inf entries
    outside the window.
    """

    def __init__(self, cfg: Config, kind: str):
        super().__init__()
        assert kind in ("global", "slide")
        self.cfg = cfg
        self.kind = kind
        self.n_heads = cfg.n_heads
        self.head_dim = cfg.head_dim
        self.scale = self.head_dim**-0.5

        self.W_q = nn.Linear(cfg.dim, cfg.dim, bias=False)
        self.W_k = nn.Linear(cfg.dim, cfg.dim, bias=False)
        self.W_v = nn.Linear(cfg.dim, cfg.dim, bias=False)
        self.W_o = nn.Linear(cfg.dim, cfg.dim, bias=False)

    def forward(self, x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q = self.W_q(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)  # (B, H, T, Dh)
        k = self.W_k(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.W_v(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)

        q = _apply_rope(q, cos, sin)
        k = _apply_rope(k, cos, sin)

        if self.kind == "global":
            out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        else:
            # Banded causal mask: token t may attend to tokens in [max(0, t-window+1), t].
            mask = _sliding_causal_mask(T, self.cfg.sliding_window, x.device, x.dtype)
            out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask, is_causal=False)

        out = out.transpose(1, 2).contiguous().view(B, T, D)
        return self.W_o(out)


def _sliding_causal_mask(T: int, window: int, device, dtype) -> torch.Tensor:
    """(T, T) additive mask: 0 inside window+causal, -inf outside.

    Token i attends to j iff j <= i and (i - j) < window.
    """
    i = torch.arange(T, device=device).unsqueeze(1)  # (T,1)
    j = torch.arange(T, device=device).unsqueeze(0)  # (1,T)
    causal = j <= i
    in_window = (i - j) < window
    keep = causal & in_window
    mask = torch.zeros((T, T), device=device, dtype=dtype)
    mask = mask.masked_fill(~keep, float("-inf"))
    # SDPA expects (..., T, T) broadcast over batch/heads.
    return mask


# ---------------------------------------------------------------------------
# MLP (SwiGLU)
# ---------------------------------------------------------------------------
class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


# ---------------------------------------------------------------------------
# Block
# ---------------------------------------------------------------------------
class Block(nn.Module):
    def __init__(self, cfg: Config, layer_idx: int):
        super().__init__()
        kind = cfg.attention_kind(layer_idx)
        self.norm1 = RMSNorm(cfg.dim, eps=cfg.norm_eps)
        self.attn = Attention(cfg, kind=kind)
        self.norm2 = RMSNorm(cfg.dim, eps=cfg.norm_eps)
        self.mlp = SwiGLU(cfg.dim, cfg.mlp_hidden)
        self.kind = kind

    def forward(self, x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.norm1(x), cos, sin)
        x = x + self.mlp(self.norm2(x))
        return x


# ---------------------------------------------------------------------------
# Engram external memory
# ---------------------------------------------------------------------------
class Engram(nn.Module):
    """Static memory bank with single-head attention readout + zero-init gate.

    Bit-identical to no-engram at init (the multiplicative gate is zero so injection is 0).
    Becomes non-trivial only after the gate is trained away from zero.
    """

    def __init__(self, cfg: Config):
        super().__init__()
        self.cfg = cfg
        # Slot rows are normalised by RMSNorm at read time.
        self.slots = nn.Parameter(torch.randn(cfg.engram_slots, cfg.dim) * cfg.init_std)
        self.q_proj = nn.Linear(cfg.dim, cfg.dim, bias=False)
        self.k_proj = nn.Linear(cfg.dim, cfg.dim, bias=False)
        self.v_proj = nn.Linear(cfg.dim, cfg.dim, bias=False)
        self.o_proj = nn.Linear(cfg.dim, cfg.dim, bias=False)
        self.norm = RMSNorm(cfg.dim, eps=cfg.norm_eps)
        # Zero-init gate scalar -> sigmoid(0) = 0.5? No, we want exact no-op at init.
        # Use a *raw* gate that we multiply rather than sigmoid; init to 0.
        self.gate = nn.Parameter(torch.zeros(cfg.dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, T, D). Read from memory.
        h_n = self.norm(h)
        q = self.q_proj(h_n)         # (B, T, D)
        k = self.k_proj(self.slots)  # (S, D)
        v = self.v_proj(self.slots)  # (S, D)
        scale = q.shape[-1] ** -0.5
        attn = torch.einsum("btd,sd->bts", q, k) * scale
        w = attn.softmax(dim=-1)
        retrieved = torch.einsum("bts,sd->btd", w, v)
        retrieved = self.o_proj(retrieved)
        # Multiplicative zero-init gate -> exact no-op at init.
        return h + self.gate * retrieved


# ---------------------------------------------------------------------------
# ToyLM
# ---------------------------------------------------------------------------
class ToyLM(nn.Module):
    def __init__(self, cfg: Config):
        super().__init__()
        self.cfg = cfg
        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.dim)
        self.blocks = nn.ModuleList([Block(cfg, i) for i in range(cfg.n_layers)])
        self.norm_f = RMSNorm(cfg.dim, eps=cfg.norm_eps)

        if cfg.engram_enabled:
            self.engram = Engram(cfg)
        else:
            self.engram = None

        if not cfg.tie_embeddings:
            self.lm_head = nn.Linear(cfg.dim, cfg.vocab_size, bias=False)
        else:
            self.lm_head = None

        # RoPE cache; rebuilt lazily if the requested seq_len exceeds it.
        cos, sin = _build_rope_cache(cfg.max_seq_len, cfg.head_dim, cfg.rope_base, device="cpu", dtype=torch.float32)
        self.register_buffer("rope_cos", cos, persistent=False)
        self.register_buffer("rope_sin", sin, persistent=False)

        self._init_weights()

    def _init_weights(self) -> None:
        std = self.cfg.init_std
        for p_name, p in self.named_parameters():
            if p.dim() >= 2:
                nn.init.normal_(p, mean=0.0, std=std)
            elif p_name.endswith(".weight") and "norm" in p_name.lower():
                nn.init.ones_(p)
            elif p_name == "engram.gate":
                nn.init.zeros_(p)
            else:
                nn.init.zeros_(p)

    def forward(self, idx: torch.Tensor, targets: Optional[torch.Tensor] = None) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
        B, T = idx.shape
        assert T <= self.cfg.max_seq_len, f"seq_len {T} > max {self.cfg.max_seq_len}"
        x = self.tok_emb(idx)

        cos = self.rope_cos[:T].to(device=x.device, dtype=x.dtype)
        sin = self.rope_sin[:T].to(device=x.device, dtype=x.dtype)

        for i, blk in enumerate(self.blocks):
            x = blk(x, cos, sin)
            if self.engram is not None and i == self.cfg.engram_inject_layer:
                x = self.engram(x)

        x = self.norm_f(x)

        if self.cfg.tie_embeddings:
            logits = F.linear(x, self.tok_emb.weight)
        else:
            logits = self.lm_head(x)

        # Gemma-2 logit soft-cap (bf16 stability + bounded softmax input).
        if self.cfg.lm_head_logit_cap is not None:
            cap = self.cfg.lm_head_logit_cap
            logits = cap * torch.tanh(logits / cap)

        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                targets.reshape(-1),
                ignore_index=-100,
            )
            # PaLM-style z-loss: penalises log-partition magnitude. Keeps the
            # softmax denominator from drifting; small weight (~1e-4) costs ~0.
            # Computed only on non-ignored positions so it composes with masked SFT.
            if self.cfg.z_loss_weight > 0:
                lse = torch.logsumexp(logits.float(), dim=-1)  # (B, T)
                if targets is not None:
                    valid = targets.reshape(*lse.shape) != -100
                    if valid.any():
                        z = (lse[valid] ** 2).mean()
                    else:
                        z = lse.new_zeros(())
                else:
                    z = (lse ** 2).mean()
                loss = loss + self.cfg.z_loss_weight * z
        return logits, loss

    @torch.no_grad()
    def generate(
        self,
        idx: torch.Tensor,
        max_new_tokens: int = 80,
        *,
        temperature: float = 0.8,
        top_p: float = 0.9,
        rep_penalty: float = 1.3,
        stop_token_ids: Optional[set[int]] = None,
    ) -> torch.Tensor:
        """Sampling-based decode with top-p + repetition penalty.

        Defaults are tuned for sub-10M LMs: greedy alone collapses into
        token-level repetition loops at this scale (entropy stays high but
        argmax follows a self-amplifying trajectory). T=0.8 + top-p 0.9 +
        rep_penalty=1.3 reliably breaks the loop without going incoherent.
        Validated 2026-04-29 on the 12k-step toy 1M checkpoint.

        Pass `temperature=0.0` to recover greedy (without rep_penalty).
        """
        self.eval()
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            logits = logits[:, -1].float()  # (B, V)

            if rep_penalty != 1.0:
                # Per-batch element rep penalty over already-emitted tokens.
                for b in range(idx.size(0)):
                    seen = torch.unique(idx[b])
                    pos = logits[b, seen] > 0
                    logits[b, seen] = torch.where(pos,
                                                  logits[b, seen] / rep_penalty,
                                                  logits[b, seen] * rep_penalty)

            if temperature <= 0.0:
                nxt = logits.argmax(dim=-1, keepdim=True)
            else:
                logits = logits / temperature
                if top_p < 1.0:
                    sorted_logits, sorted_idx = logits.sort(descending=True)
                    cum = F.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
                    mask = cum > top_p
                    mask[..., 1:] = mask[..., :-1].clone()
                    mask[..., 0] = False
                    logits = logits.scatter(1, sorted_idx,
                                            sorted_logits.masked_fill(mask, float('-inf')))
                probs = F.softmax(logits, dim=-1)
                nxt = torch.multinomial(probs, num_samples=1)

            idx = torch.cat([idx, nxt], dim=1)

            if stop_token_ids is not None and nxt[0, 0].item() in stop_token_ids:
                break
            if idx.size(1) >= self.cfg.max_seq_len:
                break

        return idx

    def num_params_breakdown(self) -> dict[str, int]:
        emb = sum(p.numel() for p in self.tok_emb.parameters())
        attn = 0
        mlp = 0
        norms = 0
        for blk in self.blocks:
            attn += sum(p.numel() for p in blk.attn.parameters())
            mlp += sum(p.numel() for p in blk.mlp.parameters())
            norms += sum(p.numel() for p in blk.norm1.parameters())
            norms += sum(p.numel() for p in blk.norm2.parameters())
        norms += sum(p.numel() for p in self.norm_f.parameters())
        engram = sum(p.numel() for p in self.engram.parameters()) if self.engram is not None else 0
        head = sum(p.numel() for p in self.lm_head.parameters()) if self.lm_head is not None else 0
        total = sum(p.numel() for p in self.parameters())
        return {
            "embedding": emb,
            "attention": attn,
            "mlp": mlp,
            "norms": norms,
            "engram": engram,
            "lm_head_extra": head,  # 0 when tied
            "total": total,
        }
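
# Example (added for illustration, not part of the uploaded file): the breakdown
# helper above reproduces the README parameter count for the default config.
from config import Config
from model import ToyLM

print(ToyLM(Config()).num_params_breakdown())
# {'embedding': 4194304, 'attention': 12582912, 'mlp': 37748736, 'norms': 12800,
#  'engram': 0, 'lm_head_extra': 0, 'total': 54538752}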
code/muon.py
ADDED
@@ -0,0 +1,198 @@
"""Muon optimizer for 2D matrices.

Reference: Keller Jordan, "Muon: An optimizer for hidden layers in neural networks"
https://kellerjordan.github.io/posts/muon/

Algorithm
---------
For each 2D parameter W with gradient G:
  1. Maintain momentum buffer M_t = beta * M_{t-1} + G_t
  2. Optionally apply Nesterov: G' = G_t + beta * M_t (or just M_t without Nesterov)
  3. Orthogonalise G' via 5 iterations of Newton-Schulz with the quintic polynomial
     coefficients (3.4445, -4.7750, 2.0315):
         X <- 3.4445 * X - 4.7750 * X X^T X + 2.0315 * (X X^T)^2 X
     after first dividing X by ||X||_F to bring its singular values into [0, ~1.5].
  4. Apply the orthogonalised update: W <- W - lr * adj_factor * O
     where adj_factor = max(1, fan_out / fan_in)**0.5 to scale shorter-dim params.

This optimiser is intended ONLY for parameters with .dim() >= 2. The recommended
recipe uses AdamW for embeddings and 1D tensors (norms, biases). The wrapper
class `HybridOptimizer` here packages that split.

Bit-identical guarantee
-----------------------
When the caller selects optimizer="adamw" in Config, the train script never
constructs Muon -- it builds a single AdamW over all params. The HybridOptimizer
exists only when optimizer="muon"; it is not a sneaky pass-through. This keeps
the two paths cleanly separated.
"""
from __future__ import annotations

from typing import Iterable

import torch
from torch.optim import Optimizer


# ---------------------------------------------------------------------------
# Newton-Schulz orthogonalisation
# ---------------------------------------------------------------------------
@torch.no_grad()
def newton_schulz_5(G: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Quintic Newton-Schulz, 5 iterations. Returns an approximately-orthogonal
    matrix with the same shape as G.

    Operates on the *transposed* shape if rows > cols so that XX^T stays the
    smaller matrix-multiply (canonical optimisation in the reference impl).
    """
    assert G.dim() >= 2
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.float()  # do all NS math in fp32 even if param is bf16
    if X.size(-2) > X.size(-1):
        X = X.transpose(-2, -1)
        transposed = True
    else:
        transposed = False

    # Normalise so ||X||_op <= ~1.5. Frobenius norm is an upper bound on the
    # spectral norm; dividing by it is safe and the standard choice.
    X = X / (X.norm() + eps)

    for _ in range(5):
        A = X @ X.transpose(-2, -1)
        B = b * A + c * (A @ A)
        X = a * X + B @ X

    if transposed:
        X = X.transpose(-2, -1)
    return X.to(G.dtype)


# ---------------------------------------------------------------------------
# Muon
# ---------------------------------------------------------------------------
class Muon(Optimizer):
    def __init__(
        self,
        params: Iterable[torch.Tensor],
        lr: float = 3e-3,
        momentum: float = 0.95,
        nesterov: bool = True,
        weight_decay: float = 0.0,
    ):
        defaults = dict(lr=lr, momentum=momentum, nesterov=nesterov, weight_decay=weight_decay)
        super().__init__(params, defaults)
        for group in self.param_groups:
            for p in group["params"]:
                assert p.dim() >= 2, (
                    f"Muon expects 2D+ params; got shape {tuple(p.shape)}. "
                    "Wrap embeddings + 1D tensors with AdamW (use HybridOptimizer)."
                )

    @torch.no_grad()
    def step(self, closure=None):
        loss = closure() if closure is not None else None

        for group in self.param_groups:
            lr = group["lr"]
            beta = group["momentum"]
            nesterov = group["nesterov"]
            wd = group["weight_decay"]

            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                if "momentum_buffer" not in state:
                    state["momentum_buffer"] = torch.zeros_like(p)
                buf = state["momentum_buffer"]
                buf.mul_(beta).add_(g)
                update = g + beta * buf if nesterov else buf

                # Reshape ND tensors (e.g. conv kernels) into 2D for orthogonalisation.
                # Embeddings are excluded by construction; here we expect Linear weights
                # which are already 2D, but keep the reshape for safety.
                orig_shape = update.shape
                if update.dim() > 2:
                    update = update.reshape(update.shape[0], -1)

                ortho = newton_schulz_5(update)

                # Scale by sqrt(max(1, fan_out/fan_in)) so updates have sane magnitude
                # across rectangular shapes. fan_out = rows, fan_in = cols.
                fan_out, fan_in = ortho.shape[-2], ortho.shape[-1]
                adj = max(1.0, fan_out / fan_in) ** 0.5

                if ortho.shape != orig_shape:
                    ortho = ortho.reshape(orig_shape)

                if wd != 0.0:
                    p.add_(p, alpha=-lr * wd)
                p.add_(ortho, alpha=-lr * adj)

        return loss


# ---------------------------------------------------------------------------
# Hybrid Muon + AdamW wrapper
# ---------------------------------------------------------------------------
class HybridOptimizer:
    """Routes 2D+ params to Muon and 1D / embedding params to AdamW.

    Mimics the torch.optim.Optimizer surface enough for our train loop:
    .step(), .zero_grad(set_to_none=True), .param_groups (for LR scheduling).
    """

    def __init__(
        self,
        named_params: Iterable[tuple[str, torch.nn.Parameter]],
        muon_lr: float,
        adamw_lr: float,
        muon_momentum: float = 0.95,
        adamw_betas: tuple[float, float] = (0.9, 0.95),
        weight_decay: float = 0.0,
    ):
        muon_params = []
        adamw_params = []
        for name, p in named_params:
            if not p.requires_grad:
                continue
            # Embeddings have dim() == 2 but should still go to AdamW per the recipe.
            is_embedding = "tok_emb" in name or "engram.slots" in name
            if p.dim() >= 2 and not is_embedding:
                muon_params.append(p)
            else:
                adamw_params.append(p)

        self.muon = Muon(
            muon_params,
            lr=muon_lr,
            momentum=muon_momentum,
            nesterov=True,
            weight_decay=weight_decay,
        )
        self.adamw = torch.optim.AdamW(
            adamw_params,
            lr=adamw_lr,
            betas=adamw_betas,
            weight_decay=weight_decay,
        )
        self.param_groups = self.muon.param_groups + self.adamw.param_groups

    def step(self, closure=None):
        if closure is not None:
            raise NotImplementedError("HybridOptimizer does not support a closure.")
        self.muon.step()
        self.adamw.step()

    def zero_grad(self, set_to_none: bool = True):
        self.muon.zero_grad(set_to_none=set_to_none)
        self.adamw.zero_grad(set_to_none=set_to_none)

    def state_dict(self):
        return {"muon": self.muon.state_dict(), "adamw": self.adamw.state_dict()}

    def load_state_dict(self, sd):
        self.muon.load_state_dict(sd["muon"])
        self.adamw.load_state_dict(sd["adamw"])
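
# Example (added for illustration, not part of the uploaded file): wiring the
# Muon + AdamW split described in the module docstring. The learning rates here
# are placeholders, not the values used for the released checkpoints.
import torch

from config import Config
from model import ToyLM
from muon import HybridOptimizer

model = ToyLM(Config())
opt = HybridOptimizer(model.named_parameters(), muon_lr=3e-3, adamw_lr=3e-4)

idx = torch.randint(0, 8192, (2, 128))
_, loss = model(idx, targets=idx)   # smoke test only; real training uses shifted targets
loss.backward()
opt.step()
opt.zero_grad(set_to_none=True)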
code/tokenizer.py
ADDED
@@ -0,0 +1,109 @@
"""Train a fresh 8K BPE on a FineWeb-edu sample.

This is the 50M-scale variant of the 1M project's 4K BPE. We bump the default
vocab to 8192 and keep the document count at 50000 (same as the 1M run: that
doc count was already saturating BPE merge quality at 4K vocab, and doubling
the vocab needs roughly the same training set, not 2x more).

We do NOT reuse any FANT tokenizer here -- the point of this experiment family
is a clean small recipe with no external dependencies.

Output: tokenizer.json in the working dir (or wherever specified).
"""
from __future__ import annotations

import argparse
import time
from pathlib import Path

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel as BLPre
from tokenizers.decoders import ByteLevel as BLDec
from tokenizers.processors import ByteLevel as BLPost
from tokenizers.trainers import BpeTrainer


SPECIAL_TOKENS = [
    "<|pad|>",       # 0
    "<|bos|>",       # 1
    "<|eos|>",       # 2
    "<|unk|>",       # 3
    "<|im_start|>",  # 4 -- chat role open
    "<|im_end|>",    # 5 -- chat role close
]


def _iter_fineweb(n_docs: int):
    """Yield up to `n_docs` text strings from the FineWeb-edu streaming feed."""
    from datasets import load_dataset

    ds = load_dataset(
        "HuggingFaceFW/fineweb-edu",
        name="default",
        split="train",
        streaming=True,
    )
    n = 0
    for ex in ds:
        if n >= n_docs:
            return
        text = ex.get("text", "")
        if isinstance(text, str) and text.strip():
            n += 1
            yield text


def train_tokenizer(out_path: str = "tokenizer.json", vocab_size: int = 8192, n_docs: int = 50000) -> str:
    tok = Tokenizer(BPE(unk_token="<|unk|>"))
    tok.pre_tokenizer = BLPre(add_prefix_space=False)
    tok.decoder = BLDec()
    tok.post_processor = BLPost(trim_offsets=False)

    trainer = BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=SPECIAL_TOKENS,
        initial_alphabet=BLPre.alphabet(),
        show_progress=False,
    )

    print(f"[tokenizer] streaming up to {n_docs} FineWeb-edu docs...")
    t0 = time.time()
    docs = list(_iter_fineweb(n_docs))
    print(f"[tokenizer] collected {len(docs)} docs in {time.time() - t0:.1f}s")

    print(f"[tokenizer] training BPE vocab_size={vocab_size}...")
    t0 = time.time()
    tok.train_from_iterator(docs, trainer=trainer)
    print(f"[tokenizer] trained in {time.time() - t0:.1f}s; vocab={tok.get_vocab_size()}")

    out_dir = Path(out_path).parent
    if str(out_dir) and not out_dir.exists():
        out_dir.mkdir(parents=True, exist_ok=True)
    tok.save(out_path)
    print(f"[tokenizer] saved to {out_path}")
    return out_path


def load_tokenizer(path: str = "tokenizer.json") -> Tokenizer:
    return Tokenizer.from_file(path)


# Convenience accessors used by data.py / train.py
def special_token_id(tok: Tokenizer, name: str) -> int:
    tid = tok.token_to_id(name)
    assert tid is not None, f"{name} not in tokenizer"
    return tid


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--out", default="tokenizer.json")
    ap.add_argument("--vocab", type=int, default=8192)
    ap.add_argument("--docs", type=int, default=50000)
    args = ap.parse_args()
    train_tokenizer(args.out, args.vocab, args.docs)


if __name__ == "__main__":
    main()
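
# Example (added for illustration, not part of the uploaded file): retrain the 8K BPE
# from scratch, then sanity-check the result.
#   python code/tokenizer.py --out tokenizer.json --vocab 8192 --docs 50000
from tokenizer import load_tokenizer, special_token_id

tok = load_tokenizer("tokenizer.json")
print(special_token_id(tok, "<|bos|>"))           # 1, per SPECIAL_TOKENS above
print(tok.decode(tok.encode("hello world").ids))  # byte-level BPE round-trip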
models/model.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a037d9d847c357510b0a09d4dd6c169cacbd988dd24aba945c416a8f93397e7e
size 109112123
models/pretrain.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:06dfe158490abfe41644ebf7f44942d98af32ebe3602892e25117cb8c623c49a
size 226660407
models/tokenizer.json
ADDED
The diff for this file is too large to render.