cayley-small-2L-mlp_in-20B

A 202.5M-parameter GPT with a 2-level CayleySAE inserted at mlp_in in every transformer block, trained on 20B tokens of FineWeb-Edu. Part of the 20B campaign that standardizes all four cayley variants (small/large × 2L/3L) on the same token budget; see mh/reports/37-20B-campaign.md.

Headline

  • Val loss: 3.1584 (iter 12716, final = best)
  • Training tokens: 20B (FineWeb-Edu, sample-100BT 25B-token slice)
  • Wall clock: ~10h 12m on 8× A100 80GB PCIe (vastai-8xA100)

This is the canonical 12L/d=1024 2L-CayleySAE model at the 20B budget. Its 3L sibling markhenry/cayley-small-3L-mlp_in-20B beats it by 0.025 nats (val 3.1330) at the same budget, so the deeper hierarchy helps. Both supersede the older 16k-iter aemack-org/cayley-10b (val 3.173).

Backbone

  • 12 transformer blocks, d_model 1024, 8 heads (head_dim 128)
  • 202.5M total parameters
  • seq_len 1024
  • RMSNorm, learned absolute position embeddings (no RoPE)

CayleySAE

Inserted at mlp_in in every block: RMSNorm → CayleySAE → MLP. Output is dense d=1024; sparsity lives in the intermediate code.

  • 2 levels with hierarchy 10,16,0;15,32,256
    • L0: n=10 (1024 coords), k=16
    • L1: n=15 (32k leaves), k=32, parent budget 256
  • 48 active features per token (16 + 32)
  • Parameter-free algebraic dictionary; only per-feature biases are learned
  • cayley-per-parent-budget, cayley-score-standardize, and cayley-forward-standardized (the "zombie fix": z-scores forwarded into reconstruction, see report 29 in repo) all enabled
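
The hierarchy string above can be unpacked mechanically. The following is a minimal sketch, assuming each semicolon-separated group reads `n,k,parent_budget` (as the bullets suggest) and that a level holds 2**n features. `Level` and `parse_hierarchy` are hypothetical names for illustration, not identifiers from the repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    n: int              # log2 of the number of features at this level
    k: int              # active features per token at this level
    parent_budget: int  # parent budget (0 for the root level)

    @property
    def size(self) -> int:
        # Assumption: a level with exponent n holds 2**n features.
        return 2 ** self.n


def parse_hierarchy(spec: str) -> list[Level]:
    """Parse 'n,k,budget;n,k,budget' into Level records (assumed format)."""
    levels = []
    for group in spec.split(";"):
        n, k, budget = (int(field) for field in group.split(","))
        levels.append(Level(n, k, budget))
    return levels


levels = parse_hierarchy("10,16,0;15,32,256")
sizes = [lvl.size for lvl in levels]      # features per level
active = sum(lvl.k for lvl in levels)     # active features per token
print(sizes, active)
```

Under these assumptions the sizes come out as 1024 and 32768, and the per-token active count as 16 + 32 = 48, matching the figures listed above.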

n0 is forced to log2(d_model) = 10; this requires the t=10 primitive polynomial in deeptopk/f2_algebra.py.
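
For readers unfamiliar with the term, a degree-t polynomial over F_2 is primitive when x generates the full multiplicative group of F_2[x]/(p), i.e. x has order 2^t - 1. The sketch below checks that property by brute force. The card does not say which polynomial deeptopk/f2_algebra.py uses; x^10 + x^3 + 1, a standard degree-10 primitive polynomial, stands in here purely for illustration, and `order_of_x` and `is_primitive` are hypothetical helpers.

```python
def order_of_x(poly: int, deg: int) -> int:
    """Multiplicative order of x in F_2[x]/(poly).

    poly is a bitmask: bit i holds the coefficient of x**i.
    """
    acc = 1
    for step in range(1, 2 ** deg):
        acc <<= 1                # multiply by x
        if (acc >> deg) & 1:     # degree reached deg: reduce modulo poly
            acc ^= poly
        if acc == 1:
            return step
    raise ValueError("x is not invertible modulo poly")


def is_primitive(poly: int, deg: int) -> bool:
    # Primitive iff x cycles through all 2**deg - 1 nonzero residues.
    return order_of_x(poly, deg) == 2 ** deg - 1


candidate = (1 << 10) | (1 << 3) | 1   # x^10 + x^3 + 1 (illustrative choice)
reducible = (1 << 10) | 1              # x^10 + 1, where x^10 = 1 so the order is 10

print(is_primitive(candidate, 10), is_primitive(reducible, 10))
```

With t=10 the cycle length is 2^10 - 1 = 1023, one short of the 1024 coordinates at L0.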

Training recipe

| Knob | Value |
| --- | --- |
| Optimizer | Muon (2D weights) + AdamW (embeddings, biases) |
| Peak Muon LR | 1.2e-2 |
| Min Muon LR | 1.5e-4 |
| Peak AdamW LR | 1.2e-2 (lockstep with Muon) |
| Min AdamW LR | 1.5e-4 |
| LR schedule | linear_warmdown |
| warmdown_frac | 0.9 (super-Chinchilla) |
| Warmup iters | 200 |
| Batch size (per rank) | 32 |
| Gradient accumulation | 48 (global; 6 micro-steps × 8 ranks) |
| Tokens per iter | 1,572,864 |
| Total iters | 12,716 |
| World size | 8× A100 80GB PCIe |
| Dataset | FineWeb-Edu sample-100BT (25B-token slice) |
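
The batch figures in the table can be cross-checked with a few lines of arithmetic. This sketch assumes "48 (global)" means 48 micro-batches of 32 sequences per optimizer step, summed over all ranks; that reading reproduces the stated tokens per iter. The variable names are illustrative only.

```python
seq_len = 1024
batch_per_rank = 32        # sequences per micro-step on each rank
grad_accum_global = 48     # micro-batches per optimizer step, summed over ranks
world_size = 8
total_iters = 12_716
n_params = 202.5e6

micro_steps_per_rank = grad_accum_global // world_size
tokens_per_iter = batch_per_rank * seq_len * grad_accum_global
total_tokens = tokens_per_iter * total_iters
tokens_per_param = total_tokens / n_params

print(micro_steps_per_rank)        # micro-steps each rank runs per iter
print(f"{tokens_per_iter:,}")      # tokens consumed per optimizer step
print(f"{total_tokens:,}")         # tokens consumed over the full run
print(round(tokens_per_param, 1))  # training tokens per model parameter
```

The total lands just over 20B tokens, which is where the iteration count of 12,716 comes from.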

Warmup 0 → 200 iters; flat phase 200 → 1271 iters; warmdown 1271 → 12716 (linear 1.2e-2 → 1.5e-4, in lockstep on both Muon and AdamW).
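
The three phases can be written as a single function of the iteration index. This is a sketch of the schedule as described, not the code in sparse_nanogpt.train: the linear warmup shape from zero and the truncation used to place the warmdown start are assumptions, and `warmdown_start` and `lr_at` are hypothetical names.

```python
def warmdown_start(total_iters: int, warmdown_frac: float) -> int:
    # Assumption: the flat phase ends at floor(total_iters * (1 - warmdown_frac)).
    return int(total_iters * (1 - warmdown_frac))


def lr_at(it: int,
          peak: float = 1.2e-2,
          floor: float = 1.5e-4,
          warmup_iters: int = 200,
          total_iters: int = 12_716,
          warmdown_frac: float = 0.9) -> float:
    """Learning rate at iteration `it` under warmup -> flat -> linear warmdown."""
    start = warmdown_start(total_iters, warmdown_frac)
    if it < warmup_iters:
        # Assumed linear warmup from 0 to peak.
        return peak * it / warmup_iters
    if it < start:
        return peak
    # Linear interpolation from peak (at start) to floor (at total_iters).
    frac = (it - start) / (total_iters - start)
    return peak + frac * (floor - peak)


for it in (0, 100, 200, 1271, 7000, 12_716):
    print(it, f"{lr_at(it):.6f}")
```

Because Muon and AdamW share the same peak and minimum in this recipe, one function covers both optimizers.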

wandb: u25107g1

Recipe rationale

This is the v6-fullrun-floor recipe (lockstep Muon/AdamW 1.2e-2, linear_warmdown wf=0.9) plus the zombie fix from report 29 (--cayley-forward-standardized). wf=0.9 is the super-Chinchilla setting, appropriate when D ≥ Chinchilla: at 20B tokens for a 202.5M-parameter model, D/N ≈ 99 tokens per parameter, roughly 5× the Chinchilla-optimal ratio of about 20 tokens per parameter. (Sub-Chinchilla configurations of the same backbone prefer wf=0.2–0.5; see memory feedback_warmdown_frac_depends_on_saturation.)

Training health

| Signal | Outcome |
| --- | --- |
| Throughput | ~572k tok/s steady state on 8× A100 80GB PCIe |
| Peak VRAM | 33.83 GB / 80 GB (peak allocated, rank 0) |
| L0 dead features (of 1024) | 0 throughout |
| L1 dead features (of 32768) | 0 throughout |
| Training trajectory | Val descends monotonically; final eval = best |
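
As a rough consistency check, the steady-state throughput can be compared against the reported wall clock. The sketch below uses only figures from this card; attributing the remainder to evals, checkpointing, and startup is a guess, not something the card states.

```python
total_tokens = 20e9                  # training budget
throughput = 572e3                   # tokens per second, steady state
wall_clock_s = 10 * 3600 + 12 * 60   # reported ~10h 12m

compute_s = total_tokens / throughput
hours, rem = divmod(compute_s, 3600)
overhead_s = wall_clock_s - compute_s

print(f"pure training time: {int(hours)}h {int(rem // 60)}m")
print(f"unaccounted wall clock: {overhead_s / 60:.0f} min")
```

Pure token processing at the quoted rate accounts for most of the run, leaving well under an hour of other wall-clock time.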

Lightweight quick_evals (HellaSwag / LAMBADA / pile_ppl) were skipped because an import failed in the training environment (No module named 'evals'); final-iter val_loss is the only eval recorded for this checkpoint.

Lineage

  • Predecessor backbone: aemack-org/cayley-10b (12L/d=1024, k16-2L-mlp_in, trained 16k iters / 10.5B tokens, val 3.173)
  • Same backbone, parity baseline: markhenry/vanilla-v5-parity (val 3.173, cold-stopped at the cayley-10b val to measure interpretability deltas at matched loss)
  • 3L sibling at the same 20B budget: markhenry/cayley-small-3L-mlp_in-20B, val 3.1330 (-0.025 nats vs this run)
  • Larger sibling (24L/d=2048, 2L): in progress under cayley-large-2L-mlp_in-20B

Files

  • ckpt.pt: best checkpoint, iter 12716 (= final)
  • config.json: training configuration consumed by sparse_nanogpt.train
  • train_cayley_small_2L_mlp_in_20B.sh: exact training script for provenance