cayley-small-2L-mlp_in-20B
A 202.5M-parameter GPT with a 2-level CayleySAE inserted at mlp_in in
every transformer block, trained on 20B tokens of FineWeb-Edu. Part of the
20B campaign that standardizes all four cayley variants (small/large × 2L/3L)
on the same token budget; see
mh/reports/37-20B-campaign.md.
Headline
- Val loss: 3.1584 (iter 12716, final = best)
- Training tokens: 20B (FineWeb-Edu, sample-100BT 25B-token slice)
- Wall clock: ~10h 12m on 8× A100 80GB PCIe (vastai-8xA100)
This is the canonical 12L/d=1024 2L-CayleySAE model at the 20B budget. Its 3L
sibling markhenry/cayley-small-3L-mlp_in-20B
beats it by 0.025 nats (val 3.1330) at the same budget: the deeper hierarchy
helps. Both supersede the older 16k-iter
aemack-org/cayley-10b (val 3.173).
Backbone
- 12 transformer blocks, d_model 1024, 8 heads (head_dim 128)
- 202.5M total parameters
- seq_len 1024
- RMSNorm, learned absolute position embeddings (no RoPE)
CayleySAE
Inserted at mlp_in in every block: RMSNorm → CayleySAE → MLP. Output is
dense d=1024; sparsity lives in the intermediate code.
- 2 levels with hierarchy 10,16,0;15,32,256
  - L0: n=10 (1024 coords), k=16
  - L1: n=15 (32k leaves), k=32, parent budget 256
- 48 active features per token (16 + 32)
- Parameter-free algebraic dictionary; only per-feature biases are learned
- cayley-per-parent-budget, cayley-score-standardize, and cayley-forward-standardized (the "zombie fix": z-scores forwarded into reconstruction, see report 29 in repo) all enabled
- n0 is forced to log2(d_model) = 10; this requires the t=10 primitive
  polynomial in deeptopk/f2_algebra.py.
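The hierarchy string can be unpacked with a few lines of Python. This is a minimal sketch, not the repository's parser: the field order (n, k, parent budget) is inferred from the per-level breakdown above, and the `Level` class and `parse_hierarchy` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    n: int              # log2 of the number of features at this level
    k: int              # features kept active per token at this level
    parent_budget: int  # third field of the spec (0 for L0, 256 for L1)

    @property
    def size(self) -> int:
        # A level with exponent n has 2**n features.
        return 2 ** self.n


def parse_hierarchy(spec: str) -> list[Level]:
    """Parse 'n,k,budget;n,k,budget' into Level records (assumed field order)."""
    levels = []
    for chunk in spec.split(";"):
        n, k, budget = (int(x) for x in chunk.split(","))
        levels.append(Level(n, k, budget))
    return levels


levels = parse_hierarchy("10,16,0;15,32,256")
for i, lvl in enumerate(levels):
    print(f"L{i}: size={lvl.size}, k={lvl.k}, parent_budget={lvl.parent_budget}")

# Total active features per token is the sum of k over levels.
active_per_token = sum(lvl.k for lvl in levels)
print("active features per token:", active_per_token)
```

Under that reading, L0 has 1024 features, L1 has 32768, and the per-token total is 16 + 32 = 48, matching the figures listed above.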
Training recipe
| Knob | Value |
|---|---|
| Optimizer | Muon (2D weights) + AdamW (embeddings, biases) |
| Peak Muon LR | 1.2e-2 |
| Min Muon LR | 1.5e-4 |
| Peak AdamW LR | 1.2e-2 (lockstep with Muon) |
| Min AdamW LR | 1.5e-4 |
| LR schedule | linear_warmdown |
| warmdown_frac | 0.9 (super-Chinchilla) |
| Warmup iters | 200 |
| Batch size (per rank) | 32 |
| Gradient accumulation | 48 (global; 6 micro-steps × 8 ranks) |
| Tokens per iter | 1,572,864 |
| Total iters | 12,716 |
| World size | 8× A100 80GB PCIe |
| Dataset | FineWeb-Edu sample-100BT (25B-token slice) |
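
The batch figures in the table are mutually consistent, which the short check below makes explicit. It is only arithmetic on the values listed above; the variable names are illustrative and do not correspond to flags in the training script.

```python
SEQ_LEN = 1024          # from the Backbone section
BATCH_PER_RANK = 32     # sequences per micro-step on each rank
MICRO_STEPS = 6         # micro-steps per rank per optimizer step
WORLD_SIZE = 8          # number of GPUs
TOTAL_ITERS = 12_716

# "Global" gradient accumulation counts micro-batches across all ranks.
grad_accum_global = MICRO_STEPS * WORLD_SIZE

# Tokens consumed by one optimizer step.
tokens_per_iter = BATCH_PER_RANK * SEQ_LEN * grad_accum_global

# Tokens consumed over the whole run.
total_tokens = tokens_per_iter * TOTAL_ITERS

print(f"global grad accumulation: {grad_accum_global}")
print(f"tokens per iter:          {tokens_per_iter:,}")
print(f"total tokens:             {total_tokens:,} (~{total_tokens / 1e9:.2f}B)")
```

The product comes out just over 20B tokens, which is where the iteration count of 12,716 comes from.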
Warmup 0 → 200 iters; flat phase 200 → 1271 iters; warmdown 1271 → 12716 (linear 1.2e-2 → 1.5e-4 lockstep on both Muon and AdamW).
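
The three-phase schedule can be written as a small function. This is a sketch under stated assumptions rather than the code in the training script: the phase boundaries are copied from the sentence above, the warmup is assumed to be linear from zero, and `lr_at` is a hypothetical name.

```python
PEAK_LR = 1.2e-2        # peak LR, shared by Muon and AdamW (lockstep)
MIN_LR = 1.5e-4         # final LR at the end of warmdown
WARMUP_END = 200        # end of warmup, start of flat phase
WARMDOWN_START = 1271   # start of linear warmdown (warmdown_frac = 0.9)
TOTAL_ITERS = 12716


def lr_at(it: int) -> float:
    """Learning rate at iteration `it` for the warmup / flat / warmdown schedule."""
    if it < WARMUP_END:
        # Assumption: linear ramp from 0 to the peak.
        return PEAK_LR * it / WARMUP_END
    if it < WARMDOWN_START:
        # Flat phase at the peak.
        return PEAK_LR
    # Linear interpolation from PEAK_LR down to MIN_LR.
    frac = (it - WARMDOWN_START) / (TOTAL_ITERS - WARMDOWN_START)
    frac = min(frac, 1.0)
    return PEAK_LR + frac * (MIN_LR - PEAK_LR)


for it in (0, 100, 200, 1271, 6993, 12716):
    print(f"iter {it:>5}: lr = {lr_at(it):.6f}")
```

Because both optimizers share the same peak and floor, one function describes the Muon and AdamW schedules alike.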
wandb: u25107g1
Recipe rationale
This is the v6-fullrun-floor recipe (lockstep Muon/AdamW 1.2e-2,
linear_warmdown wf=0.9) plus the zombie fix from report 29
(--cayley-forward-standardized). wf=0.9 is the super-Chinchilla setting
appropriate when D ≥ Chinchilla: at 20B tokens for a 202.5M-parameter model,
D/N ≈ 99 tokens per parameter, roughly 5× the commonly cited Chinchilla-optimal
ratio of ~20. (Sub-Chinchilla configurations of the same backbone
prefer wf=0.2 to 0.5; see memory feedback_warmdown_frac_depends_on_saturation.)
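
For reference, the data-to-parameter ratio behind this argument can be computed directly. The sketch below assumes the common rule of thumb of about 20 training tokens per parameter as the Chinchilla-optimal point; that constant is an assumption and does not appear in the run configuration.

```python
PARAMS = 202.5e6   # total model parameters
TOKENS = 20e9      # training token budget

# Assumption: ~20 tokens per parameter as the Chinchilla-optimal rule of thumb.
CHINCHILLA_TOKENS_PER_PARAM = 20.0

tokens_per_param = TOKENS / PARAMS
chinchilla_multiple = tokens_per_param / CHINCHILLA_TOKENS_PER_PARAM

print(f"tokens per parameter: {tokens_per_param:.1f}")
print(f"multiple of Chinchilla-optimal: {chinchilla_multiple:.1f}x")
```

Under that assumption the run sits well past the compute-optimal point, which is the regime the long warmdown is meant for.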
Training health
| Signal | Outcome |
|---|---|
| Throughput | ~572k tok/s steady state on 8× A100 80GB PCIe |
| Peak VRAM | 33.83 GB / 80 GB (peak allocated, rank 0) |
| L0 dead features (of 1024) | 0 throughout |
| L1 dead features (of 32768) | 0 throughout |
| Training trajectory | Val descends monotonically; final eval = best |
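
As a sanity check, the steady-state throughput can be compared with the reported wall clock. Both inputs are approximate figures from this card, so the result is only indicative; what the remaining time was spent on is not recorded here.

```python
TOTAL_TOKENS = 12_716 * 1_572_864   # iters × tokens per iter
THROUGHPUT_TOK_S = 572_000          # reported steady-state throughput
WALL_CLOCK_S = 10 * 3600 + 12 * 60  # reported ~10h 12m

# Time the run would take at steady-state throughput alone.
compute_h = TOTAL_TOKENS / THROUGHPUT_TOK_S / 3600

# Remainder of the wall clock not explained by steady-state training.
overhead_h = WALL_CLOCK_S / 3600 - compute_h

print(f"pure training time: {compute_h:.2f} h")
print(f"unaccounted time:   {overhead_h * 60:.0f} min")
```

The two figures agree to within roughly half an hour over a ten-hour run.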
Lightweight quick_evals (HellaSwag / LAMBADA / pile_ppl) were skipped because
the import failed (No module named 'evals') in the training environment;
final-iter val_loss is the only eval recorded for this checkpoint.
Lineage
- Predecessor backbone: aemack-org/cayley-10b (12L/d=1024, k16-2L-mlp_in, trained 16k iters / 10.5B tokens, val 3.173)
- Same backbone, parity baseline: markhenry/vanilla-v5-parity (val 3.173, cold-stopped at the cayley-10b val to measure interpretability deltas at matched loss)
- 3L sibling at the same 20B budget: markhenry/cayley-small-3L-mlp_in-20B, val 3.1330 (-0.025 nats vs this run)
- Larger sibling (24L/d=2048, 2L): in progress under cayley-large-2L-mlp_in-20B
Files
- ckpt.pt: best checkpoint, iter 12716 (= final)
- config.json: training configuration consumed by sparse_nanogpt.train
- train_cayley_small_2L_mlp_in_20B.sh: exact training script for provenance