cayley-small-2L-mlp_in-20B

A 202.5M-parameter GPT with a 2-level CayleySAE inserted at mlp_in in every transformer block, trained on 20B tokens of FineWeb-Edu. Part of the 20B campaign that standardizes all four cayley variants (small/large × 2L/3L) on the same token budget — see mh/reports/37-20B-campaign.md.

Headline

Val loss: 3.1584 (iter 12716, final = best)
Training tokens: 20B (FineWeb-Edu, sample-100BT 25B-token slice)
Wall clock: ~10h 12m on 8× A100 80GB PCIe (vastai-8xA100)

This is the canonical 12L/d=1024 2L-CayleySAE model at the 20B budget. Its 3L sibling markhenry/cayley-small-3L-mlp_in-20B reaches val 3.1330 at the same budget, 0.025 nats lower, which suggests the deeper hierarchy helps. Both supersede the older 16k-iter aemack-org/cayley-10b (val 3.173).
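For readers who think in perplexity rather than nats, the losses quoted above convert directly. The sketch below assumes the reported val loss is a mean per-token cross-entropy in nats; the card gives deltas in nats, but the per-token averaging is an assumption. The dictionary and function names are illustrative only.

```python
import math

# Validation losses quoted in this card.
# Assumption: each is a mean per-token cross-entropy measured in nats.
VAL_LOSS = {
    "cayley-small-2L-mlp_in-20B": 3.1584,
    "cayley-small-3L-mlp_in-20B": 3.1330,
    "cayley-10b": 3.173,
}


def perplexity(loss_nats: float) -> float:
    """Convert a mean cross-entropy in nats to per-token perplexity."""
    return math.exp(loss_nats)


# Gap between this run and its 3L sibling, in nats.
gap_nats = (
    VAL_LOSS["cayley-small-2L-mlp_in-20B"]
    - VAL_LOSS["cayley-small-3L-mlp_in-20B"]
)

for name, loss in VAL_LOSS.items():
    print(f"{name}: loss={loss:.4f} nats, ppl={perplexity(loss):.2f}")
print(f"2L minus 3L gap: {gap_nats:.4f} nats")
```

A gap of this size corresponds to a perplexity ratio of exp(-gap), so small differences in nats translate to proportionally small changes in perplexity.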

Backbone

12 transformer blocks, d_model 1024, 8 heads (head_dim 128)
202.5M total parameters
seq_len 1024
RMSNorm, learned absolute position embeddings (no RoPE)

CayleySAE

Inserted at mlp_in in every block: RMSNorm → CayleySAE → MLP. Output is dense d=1024; sparsity lives in the intermediate code.

2 levels with hierarchy 10,16,0;15,32,256
- L0: n=10 (1024 coords), k=16
- L1: n=15 (32k leaves), k=32, parent budget 256
48 active features per token (16 + 32)
Parameter-free algebraic dictionary; only per-feature biases are learned
cayley-per-parent-budget, cayley-score-standardize, and cayley-forward-standardized (the "zombie fix" — z-scores forwarded into reconstruction, see report 29 in repo) all enabled

n0 is forced to log2(d_model) = 10; this requires the t=10 primitive polynomial in deeptopk/f2_algebra.py.
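The hierarchy string can be unpacked as follows. This is a minimal sketch, not the repository's parser: it assumes each `;`-separated level is `n,k,parent_budget`, which is inferred from the per-level breakdown above. The `Level` class and `parse_hierarchy` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Level:
    n: int              # log2 of the number of features at this level
    k: int              # active features per token at this level
    parent_budget: int  # third field of the spec (0 at the root level)

    @property
    def num_features(self) -> int:
        return 2 ** self.n


def parse_hierarchy(spec: str) -> List[Level]:
    """Parse 'n,k,budget;n,k,budget;...' into a list of Level records."""
    levels = []
    for chunk in spec.split(";"):
        n, k, budget = (int(field) for field in chunk.split(","))
        levels.append(Level(n=n, k=k, parent_budget=budget))
    return levels


D_MODEL = 1024
levels = parse_hierarchy("10,16,0;15,32,256")

# The card states n0 is forced to log2(d_model), i.e. 2**n0 == d_model.
assert levels[0].num_features == D_MODEL

active_per_token = sum(level.k for level in levels)
for i, level in enumerate(levels):
    print(f"L{i}: {level.num_features} features, k={level.k}, "
          f"parent_budget={level.parent_budget}")
print(f"active features per token: {active_per_token}")
```

Under that field-order assumption the numbers match the card: 1024 and 32768 features, with 16 + 32 = 48 active per token.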

Training recipe

Knob	Value
Optimizer	Muon (2D weights) + AdamW (embeddings, biases)
Peak Muon LR	1.2e-2
Min Muon LR	1.5e-4
Peak AdamW LR	1.2e-2 (lockstep with Muon)
Min AdamW LR	1.5e-4
LR schedule	linear_warmdown
`warmdown_frac`	0.9 (super-Chinchilla)
Warmup iters	200
Batch size (per rank)	32
Gradient accumulation	48 (global; 6 micro-steps × 8 ranks)
Tokens per iter	1,572,864
Total iters	12,716
World size	8× A100 80GB PCIe
Dataset	FineWeb-Edu sample-100BT (25B-token slice)

Warmup 0 → 200 iters; flat phase 200 → 1271 iters; warmdown 1271 → 12716 (linear 1.2e-2 → 1.5e-4 lockstep on both Muon and AdamW).
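The three phases above can be sketched as a single function of the iteration index. This is an illustration under stated assumptions, not the training code: the warmup is assumed to be linear from zero, and the warmdown start is assumed to be the floor of total_iters × (1 − warmdown_frac), which reproduces the 1271 boundary quoted above. The function names are hypothetical.

```python
def warmdown_start(total_iters: int = 12716, warmdown_frac: float = 0.9) -> int:
    """First iteration of the linear warmdown (assumed: floor of the flat-phase end)."""
    return int(total_iters * (1 - warmdown_frac))


def lr_at(
    it: int,
    peak: float = 1.2e-2,
    floor: float = 1.5e-4,
    warmup_iters: int = 200,
    total_iters: int = 12716,
    warmdown_frac: float = 0.9,
) -> float:
    """Learning rate at iteration `it` for a warmup / flat / linear-warmdown schedule."""
    start = warmdown_start(total_iters, warmdown_frac)
    if it < warmup_iters:
        # Assumption: linear warmup from 0 to the peak LR.
        return peak * it / warmup_iters
    if it < start:
        # Flat phase at the peak LR.
        return peak
    # Linear interpolation from peak down to floor, clamped at the final iteration.
    progress = min((it - start) / (total_iters - start), 1.0)
    return peak + (floor - peak) * progress


for it in (0, 100, 200, 1271, 7000, 12716):
    print(f"iter {it:>5}: lr = {lr_at(it):.6f}")
```

Because Muon and AdamW run in lockstep with the same peak and minimum, the same function describes both optimizers in this recipe.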

wandb: u25107g1

Recipe rationale

This is the v6-fullrun-floor recipe (lockstep Muon/AdamW peak LR 1.2e-2, linear_warmdown with wf=0.9) plus the zombie fix from report 29 (--cayley-forward-standardized). wf=0.9 is the super-Chinchilla setting, appropriate when the token budget D is at or above the Chinchilla-optimal budget: at 20B tokens for a 202.5M-parameter model, D/N ≈ 99 tokens per parameter, roughly 5× the commonly cited Chinchilla ratio of about 20 tokens per parameter. (Sub-Chinchilla configurations of the same backbone prefer wf=0.2–0.5; see memory feedback_warmdown_frac_depends_on_saturation.)
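The token-budget arithmetic behind this rationale can be reproduced from the recipe table. The sketch below uses only values from the card, except for the Chinchilla ratio: the figure of about 20 tokens per parameter is a common rule of thumb and an assumption here, since the card does not state which ratio it uses.

```python
# Values taken from the training recipe in this card.
SEQ_LEN = 1024
BATCH_PER_RANK = 32          # sequences per micro-step per rank
GRAD_ACCUM_GLOBAL = 48       # 6 micro-steps x 8 ranks
TOTAL_ITERS = 12_716
N_PARAMS = 202.5e6

# Assumption: rule-of-thumb Chinchilla-optimal ratio, not stated in the card.
CHINCHILLA_TOKENS_PER_PARAM = 20

tokens_per_iter = BATCH_PER_RANK * SEQ_LEN * GRAD_ACCUM_GLOBAL
total_tokens = tokens_per_iter * TOTAL_ITERS
tokens_per_param = total_tokens / N_PARAMS
chinchilla_multiple = tokens_per_param / CHINCHILLA_TOKENS_PER_PARAM

print(f"tokens per iter:      {tokens_per_iter:,}")
print(f"total tokens:         {total_tokens:,}")
print(f"tokens per parameter: {tokens_per_param:.1f}")
print(f"multiple of assumed Chinchilla budget: {chinchilla_multiple:.2f}x")
```

The tokens-per-iter figure matches the recipe table exactly, and the total lands just above 20B. The multiple scales inversely with whatever tokens-per-parameter ratio is assumed.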

Training health

Signal	Outcome
Throughput	~572k tok/s steady state on 8× A100 80GB PCIe
Peak VRAM	33.83 GB / 80 GB (peak allocated, rank 0)
L0 dead features (of 1024)	0 throughout
L1 dead features (of 32768)	0 throughout
Training trajectory	Val descends monotonically; final eval = best
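As a consistency check, the steady-state throughput can be compared against the reported wall clock. The sketch below uses only figures from this card. The difference it computes is simply time not explained by steady-state throughput; attributing it to evaluation, checkpointing, or startup would be an assumption.

```python
# Figures quoted in this card.
TOTAL_TOKENS = 12_716 * 1_572_864      # total iters x tokens per iter
THROUGHPUT_TOK_PER_S = 572_000         # ~572k tok/s steady state
WALL_CLOCK_S = 10 * 3600 + 12 * 60     # ~10h 12m
PEAK_VRAM_GB, TOTAL_VRAM_GB = 33.83, 80.0

# Time the run would take if it held steady-state throughput throughout.
steady_state_s = TOTAL_TOKENS / THROUGHPUT_TOK_PER_S
unexplained_s = WALL_CLOCK_S - steady_state_s
vram_fraction = PEAK_VRAM_GB / TOTAL_VRAM_GB

print(f"steady-state time: {steady_state_s / 3600:.2f} h")
print(f"reported wall clock: {WALL_CLOCK_S / 3600:.2f} h")
print(f"time outside steady state: {unexplained_s / 60:.0f} min")
print(f"peak VRAM used: {vram_fraction:.0%} of one GPU")
```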

Lightweight quick_evals (HellaSwag / LAMBADA / pile_ppl) were skipped because of an import failure (No module named 'evals') in the training environment; the final-iter val_loss is the only eval recorded for this checkpoint.

Lineage

Predecessor backbone: aemack-org/cayley-10b (12L/d=1024, k16-2L-mlp_in, trained 16k iters / 10.5B tokens, val 3.173)
Same backbone, parity baseline: markhenry/vanilla-v5-parity (val 3.173, cold-stopped at the cayley-10b val to measure interpretability deltas at matched loss)
3L sibling at the same 20B budget: markhenry/cayley-small-3L-mlp_in-20B — val 3.1330 (-0.025 nats vs this run)
Larger sibling (24L/d=2048, 2L): in progress under cayley-large-2L-mlp_in-20B

Files

ckpt.pt — best checkpoint, iter 12716 (= final)
config.json — training configuration consumed by sparse_nanogpt.train
train_cayley_small_2L_mlp_in_20B.sh — exact training script for provenance
