vanilla-24L2048-parity-cold

A 1.3B-parameter vanilla GPT (no SAE bottleneck) trained to val-loss parity with markhenry/cayley-24L2048-32k-2L-mlp_in-20b, the CayleySAE variant of the same backbone. This is the primary baseline for alignment-tax comparisons: same architecture minus the sparsity-enforcing bottleneck, cold-stopped at matching loss.

Numbers

Metric           cayley-24L2048-32k-2L-mlp_in-20b   vanilla-24L2048-parity-cold
val_loss (CE)    2.7933                             2.7926
tokens seen      20.0B                              3.8B
pile_ppl         20.32                              18.89
hellaswag_acc    0.383                              0.379
lambada_acc      0.304                              0.304

Val-loss delta: 0.0007 nats, well inside the ~0.005 eval noise floor, so the two runs are treated as being at parity.
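The parity claim is plain arithmetic on the two validation losses reported above. A minimal sketch follows; the helper name `is_parity` is hypothetical, and the 0.005 threshold is the approximate noise floor quoted in this card, not a value read from the training code.

```python
def is_parity(loss_a: float, loss_b: float, noise_floor: float = 0.005) -> bool:
    """Return True when the absolute loss gap is below the eval noise floor."""
    return abs(loss_a - loss_b) < noise_floor


# Validation losses (cross-entropy, nats) taken from the Numbers table.
cayley_val_loss = 2.7933
vanilla_val_loss = 2.7926

delta = cayley_val_loss - vanilla_val_loss
print(f"delta = {delta:.4f} nats, parity = {is_parity(cayley_val_loss, vanilla_val_loss)}")
```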

Architecture

  • 24 layers, 16 heads, d_model = 2048 (~1.3B params)
  • sparsity_mode = none (standard GPT, no CayleySAE)
  • Trained with Muon + AdamW (peak LR 1e-2, min LR 1e-3)
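As a rough sanity check on the ~1.3B figure, the parameter count can be estimated from the listed dimensions. This is a back-of-envelope sketch, not the project's own accounting: it assumes a 4x MLP expansion, tied input/output embeddings, and a GPT-2-style vocabulary of 50257 tokens, none of which are stated in this card, and it ignores biases and normalization parameters. The function name `approx_gpt_params` is hypothetical.

```python
def approx_gpt_params(n_layer: int, d_model: int, vocab_size: int, mlp_ratio: int = 4) -> dict:
    """Estimate parameter counts for a standard GPT, ignoring biases and norms."""
    attn = 4 * d_model * d_model              # q, k, v and output projections
    mlp = 2 * mlp_ratio * d_model * d_model   # up and down projections (assumed 4x width)
    blocks = n_layer * (attn + mlp)
    embedding = vocab_size * d_model          # assumed tied with the output head
    return {"blocks": blocks, "embedding": embedding, "total": blocks + embedding}


# 24 layers and d_model = 2048 come from the card; vocab_size is an assumption.
counts = approx_gpt_params(n_layer=24, d_model=2048, vocab_size=50257)
head_dim = 2048 // 16  # per-head width implied by 16 heads

print(f"blocks: {counts['blocks'] / 1e9:.2f}B, total: {counts['total'] / 1e9:.2f}B, head_dim: {head_dim}")
```

Under these assumptions the transformer blocks alone account for about 1.21B parameters, and the embedding table brings the total to roughly 1.3B.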

Training

  • Dataset: FineWeb-Edu-100B (same shard ordering as the Cayley run)
  • Schedule: parity_adaptive -- flat at peak LR until val enters target band, then 895-iter linear warmdown to min LR
  • Trigger: val <= 2.7933 + 0.207 = 3.0003 (fired at iter 1950)
  • Cold stop: iter 2900, LR = 1e-3
  • Hardware: 4x NVIDIA B200, ~32 min wall clock
  • Wandb: vanilla-24L2048-parity-fresh-v2-extend
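The schedule described above can be sketched as a small function of the iteration number. This is an illustration under stated assumptions, not the project's scheduler: it omits any warmup phase, assumes the warmdown starts at the exact iteration where the trigger fires, and uses the hypothetical names `trigger_fired` and `parity_adaptive_lr`.

```python
from typing import Optional


def trigger_fired(val_loss: float, target: float = 2.7933, band: float = 0.207) -> bool:
    """True once validation loss enters the target band (target + band)."""
    return val_loss <= target + band


def parity_adaptive_lr(
    it: int,
    trigger_iter: Optional[int],
    peak_lr: float = 1e-2,
    min_lr: float = 1e-3,
    warmdown_iters: int = 895,
) -> float:
    """Flat at peak LR until the trigger, then linear warmdown to min LR."""
    if trigger_iter is None or it < trigger_iter:
        return peak_lr  # trigger has not fired yet: stay flat
    # Fraction of the warmdown completed, clamped so the LR holds at min_lr afterwards.
    progress = min((it - trigger_iter) / warmdown_iters, 1.0)
    return peak_lr + progress * (min_lr - peak_lr)


# Values from this card: trigger at iter 1950, cold stop at iter 2900.
for it in (0, 1950, 2400, 2845, 2900):
    print(it, parity_adaptive_lr(it, trigger_iter=1950))
```

Under the assumption that the warmdown begins at iter 1950, the LR reaches its minimum at iter 2845 and holds there until the cold stop at iter 2900, which agrees with the reported stopping LR of 1e-3.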

Files

  • ckpt.pt -- PyTorch checkpoint (5.1 GB). Contains model, optimizer_states, config, model_config, iter_num, best_val_loss, wandb_step_offset, parity_trigger_iter.
  • config.json -- training config snapshot.
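Before using a downloaded checkpoint, it can help to confirm that it carries the keys listed above. A minimal sketch, assuming the checkpoint loads as a plain Python dict; the helper name `missing_checkpoint_keys` is hypothetical, and the stand-in dict below holds placeholder values only so the example runs without the 5.1 GB file.

```python
# Keys listed in the Files section of this card.
EXPECTED_KEYS = {
    "model",
    "optimizer_states",
    "config",
    "model_config",
    "iter_num",
    "best_val_loss",
    "wandb_step_offset",
    "parity_trigger_iter",
}


def missing_checkpoint_keys(ckpt: dict) -> set:
    """Return the expected keys that are absent from a loaded checkpoint dict."""
    return EXPECTED_KEYS - set(ckpt)


# Stand-in for the real checkpoint: placeholder values, not actual contents.
fake_ckpt = {key: None for key in EXPECTED_KEYS if key != "optimizer_states"}
print(sorted(missing_checkpoint_keys(fake_ckpt)))
```

With the real file, the same helper would be called on the result of `torch.load("ckpt.pt", map_location="cpu", weights_only=False)`; an empty set means every listed key is present.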

Loading

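import torch
from sparse_nanogpt.model import GPT
from sparse_nanogpt.config import DeepTopKGPTConfig

# weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust.
ckpt = torch.load("ckpt.pt", map_location="cpu", weights_only=False)

# Rebuild the architecture from the saved model_config, then restore the weights.
model_config = DeepTopKGPTConfig(**ckpt["model_config"])
model = GPT(model_config)
model.load_state_dict(ckpt["model"])

# For inference, switch to eval mode with model.eval() before running the model.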

Citation

Part of the Sparse NanoGPT project.
