Attention Residuals
Replication of Attention Residuals (Kimi Team, arXiv:2603.15031) at 1B scale.
PreNorm baseline, 1B params
Block AttnRes (S=8, 9 blocks), 1B params
Architecture: 36 layers × 1536 hidden × 24 heads × 6144 FFN, ctx=1024, GPT-2 BPE vocab (50304).
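As a sanity check on the "1B params" label, the architecture line can be turned into a rough parameter count. This is a back-of-the-envelope sketch, not the project's own accounting: it assumes a two-matrix FFN, four d×d attention projections, tied input/output embeddings, and it ignores biases, norm weights, positional embeddings, and any extra AttnRes parameters.

```python
# Rough parameter estimate from the stated architecture.
# Assumptions (not confirmed by the model card): two-matrix FFN,
# Q/K/V/O projections of size d_model x d_model, tied embeddings,
# no biases, norms, positional embeddings, or AttnRes parameters counted.
N_LAYERS = 36
D_MODEL = 1536
N_HEADS = 24
D_FFN = 6144
VOCAB = 50304

head_dim = D_MODEL // N_HEADS              # per-head width
attn_params = 4 * D_MODEL * D_MODEL        # Q, K, V, O projections
ffn_params = 2 * D_MODEL * D_FFN           # up and down projections
per_layer = attn_params + ffn_params
block_params = N_LAYERS * per_layer        # all transformer layers
embed_params = VOCAB * D_MODEL             # token embedding (assumed tied)
total_params = block_params + embed_params

print(f"head dim:       {head_dim}")
print(f"per layer:      {per_layer:,}")
print(f"all layers:     {block_params:,}")
print(f"embeddings:     {embed_params:,}")
print(f"total (approx): {total_params / 1e9:.2f}B")
```

Under these assumptions the layers alone come to about 1.02B parameters, which is consistent with the 1B label; a gated three-matrix FFN would give a noticeably larger figure.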
Trained for 15000 steps × 98K tokens/step = 1.47B tokens of FineWeb-Edu sample-10BT, BF16, AdamW, cosine LR schedule, 8× B200 NVLink-5.
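The token budget above can be checked with a few lines of arithmetic. This sketch assumes that "98K tokens/step" means 96 sequences of 1024 tokens (98,304 tokens); the sequence count per step is a guess, not something the model card states.

```python
# Token budget check.
# Assumption: 98K tokens/step = 96 sequences x 1024 tokens (hypothetical split).
CTX_LEN = 1024
SEQS_PER_STEP = 96   # hypothetical; chosen so that tokens/step is ~98K
STEPS = 15_000

tokens_per_step = CTX_LEN * SEQS_PER_STEP
total_tokens = STEPS * tokens_per_step

print(f"tokens/step:  {tokens_per_step:,}")
print(f"total tokens: {total_tokens:,} (~{total_tokens / 1e9:.2f}B)")
```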
Each checkpoint is a torch.save dict containing model+optimizer state and the training config (, ). Load with:
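The original loading snippet is not reproduced on this page. What follows is a minimal sketch of loading such a file with plain PyTorch, assuming only that it was written with torch.save as a dict. The helper names `load_checkpoint` and `summarize` and the file name `checkpoint.pt` are hypothetical, and the dict's key names are inspected rather than assumed.

```python
import torch


def load_checkpoint(path):
    """Load a torch.save checkpoint dict onto the CPU (hypothetical helper)."""
    # weights_only=False lets torch unpickle non-tensor objects such as
    # optimizer state or a config object. It executes pickle code, so use it
    # only on files you trust.
    return torch.load(path, map_location="cpu", weights_only=False)


def summarize(ckpt):
    """Print each top-level key and the type of its value."""
    for key, value in ckpt.items():
        print(f"{key}: {type(value).__name__}")


# Usage (file name is a placeholder, not the real checkpoint name):
#   ckpt = load_checkpoint("checkpoint.pt")
#   summarize(ckpt)
```

Inspecting the top-level keys first shows where the model state, optimizer state, and config live before anything is passed to a model's `load_state_dict`.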
Full results, plots, and code in the project's git tree (not uploaded here).