4995 paper — Block AttnRes vs PreNorm baseline (1B)

Replication of Attention Residuals (Kimi Team, arXiv:2603.15031) at 1B scale.

Models

 Block AttnRes (S=8, 9 blocks), 1B params

Architecture: 36 layers × 1536 hidden × 24 heads × 6144 FFN, ctx=1024, GPT-2 BPE vocab (50304).
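As a rough sanity check on the "1B params" label, the shape above can be turned into a parameter estimate. This is a minimal sketch under assumptions the card does not state: four d×d attention projections, a two-matrix MLP, tied input/output embeddings, and biases, norms, and any AttnRes-specific parameters ignored.

```python
# Rough parameter estimate from the architecture line above.
# Assumptions (NOT from the model card): four d x d attention projections,
# a two-matrix MLP, tied embeddings; biases, norms, and any
# AttnRes-specific parameters are ignored.
n_layers, d_model, n_heads, d_ffn, vocab = 36, 1536, 24, 6144, 50304

attn_params = 4 * d_model * d_model   # Q, K, V, and output projections
mlp_params = 2 * d_model * d_ffn      # up and down projections
per_layer = attn_params + mlp_params
embedding = vocab * d_model           # counted once (tied embeddings assumed)

total = n_layers * per_layer + embedding
head_dim = d_model // n_heads

print(f"per layer: {per_layer:,}")
print(f"total:     {total:,} (~{total / 1e9:.2f}B)")
print(f"head dim:  {head_dim}")
```

Under these assumptions the estimate lands close to 1.1B, which is consistent with the nominal 1B label; a gated (three-matrix) MLP would give a larger figure.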

Trained for 15000 steps × 98K tokens/step = 1.47B tokens of FineWeb-Edu sample-10BT, BF16, AdamW, cosine LR schedule, 8× B200 NVLink-5.
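The token budget can be checked with a few lines of arithmetic. The sketch below assumes that "98K" means 98,304 tokens, i.e. 96 sequences of 1024 tokens per step; the card gives only the rounded figure, so the exact batch shape is a guess.

```python
# Token-budget arithmetic for the training run described above.
# Assumption (NOT from the model card): "98K tokens/step" = 96 sequences x 1024 tokens.
steps = 15_000
ctx = 1024
sequences_per_step = 96
tokens_per_step = sequences_per_step * ctx   # 98,304 under the assumption above

total_tokens = steps * tokens_per_step
sequences_per_gpu = sequences_per_step // 8  # split over the 8 GPUs, same assumption

print(f"tokens/step:   {tokens_per_step:,}")
print(f"total tokens:  {total_tokens:,} (~{total_tokens / 1e9:.2f}B)")
print(f"sequences/GPU: {sequences_per_gpu}")
```

Either reading of "98K" (98,000 or 98,304) rounds to the 1.47B total stated above.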

Each checkpoint is a torch.save dict containing the model and optimizer state plus the training config (, ). Load with:
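The original loading snippet is not reproduced here. As a stand-in, the following is a minimal sketch of how a `torch.save` dict could be loaded and inspected. The function names `load_checkpoint` and `summarize` are hypothetical, and the checkpoint path is left to the caller; the sketch lists the top-level keys rather than assuming their names.

```python
from typing import Any, Dict


def load_checkpoint(path: str) -> Dict[str, Any]:
    """Load a torch.save checkpoint onto the CPU (hypothetical helper)."""
    import torch  # imported lazily so summarize() works without PyTorch

    # weights_only=False allows arbitrary pickled objects (e.g. a config object);
    # only use it on files you trust.
    return torch.load(path, map_location="cpu", weights_only=False)


def summarize(ckpt: Dict[str, Any]) -> Dict[str, str]:
    """Map each top-level key to a short description of its value."""
    summary = {}
    for key, value in ckpt.items():
        if isinstance(value, dict):
            summary[key] = f"dict with {len(value)} entries"
        else:
            summary[key] = type(value).__name__
    return summary


# Example usage (the path is a placeholder, not a real filename from this repo):
#   ckpt = load_checkpoint("path/to/checkpoint.pt")
#   for key, desc in summarize(ckpt).items():
#       print(f"{key}: {desc}")
```

Inspecting the keys first avoids guessing how the model state, optimizer state, and config are named inside the dict; the model state can then be passed to `load_state_dict` on a model built from the project's own code.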

Full results, plots, and code are in the project's git tree (not uploaded here).
