4995 paper: Block AttnRes vs PreNorm baseline (1B)

Replication of Attention Residuals (Kimi Team, arXiv:2603.15031) at 1B scale.

Models

  •  PreNorm baseline, 1B params
  •  Block AttnRes (S=8, 9 blocks), 1B params

Architecture: 36 layers × 1536 hidden × 24 heads × 6144 FFN, ctx=1024, GPT-2 BPE vocab (50304).
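As a sanity check on the "1B" label, the shape above can be turned into a rough parameter count. This is only a sketch under assumptions the card does not state: no bias terms, norm and positional parameters ignored, a two-matrix FFN (up and down projection), and input/output embeddings tied. Any parameters added by Block AttnRes itself are also left out. The function name `estimate_params` is hypothetical.

```python
# Rough parameter count for the stated shape.
# Assumptions (not from the card): no biases, norm/positional parameters
# ignored, two-matrix FFN, tied input/output embeddings.
N_LAYERS = 36
D_MODEL = 1536
D_FFN = 6144
VOCAB = 50304


def estimate_params(n_layers, d_model, d_ffn, vocab):
    attn = 4 * d_model * d_model   # Q, K, V and output projections
    ffn = 2 * d_model * d_ffn      # up and down projections
    embed = vocab * d_model        # token embedding, shared with the output head
    return n_layers * (attn + ffn) + embed


total = estimate_params(N_LAYERS, D_MODEL, D_FFN, VOCAB)
print(f"{total / 1e9:.2f}B parameters")
```

Under these assumptions the estimate lands at roughly 1.1B, which is consistent with the 1B label; a gated (three-matrix) FFN would give a noticeably larger figure.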

Trained for 15000 steps × 98K tokens/step = 1.47B tokens of FineWeb-Edu sample-10BT, BF16, AdamW, cosine LR schedule, 8× B200 NVLink-5.
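The token budget can be reproduced with a few lines of arithmetic. The sketch below assumes that "98K" means 98,304 tokens, i.e. 96 sequences of 1024 tokens per optimizer step; the card gives only the rounded figure, so the value of `SEQS_PER_STEP` is a guess.

```python
STEPS = 15_000
SEQ_LEN = 1024        # ctx from the architecture line
SEQS_PER_STEP = 96    # assumption: 96 * 1024 = 98,304, read as "98K"

tokens_per_step = SEQS_PER_STEP * SEQ_LEN
total_tokens = STEPS * tokens_per_step
print(f"{tokens_per_step:,} tokens/step, {total_tokens / 1e9:.2f}B tokens total")
```

With these values the total rounds to the 1.47B tokens stated above.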

Each is a torch.save dict containing model+optimizer state and the training config (, ). Load with:
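A minimal sketch of loading one checkpoint and inspecting it, assuming only that the file is a dict written by `torch.save`. The helper names `load_checkpoint` and `summarize_checkpoint` and the file path are hypothetical, and the actual dictionary keys are not assumed here; the summary simply lists whatever top-level keys are present.

```python
def load_checkpoint(path):
    """Load a torch.save checkpoint onto the CPU."""
    import torch  # imported here so the summary helper works without PyTorch

    # weights_only=False allows arbitrary pickled objects (e.g. a config
    # object); only use it on files you trust.
    return torch.load(path, map_location="cpu", weights_only=False)


def summarize_checkpoint(ckpt):
    """Map each top-level key to the type name of its value."""
    return {key: type(value).__name__ for key, value in ckpt.items()}


# Usage (hypothetical file name):
# ckpt = load_checkpoint("checkpoint.pt")
# print(summarize_checkpoint(ckpt))
```

Once the keys are known, the model state can be passed to `load_state_dict` on a model built from the stored training config.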

Full results, plots, and code are in the project's git tree (not uploaded here).
