NeuroMamba v5-NoFastWeight -- FineWeb-Edu Validation

Architecture

A hybrid of sliding-window local attention and sparse global attention in a 5:1 layer ratio, inspired by Gemma 4 and Jamba.

  • 12 layers: 10x LocalBlock (sliding window GQA, w=128) + 2x GlobalBlock (full causal GQA, unified KV)
  • GQA: 6 query heads / 2 KV heads (3:1 compression)
  • SwiGLU FFN + RoPE + weight-tied lm_head
  • d_model=312, d_ffn=896, ~29M params
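
As a rough illustration of the numbers above, the sketch below lays out the 10 local + 2 global blocks, expresses the two attention masks as a predicate, and estimates the parameter count. Several things here are assumptions rather than facts from this card: the position of the global blocks (one after every five local blocks), the window convention (the query position counts toward w=128), the vocabulary size (50,257, as in the GPT-2 tokenizer), bias-free projections, and two norm weight vectors per block. The "unified KV" detail of GlobalBlock is not modeled. All function names are hypothetical.

```python
# Hypothetical sketch; layer ordering, window convention and vocab size are assumptions.
N_LOCAL, N_GLOBAL = 10, 2
D_MODEL, D_FFN = 312, 896
N_Q_HEADS, N_KV_HEADS = 6, 2
WINDOW = 128
VOCAB_SIZE = 50_257  # assumption: GPT-2 tokenizer; the card does not state it


def layer_pattern(n_local=N_LOCAL, n_global=N_GLOBAL):
    """Assumed layout: each run of local blocks is followed by one global block."""
    per_group = n_local // n_global
    pattern = []
    for _ in range(n_global):
        pattern += ["local"] * per_group + ["global"]
    return pattern


def can_attend(query_pos, key_pos, window=None):
    """Causal visibility test.

    window=None gives full causal attention (global block); otherwise only
    the last `window` positions, including the query itself, are visible.
    """
    if key_pos > query_pos:
        return False
    if window is None:
        return True
    return query_pos - key_pos < window


def estimate_params():
    """Rough count assuming no biases and a weight-tied embedding / lm_head."""
    head_dim = D_MODEL // N_Q_HEADS              # 312 / 6 = 52
    kv_dim = N_KV_HEADS * head_dim               # 2 * 52 = 104
    attn = 2 * D_MODEL * D_MODEL + 2 * D_MODEL * kv_dim  # q, o plus k, v
    ffn = 3 * D_MODEL * D_FFN                    # SwiGLU: gate, up, down
    norms = 2 * D_MODEL                          # two norm weight vectors per block
    n_layers = N_LOCAL + N_GLOBAL
    embed = VOCAB_SIZE * D_MODEL                 # counted once (tied with lm_head)
    return n_layers * (attn + ffn + norms) + D_MODEL + embed


print(layer_pattern())
print(f"{estimate_params() / 1e6:.1f}M parameters")
```

Under these assumptions the estimate lands close to the ~29M reported above, with a little over half of the parameters in the tied embedding matrix.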

Results vs Baseline

| Model | Params | Best eval_loss | Perplexity | vs Baseline |
|---|---|---|---|---|
| GPT-2 style baseline | 30M | 4.3645 | 78.6 | -- |
| NeuroMamba v5 | 29M | 4.4426 | 85.0 | +0.0781 loss (worse than baseline) |
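
The perplexity figures follow from the eval losses if the loss is the mean per-token cross-entropy in nats, which is the usual convention but an assumption here. A minimal check of that arithmetic:

```python
import math


def perplexity(loss_nats):
    """Perplexity is exp(mean per-token cross-entropy), with the loss in nats."""
    return math.exp(loss_nats)


baseline_loss = 4.3645  # GPT-2 style baseline, best eval_loss
model_loss = 4.4426     # NeuroMamba v5, best eval_loss

print(round(perplexity(baseline_loss), 1))   # 78.6
print(round(perplexity(model_loss), 1))      # 85.0
print(round(model_loss - baseline_loss, 4))  # 0.0781

# A loss gap of d nats multiplies perplexity by exp(d).
ratio = perplexity(model_loss) / perplexity(baseline_loss)
print(f"perplexity ratio: {ratio:.3f}")
```

The loss gap of 0.0781 nats corresponds to a perplexity roughly 8% higher than the baseline.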

Training

  • Dataset: FineWeb-Edu 10BT sample
  • Tokens: 600M of a 600M budget
  • Hardware: 1x NVIDIA L4 GPU
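
The step and token columns of the training curve are consistent with about 32,768 tokens per optimizer step. That figure is inferred from the logged rows, not stated in this card, and how it splits into batch size and sequence length is unknown. A small sketch of the bookkeeping under that assumption:

```python
import math

TOKENS_PER_STEP = 32_768  # assumption: inferred from the logged step/token pairs


def tokens_at(step, tokens_per_step=TOKENS_PER_STEP):
    """Cumulative tokens seen after `step` optimizer steps."""
    return step * tokens_per_step


def steps_for_budget(token_budget, tokens_per_step=TOKENS_PER_STEP):
    """Number of steps needed to consume at least `token_budget` tokens."""
    return math.ceil(token_budget / tokens_per_step)


for step in (500, 1_000, 18_000):
    print(step, f"{tokens_at(step) / 1e6:.0f}M")

print(steps_for_budget(600_000_000))
```

If the inferred step size is right, the 600M budget ends a few hundred steps after the last logged evaluation at step 18,000.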

Training curve

| Step | Train loss | Eval loss | Tokens |
|---|---|---|---|
| 500 | 8.0368 | 6.3743 | 16M |
| 1,000 | 5.9692 | 5.6882 | 33M |
| 1,500 | 5.5035 | 5.4098 | 49M |
| 2,000 | 5.3346 | 5.2246 | 66M |
| 2,500 | 5.1988 | 5.0980 | 82M |
| 3,000 | 5.0113 | 5.0130 | 98M |
| 3,500 | 4.9781 | 4.9107 | 115M |
| 4,000 | 4.9170 | 4.8307 | 131M |
| 4,500 | 4.7849 | 4.7754 | 147M |
| 5,000 | 4.7676 | 4.7213 | 164M |
| 5,500 | 4.7555 | 4.6904 | 180M |
| 6,000 | 4.6316 | 4.6546 | 197M |
| 6,500 | 4.6721 | 4.6264 | 213M |
| 7,000 | 4.6754 | 4.6041 | 229M |
| 7,500 | 4.5705 | 4.5837 | 246M |
| 8,000 | 4.5953 | 4.5623 | 262M |
| 8,500 | 4.6099 | 4.5519 | 279M |
| 9,000 | 4.5118 | 4.5389 | 295M |
| 9,500 | 4.5603 | 4.5234 | 311M |
| 10,000 | 4.5763 | 4.5162 | 328M |
| 10,500 | 4.4792 | 4.5065 | 344M |
| 11,000 | 4.5335 | 4.4955 | 360M |
| 11,500 | 4.5467 | 4.4885 | 377M |
| 12,000 | 4.4512 | 4.4827 | 393M |
| 12,500 | 4.5102 | 4.4778 | 410M |
| 13,000 | 4.5238 | 4.4703 | 426M |
| 13,500 | 4.4315 | 4.4659 | 442M |
| 14,000 | 4.5033 | 4.4602 | 459M |
| 14,500 | 4.5134 | 4.4567 | 475M |
| 15,000 | 4.4316 | 4.4535 | 492M |
| 15,500 | 4.4869 | 4.4502 | 508M |
| 16,000 | 4.4946 | 4.4484 | 524M |
| 16,500 | 4.4123 | 4.4462 | 541M |
| 17,000 | 4.4870 | 4.4441 | 557M |
| 17,500 | 4.4845 | 4.4433 | 573M |
| 18,000 | 4.3965 | 4.4426 | 590M |