NeuroMamba v5-NoFastWeight -- FineWeb-Edu Validation

Architecture

5:1 sliding-window-local / sparse-global hybrid inspired by Gemma 4 + Jamba.

12 layers: 10x LocalBlock (sliding window GQA, w=128) + 2x GlobalBlock (full causal GQA, unified KV)
GQA: 6 query heads / 2 KV heads (3:1 compression)
SwiGLU FFN + RoPE + weight-tied lm_head
d_model=312, d_ffn=896, ~29M params
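
The layout above can be sketched in plain Python. This is an illustrative sketch, not the model's code: the card gives the block counts, window size, and head counts, but not where the global blocks sit, how the window is counted, or how query heads are grouped. The placement of one GlobalBlock after every five LocalBlocks, the convention that w=128 includes the query token itself, the consecutive grouping of query heads, and all function names are assumptions.

```python
# Values taken from the architecture list above.
N_LAYERS = 12
N_Q_HEADS = 6
N_KV_HEADS = 2
D_MODEL = 312
WINDOW = 128

# Assumption: one GlobalBlock follows every five LocalBlocks (5:1 ratio).
LOCAL_PER_GLOBAL = 5


def layer_pattern(n_layers=N_LAYERS, local_per_global=LOCAL_PER_GLOBAL):
    """Return a 'local'/'global' tag per layer under the assumed placement."""
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local"
            for i in range(n_layers)]


def can_attend(query_pos, key_pos, kind, window=WINDOW):
    """Visibility rule for a single query/key pair."""
    if key_pos > query_pos:
        return False  # causal: never look at future tokens
    if kind == "global":
        return True   # full causal attention
    # Assumption: the window covers the query plus the window-1 tokens before it.
    return query_pos - key_pos < window


def kv_head_for(q_head, n_q=N_Q_HEADS, n_kv=N_KV_HEADS):
    """Map a query head to its shared KV head (assumed consecutive grouping)."""
    return q_head // (n_q // n_kv)


# Assumes d_model is split evenly across the query heads.
HEAD_DIM = D_MODEL // N_Q_HEADS
```

Under these assumptions, layers 6 and 12 are global, query heads 0-2 share KV head 0, and heads 3-5 share KV head 1.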

Results

Model	Params	Best eval_loss	Perplexity	Eval loss vs baseline
GPT-2 style baseline	30M	4.3645	78.6	--
NeuroMamba v5	29M	4.4426	85.0	+0.0781 (worse than baseline)
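
The perplexity column follows from the eval loss if the loss is the mean per-token cross-entropy in nats, which is the usual convention and is assumed here. A minimal check, with a hypothetical helper name:

```python
import math


def perplexity(loss_nats):
    """Perplexity for a mean per-token cross-entropy given in nats."""
    return math.exp(loss_nats)


baseline_loss = 4.3645    # GPT-2 style baseline, best eval_loss
neuromamba_loss = 4.4426  # NeuroMamba v5, best eval_loss

loss_gap = neuromamba_loss - baseline_loss
# A loss gap of g nats multiplies perplexity by exp(g).
ppl_ratio = math.exp(loss_gap)

print(round(perplexity(baseline_loss), 1))    # 78.6
print(round(perplexity(neuromamba_loss), 1))  # 85.0
print(round(loss_gap, 4))                     # 0.0781
```

The 0.0781 nat gap corresponds to roughly 8% higher perplexity than the baseline.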

Training log

Step	Train loss	Eval loss	Tokens
500	8.0368	6.3743	16M
1,000	5.9692	5.6882	33M
1,500	5.5035	5.4098	49M
2,000	5.3346	5.2246	66M
2,500	5.1988	5.0980	82M
3,000	5.0113	5.0130	98M
3,500	4.9781	4.9107	115M
4,000	4.9170	4.8307	131M
4,500	4.7849	4.7754	147M
5,000	4.7676	4.7213	164M
5,500	4.7555	4.6904	180M
6,000	4.6316	4.6546	197M
6,500	4.6721	4.6264	213M
7,000	4.6754	4.6041	229M
7,500	4.5705	4.5837	246M
8,000	4.5953	4.5623	262M
8,500	4.6099	4.5519	279M
9,000	4.5118	4.5389	295M
9,500	4.5603	4.5234	311M
10,000	4.5763	4.5162	328M
10,500	4.4792	4.5065	344M
11,000	4.5335	4.4955	360M
11,500	4.5467	4.4885	377M
12,000	4.4512	4.4827	393M
12,500	4.5102	4.4778	410M
13,000	4.5238	4.4703	426M
13,500	4.4315	4.4659	442M
14,000	4.5033	4.4602	459M
14,500	4.5134	4.4567	475M
15,000	4.4316	4.4535	492M
15,500	4.4869	4.4502	508M
16,000	4.4946	4.4484	524M
16,500	4.4123	4.4462	541M
17,000	4.4870	4.4441	557M
17,500	4.4845	4.4433	573M
18,000	4.3965	4.4426	590M
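
The Tokens column is consistent with a fixed budget of about 32,768 tokens per optimizer step. The card does not state this figure; it is inferred from the table and should be treated as an assumption. The sketch below checks a few rows against it:

```python
# Assumption: constant tokens per step, inferred from the log, not stated in the card.
TOKENS_PER_STEP = 32_768

# A few (step -> tokens in millions) pairs copied from the training log.
LOGGED = {500: 16, 1_000: 33, 9_500: 311, 12_500: 410, 18_000: 590}


def tokens_millions(step, tokens_per_step=TOKENS_PER_STEP):
    """Cumulative tokens seen after `step` steps, rounded to whole millions."""
    return round(step * tokens_per_step / 1e6)


# Collect any rows where the assumed budget disagrees with the logged value.
mismatches = {s: (tokens_millions(s), t)
              for s, t in LOGGED.items() if tokens_millions(s) != t}
print(mismatches)  # {}
```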
