Just Curious...

#1
by GODELEV - opened

You have:
Hidden dimension = 512
FFN = 248?

Are you experimenting?
Because I think your model relies on its attention for the heavy lifting rather than on the MLP.

Yeah, that's correct. With a strict 5.98M parameter limit, dropping the hidden dimension below 512 causes representational collapse, while dropping layers below 4 turns the model into a basic Markov chain. Squeezing the MLP intermediate size down to 248 was the trade-off to keep both. Attention definitely has to carry most of the weight here.
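To make the trade-off concrete, here is a rough sketch of how the per-layer weights split between attention and MLP at hidden size 512 and intermediate size 248. It assumes a bias-free LLaMA-style block (q/k/v/o projections plus a gate/up/down SwiGLU MLP and two RMSNorms) with full multi-head attention. The thread does not state the layer count, vocabulary size, or key/value width, so `layers`, `vocab`, and `kv_dim` are hypothetical parameters and the sketch is not expected to reproduce the exact 5.98M figure.

```python
def llama_style_params(hidden, intermediate, layers, vocab, kv_dim=None, tied_embeddings=True):
    """Approximate weight count for a bias-free LLaMA-style decoder.

    kv_dim is the total key/value projection width; None means full
    multi-head attention (kv_dim == hidden). All values other than
    hidden=512 and intermediate=248 are assumptions, not from the thread.
    """
    if kv_dim is None:
        kv_dim = hidden
    attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # q and o, then k and v
    mlp = 3 * hidden * intermediate                   # gate, up, down projections
    norms = 2 * hidden                                # two RMSNorm weight vectors
    per_layer = attn + mlp + norms
    embed = vocab * hidden * (1 if tied_embeddings else 2)
    total = layers * per_layer + embed + hidden       # + final norm
    return {"attn": attn, "mlp": mlp, "per_layer": per_layer, "total": total}


# layers=4 and vocab=1024 are placeholder values for illustration only.
counts = llama_style_params(hidden=512, intermediate=248, layers=4, vocab=1024)
print(counts["attn"], counts["mlp"])  # 1048576 380928
print(f"MLP share of block weights: {counts['mlp'] / (counts['attn'] + counts['mlp']):.1%}")
```

Under these assumptions the MLP holds only about a quarter of each block's matrix weights, which is consistent with the remark that attention carries most of the load. With grouped-query attention (a smaller `kv_dim`) the attention share would shrink somewhat.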

Probably worth dropping width and expanding your MLP dimension to 2.7x hidden, which is usually the standard. Interesting to see that a model can produce results with such a small MLP, though.

AtomixLabs org

Yeah, I considered that, but the math changes at 5.4M parameters. To get that 2.7x ratio, you'd have to sacrifice the 512 h_dim floor. Late 2025 scaling research ("The Optimal Architecture for Small Language Models") shows that once you drop below 512 width, the model's accuracy falls off a cliff regardless of how big the MLP is. We're prioritizing representational rank over compute density here.

For reference, GPT-S2-5M has a 192 hidden dim, 9 layers, and I think a 3x ratio, and it does pretty well on benchmarks. Worth a test, possibly?

AtomixLabs org

GPT-S2-5M is a cool project, but it actually proves the point about the bottleneck!

If you look at their README, they had to invent a custom 'XSA refresh gate' specifically to counteract 'diluted token identities.'
That 'dilution' is exactly the representational collapse that happens when h_dim is that low. They’re essentially 'hacking' the bottleneck by re-injecting the original embeddings into the layers because the 192-wide stream isn't high-rank enough to hold the information on its own. Since AtomixS2-5M-v1.0 is a native LLaMA-standard model without custom gates, we have to clear that 512 floor naturally to keep the identity from collapsing.

Hmm, somewhat, yeah. However, that is mainly a fix for the issue that XGQA causes: by forcing attn outputs to be orthogonal to the token values, the identity gets lost in the increasing residual magnitude and direction.

Fair point on XGQA exacerbating it. But the underlying physics are actually the same for standard architectures at that scale.

In a 512-dim space, the residual stream has enough volume to hold the original token embedding in one subspace while the attention heads write to another. At 192 dims, the attention updates physically run out of room and overwrite the token identity (like you said, getting lost in the residual magnitude). Since Aurelius/AtomixS2-v1.0-5M uses a native llama config, we can't use a refresh gate to save us when that happens, so we are hard-locked to the 512 width to give the stream enough room natively.

Well, what's your future plan?
Like going up to sub-100M param models?

Possibly, although GPT-S-5M also has XGQA and the same dim and depth, and yet does well without the XSA refresh gate.

AtomixLabs org

Possibly, although GPT-S-5M also has XGQA and the same dim and depth, and yet does well without the XSA refresh gate.

That is the core contradiction, though: if the gate was strictly an XGQA fix and the v1 (GPT-S-5M) had XGQA but no gate and "did fine", then why did they invest the massive engineering overhead to design a custom gated layer with causal depthwise convolutions for the S2?

The reality is that the v1 was suffering from the exact same token identity dilution because it simply did not have a patch yet. A 192-dim space eventually runs out of mathematical room to hold the representation across deep layers.

AtomixLabs org

Well, what's your future plan?
Like going up to sub-100M param models?

Yes, we absolutely plan to scale up. You can expect at least one model drop in the 20M to 100M parameter range before July 31st.

Our philosophy will remain the same. We plan to scale cleanly using standard, native architectures (respecting the 512+ hidden_size floor from the start) to ensure the model scales naturally. This avoids the need for custom architectural patches to prevent representational collapse as we increase the depth and width.

I mean, with a 34.75% avg score on the SLM leaderboard, it performs really quite well.

@MultivexAI Go to AxiomicLabs org page and scroll down to team members.

A good AI researcher tries to y'know, research about the model that they are making claims about. Well, my friend, you certainly did not.

AtomixLabs org

@MultivexAI Go to AxiomicLabs org page and scroll down to team members.

A good AI researcher tries to y'know, research about the model that they are making claims about. Well, my friend, you certainly did not.

I am well aware, which is exactly why I am having this debate with him. Who better to discuss the specific trade offs of the T-X3 and T-X4 architectures with than the person who actually designed them? It has been a great technical discussion.

Tradeoffs? I doubt there is much.

Both architectures achieve state-of-the-art performance on several benchmarks.

Furthermore, compared to Llama-like architectures, it is significantly better in many ways.

But is it smart to debate someone who built the models you're debating about? I wouldn't say so.

My question for you, my friend: did you run any ablations that actually showed improvements in benchmark scores with your 512-dim/0.5x FFN multiplier arch vs. a standard 2.7x or 4x FFN multiplier at the same param count?

I mean, Dillionv2 (qwen3 arch; basically llama) is well below 512-dim and it does quite well. Even Dillion, at 72 dims, does good(ish).
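For anyone who wants to set up the matched-parameter ablation asked about above, here is a minimal sketch that finds the widest hidden size fitting a fixed budget for a given FFN multiplier. It assumes a bias-free LLaMA-style block with full multi-head attention and tied embeddings. The helper names, `layers=4`, and `vocab=1024` are hypothetical placeholders, since the thread does not give those values.

```python
def param_count(hidden, layers, ffn_ratio, vocab):
    """Approximate weights for a bias-free LLaMA-style decoder with tied embeddings."""
    intermediate = round(hidden * ffn_ratio)
    per_layer = (
        4 * hidden * hidden          # q, k, v, o projections (full multi-head attention)
        + 3 * hidden * intermediate  # gate, up, down projections
        + 2 * hidden                 # two RMSNorm weight vectors
    )
    return layers * per_layer + vocab * hidden + hidden  # + embeddings + final norm


def widest_hidden(budget, layers, ffn_ratio, vocab, step=64):
    """Largest hidden size (a multiple of `step`) whose total stays within `budget`."""
    hidden = 0
    while param_count(hidden + step, layers, ffn_ratio, vocab) <= budget:
        hidden += step
    return hidden


# Placeholder settings; only hidden=512 and intermediate=248 come from the thread.
LAYERS, VOCAB = 4, 1024
budget = param_count(512, LAYERS, 248 / 512, VOCAB)

for ratio in (248 / 512, 2.7, 4.0):
    hidden = widest_hidden(budget, LAYERS, ratio, VOCAB)
    print(f"ffn_ratio={ratio:.2f} -> hidden={hidden}, params={param_count(hidden, LAYERS, ratio, VOCAB):,}")
```

The candidates this prints share a depth and roughly a budget, so training each on the same data and comparing benchmark scores would be one way to test the 512-width claim directly. Varying depth alongside width would make for a fuller sweep.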
