AtomixLabs/AtomixS2-5M-v1.0 · Just Curious...

by GODELEV - opened about 11 hours ago

GODELEV

about 11 hours ago

You have:
Hidden dimension = 512
FFN = 248?

Are you experimenting?
Because I think your model relies on attention for the heavy lifting rather than on the MLP.

MultivexAI

AtomixLabs org about 6 hours ago • edited about 6 hours ago

Yeah, that's correct. With a strict 5.98M parameter limit, dropping the hidden dimension below 512 causes representational collapse, while dropping layers below 4 turns the model into a basic Markov chain. Squeezing the MLP intermediate size down to 248 was the trade-off to keep both. Attention definitely has to carry most of the weight here.
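To make the trade-off concrete, here is a rough per-layer weight count. It is only a sketch: it assumes a LLaMA-style block with full multi-head attention (K/V projections as wide as the hidden size), a three-matrix SwiGLU MLP, and no biases. The function name `layer_breakdown` and the `kv_dim` parameter are made up for illustration; only hidden = 512 and intermediate = 248 come from this thread.

```python
def layer_breakdown(hidden, ffn, kv_dim=None):
    """Approximate weight counts for one LLaMA-style decoder layer (no biases)."""
    # Assumption: without grouped-query attention, K/V are as wide as the hidden size.
    kv_dim = hidden if kv_dim is None else kv_dim
    attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # q_proj, o_proj, k_proj, v_proj
    mlp = 3 * hidden * ffn                            # gate_proj, up_proj, down_proj
    norms = 2 * hidden                                # two RMSNorm weight vectors
    total = attn + mlp + norms
    return {"attn": attn, "mlp": mlp, "norms": norms, "attn_share": attn / total}


stats = layer_breakdown(hidden=512, ffn=248)
print(stats["attn"], stats["mlp"])    # 1048576 380928
print(round(stats["attn_share"], 2))  # 0.73
```

Under those assumptions, attention holds roughly three quarters of each layer's weights. Grouped-query attention (a smaller `kv_dim`) would lower that share.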

Datdanboi25

about 5 hours ago • edited about 5 hours ago

Probably worth dropping width and expanding your MLP dimension to 2.7x hidden, which is usually the standard. Interesting to see a model can produce results with such a small MLP, though.

MultivexAI

AtomixLabs org about 5 hours ago

Yeah, I considered that, but the math changes at 5.4M parameters. To get that 2.7x ratio, you'd have to sacrifice the 512 h_dim floor. Late 2025 scaling research ("The Optimal Architecture for Small Language Models") shows that once you drop below 512 width, the model's accuracy falls off a cliff regardless of how big the MLP is. We're prioritizing representational rank over compute density here.
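A back-of-the-envelope version of that arithmetic, as a sketch only. It assumes the same depth, full multi-head attention (4·d² weights per layer), and a three-matrix SwiGLU MLP (3·d·f weights per layer), and it ignores norms and the embedding table, so the real answer would shift a little. `matched_width` is a hypothetical helper, not something from the model repo.

```python
import math


def matched_width(hidden, ffn, ratio, multiple=1):
    """Widest hidden size whose layer, with ffn = ratio * width, fits the
    per-layer weight budget of the reference (hidden, ffn) layer."""
    budget = 4 * hidden * hidden + 3 * hidden * ffn  # attention + MLP weights
    # With f = ratio * d, one layer costs d^2 * (4 + 3 * ratio).
    width = math.sqrt(budget / (4 + 3 * ratio))
    # Round down to a multiple (e.g. 64) so the width splits evenly across heads.
    return int(width // multiple) * multiple


print(matched_width(512, 248, 2.7))               # 343
print(matched_width(512, 248, 2.7, multiple=64))  # 320
print(matched_width(512, 248, 4.0))               # 298
```

So under these assumptions, a 2.7x MLP at the same per-layer budget lands around 340 wide, well under 512. Whether that width actually hurts is the open question in this thread; the sketch only shows the cost side.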

Datdanboi25

about 5 hours ago

For reference, GPT-S2-5M has a 192 hidden dim, 9 layers, and I think a 3x ratio, and it does pretty well on benchmarks. Worth a test, possibly?

MultivexAI

AtomixLabs org about 5 hours ago

GPT-S2-5M is a cool project, but it actually proves the point about the bottleneck!

If you look at their README, they had to invent a custom 'XSA refresh gate' specifically to counteract 'diluted token identities.'
That 'dilution' is exactly the representational collapse that happens when h_dim is that low. They’re essentially 'hacking' the bottleneck by re-injecting the original embeddings into the layers because the 192-wide stream isn't high-rank enough to hold the information on its own. Since AtomixS2-5M-v1.0 is a native LLaMA-standard model without custom gates, we have to clear that 512 floor naturally to keep the identity from collapsing.

Datdanboi25

about 5 hours ago

Hmm, somewhat, yeah. However, that is mainly a fix for the issue XGQA causes: by forcing attention outputs to be orthogonal to token values, the identity gets lost in the increasing residual magnitude and direction.

MultivexAI

AtomixLabs org about 4 hours ago • edited about 4 hours ago

Fair point on XGQA exacerbating it. But the underlying physics are actually the same for standard architectures at that scale.

In a 512-dim space, the residual stream has enough volume to hold the original token embedding in one subspace while the attention heads write to another. At 192 dims, the attention updates physically run out of room and overwrite the token identity (like you said, getting lost in the residual magnitude). Since AtomixS2-5M-v1.0 uses a native LLaMA config, we can't use a refresh gate to save us when that happens, so we are hard-locked to the 512 width to give the stream enough room natively.
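One way to put a number on the "room" intuition is to measure how much two random directions overlap at each width. This is a toy sketch with Gaussian random vectors, not real embeddings or attention outputs, and `mean_abs_cosine` is a hypothetical helper. For large d the expected |cosine| is about sqrt(2 / (pi * d)), so it shrinks smoothly with width.

```python
import math
import random


def mean_abs_cosine(dim, pairs=500, seed=0):
    """Average |cosine similarity| between pairs of random Gaussian vectors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        b = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        total += abs(dot / (norm_a * norm_b))
    return total / pairs


for dim in (192, 512):
    estimate = mean_abs_cosine(dim)
    theory = math.sqrt(2 / (math.pi * dim))  # large-dimension approximation
    print(dim, round(estimate, 3), round(theory, 3))
```

Random directions interfere less at 512 than at 192, which fits the intuition above. The change is gradual, though, so this sketch says nothing about whether a hard floor exists at any particular width.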

GODELEV

about 4 hours ago

Well, what's your future plan?
Like going up to sub-100M param models?

Datdanboi25

about 4 hours ago

Possibly, although GPT-S-5M also has XGQA and the same dim and depth, and yet does well without the XSA refresh gate.

MultivexAI

AtomixLabs org about 4 hours ago

> Possibly, although GPT-S-5M also has XGQA and the same dim and depth, and yet does well without the XSA refresh gate.

That is the core contradiction, though: if the gate was strictly an XGQA fix and the v1 (GPT-S-5M) had XGQA but no gate and "did fine", then why did they invest the massive engineering overhead to design a custom gated layer with causal depthwise convolutions for the S2?

The reality is that the v1 was suffering from the exact same token identity dilution because it simply did not have a patch yet. A 192-dim space eventually runs out of mathematical room to hold the representation across deep layers.

MultivexAI

AtomixLabs org about 4 hours ago

> Well, what's your future plan?
> Like going up to sub-100M param models?

Yes, we absolutely plan to scale up. You can expect at least one model drop in the 20M to 100M parameter range before July 31st.

Our philosophy will remain the same. We plan to scale cleanly using standard, native architectures (respecting the 512+ hidden_size floor from the start) to ensure the model scales naturally. This avoids the need for custom architectural patches to prevent representational collapse as we increase the depth and width.

Datdanboi25

about 1 hour ago

I mean, a 34.75% avg score on the SLM leaderboard. It performs really quite well.

Harley-ml

about 1 hour ago

@MultivexAI Go to AxiomicLabs org page and scroll down to team members.

A good AI researcher tries to y'know, research about the model that they are making claims about. Well, my friend, you certainly did not.

MultivexAI

AtomixLabs org about 1 hour ago

> @MultivexAI Go to AxiomicLabs org page and scroll down to team members.
>
> A good AI researcher tries to y'know, research about the model that they are making claims about. Well, my friend, you certainly did not.

I am well aware, which is exactly why I am having this debate with him. Who better to discuss the specific trade-offs of the T-X3 and T-X4 architectures with than the person who actually designed them? It has been a great technical discussion.

Harley-ml

38 minutes ago • edited 29 minutes ago

Tradeoffs? I doubt there is much.

Both architectures achieve state-of-the-art performance on several benchmarks.

Furthermore, compared to Llama-like architectures, they are significantly better in many ways.

But is it smart to debate someone who built the models you're debating about? I wouldn't say so.

My question for you, my friend: did you run any ablations that actually showed improvements in benchmark scores with your 512-dim/0.5 FFN multiplier arch vs a standard 2.7x or 4x FFN multiplier at the same param count?

I mean, Dillionv2 (qwen3 arch; basically llama) is well below 512-dim and it does quite well. Even Dillion, at 72 dims, does good(ish).
