KitsuVp committed (verified) · Commit ccb4fe6 · Parent(s): d81b402

Upload README.md with huggingface_hub

Files changed (1): README.md (+177 −40)

README.md CHANGED
@@ -1,60 +1,197 @@
  ---
- library_name: transformers
  tags:
- - generated_from_trainer
- model-index:
- - name: NeoLLM
-   results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
  # NeoLLM

- This model is a fine-tuned version of [](https://huggingface.co/) on an unknown dataset.
- It achieves the following results on the evaluation set:
- - Loss: 3.7868

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 0.0006
- - train_batch_size: 64
- - eval_batch_size: 64
- - seed: 42
- - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_steps: 0.1
- - num_epochs: 1

- ### Training results

- | Training Loss | Epoch | Step  | Validation Loss |
- |:-------------:|:-----:|:-----:|:---------------:|
- | 4.3799        | 0.32  | 5000  | 4.3173          |
- | 4.0280        | 0.64  | 10000 | 3.9702          |
- | 3.8627        | 0.96  | 15000 | 3.7965          |
- | 3.8489        | 1.0   | 15625 | 3.7868          |

- ### Framework versions

- - Transformers 5.5.3
- - Pytorch 2.11.0+cu130
- - Datasets 4.8.4
- - Tokenizers 0.22.2

  ---
+ language: en
+ license: apache-2.0
  tags:
+ - causal-lm
+ - research
+ - fp8
+ - attention
+ - normalization
+ - neollm
+ datasets:
+ - HuggingFaceFW/fineweb-edu
  ---

  # NeoLLM

+ NeoLLM is a **~80 M parameter** decoder-only language model trained from scratch on
+ [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) in **FP8**
+ precision, completing training in approximately **6 hours** on a single NVIDIA RTX 5090.
+ It integrates a collection of recently published attention and normalization techniques
+ into a single architecture, with the goal of studying how they interact during
+ pretraining. The model is under active development, and the current checkpoint
+ represents an intermediate training state.
+
+ > **Author / contact:** [@Kyokopom](https://x.com/Kyokopom) on X
+ > **Repository:** [KitsuVp/NeoLLM](https://huggingface.co/KitsuVp/NeoLLM)
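
For quick experimentation, the checkpoint can be loaded through `transformers`. This is a sketch, not tested against the repository: it assumes the repo exposes a standard causal-LM interface, and the custom layers will likely require `trust_remote_code=True`.

```python
# Sketch only: assumes KitsuVp/NeoLLM exposes a standard causal-LM interface.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("KitsuVp/NeoLLM")
model = AutoModelForCausalLM.from_pretrained(
    "KitsuVp/NeoLLM",
    trust_remote_code=True,  # custom attention/normalization layers
)

inputs = tok("The Industrial Revolution began", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
```

Note the 512-token context window when preparing prompts.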
+
+ ---
+
+ ## Architecture
+
+ NeoLLM is a decoder-only transformer with the following configuration:
+
+ | Parameter | Value |
+ |---|---|
+ | Hidden size | 512 |
+ | Layers | 12 |
+ | Attention heads | 8 |
+ | KV heads (GQA) | 2 |
+ | Head dim | 64 |
+ | Intermediate size | 1536 |
+ | Vocabulary | Qwen3 tokenizer (64,402 tokens) |
+ | Context length | 512 tokens |
+
+ ### Parameter breakdown
+
+ | Parameter bucket | Count |
+ |---|---|
+ | **Total parameters** | 79.58M (79,582,952) |
+ | **Embedding parameters** (tied) | 32.97M (32,973,824) |
+ | **Non-embedding parameters** | 46.61M (46,609,128) |
+ | **Effective trainable parameters** | 79.58M (79,582,952) |
+
+ > Weight tying is **enabled**: the input embedding matrix and the language-model head
+ > share the same parameters, so the LM head adds no parameters of its own and the
+ > non-embedding budget is `total − embed = 46.61M`.
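
The counts in the table can be reproduced from the configuration above (hidden size 512, vocabulary 64,402); a quick arithmetic check:

```python
# Verify the parameter bookkeeping from the architecture table.
hidden, vocab = 512, 64_402

embed = vocab * hidden          # tied embedding / LM-head matrix
non_embed = 46_609_128          # per-layer weights, taken from the card
total = embed + non_embed

print(embed, total)  # 32973824 79582952
```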
+
+ ### Integrated techniques
+
+ Each layer combines the following mechanisms simultaneously.
+
+ **Normalization and residual stream**
+
+ - **SeeDNorm** ([arXiv:2510.22777](https://arxiv.org/abs/2510.22777)) — Applied to the Q and K
+   projections. Dynamically rescales the normalization based on the input's own statistics,
+   making the attention geometry more stable across varying input distributions.
+ - **PolyNorm** ([arXiv:2602.04902](https://arxiv.org/abs/2602.04902)) — Replaces the standard
+   MLP activation with three branches: linear (x), quadratic (x²), and cubic (x³), each
+   normalized and combined with learned weights. This allows the MLP to express both linear
+   and non-linear relationships simultaneously.
+ - **GPAS** ([arXiv:2506.22049](https://arxiv.org/abs/2506.22049)) — Gradient-Preserving
+   Activation Scaling. Applied to residual connections between sublayers; helps gradients
+   flow more cleanly during training without distorting the residual stream.
+ - **LayerNorm Scaling / LNS** ([arXiv:2502.05795](https://arxiv.org/abs/2502.05795)) — Each
+   layer's output is scaled by 1/√ℓ, where ℓ is the layer index. Directly addresses the
+   "Curse of Depth" in Pre-LN transformers.
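
To make the PolyNorm branch structure concrete, here is a minimal NumPy sketch. It is an illustration, not the repository's implementation; the RMS-style normalizer and the example weights are assumptions.

```python
import numpy as np

def rms_norm(u, eps=1e-6):
    """RMS-style normalization along the last axis."""
    return u / np.sqrt(np.mean(u * u, axis=-1, keepdims=True) + eps)

def poly_norm(x, w):
    """PolyNorm sketch: learned mix of normalized x, x^2, x^3 branches."""
    return w[0] * rms_norm(x) + w[1] * rms_norm(x**2) + w[2] * rms_norm(x**3)

x = np.random.randn(4, 512)
y = poly_norm(x, w=np.array([0.5, 0.3, 0.2]))

# With w = (1, 0, 0) the quadratic and cubic branches vanish and
# PolyNorm reduces to plain RMS normalization of x.
assert np.allclose(poly_norm(x, np.array([1.0, 0.0, 0.0])), rms_norm(x))
```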
+
+ **Attention mechanisms**
+
+ - **FAN** ([arXiv:2502.21309](https://arxiv.org/abs/2502.21309)) — Fourier Analysis Networks.
+   A portion of the input projection channels are dedicated to representing periodic patterns
+   (cosine/sine pairs), while the remainder handle standard linear content.
+ - **MEA** ([arXiv:2601.19611](https://arxiv.org/abs/2601.19611)) — Explicit Multi-head
+   Attention. Adds small learnable interaction matrices between attention heads for K and V.
+ - **LUCID** ([arXiv:2602.10410](https://arxiv.org/abs/2602.10410)) — Applies a learned
+   lower-triangular preconditioner to V before attention, decorrelating value representations
+   across positions.
+ - **Affine-Scaled Attention** ([arXiv:2602.23057](https://arxiv.org/abs/2602.23057)) — Adds
+   two learnable per-head scalars (α and β) to the softmax weights:
+   `[α·softmax(QKᵀ) + β]·V`.
+ - **XSA** ([arXiv:2603.09078](https://arxiv.org/abs/2603.09078)) — Exclusive Self Attention.
+   After computing attention, removes the component of the output aligned with the token's
+   own value vector.
+ - **Directional Routing** ([arXiv:2603.14923](https://arxiv.org/abs/2603.14923)) — Each head
+   learns K=4 directions in the output space; a learned router suppresses the attention output
+   along each direction per input.
+ - **Gated Attention** ([arXiv:2505.06708](https://arxiv.org/abs/2505.06708)) — A sigmoid gate
+   is applied to the attention output before the output projection, introducing non-linearity
+   and preventing attention sinks.
+ - **Momentum Attention** ([arXiv:2411.03884](https://arxiv.org/abs/2411.03884)) — Modifies Q
+   and K by subtracting a fraction of the previous position's Q and K values (causal
+   first-difference).
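
The affine-scaled variant is small enough to sketch end to end. This single-head NumPy toy (shapes and parameter values are illustrative, not the trained model's; the causal mask is omitted for brevity) applies `[α·softmax(QKᵀ/√d) + β]·V`:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def affine_scaled_attention(q, k, v, alpha=1.0, beta=0.0):
    """Single-head sketch of [alpha * softmax(QK^T / sqrt(d)) + beta] @ V."""
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))
    return (alpha * attn + beta) @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 64)) for _ in range(3))

# With alpha = 1, beta = 0 this is ordinary scaled dot-product attention.
standard = softmax(q @ k.T / np.sqrt(64)) @ v
assert np.allclose(affine_scaled_attention(q, k, v), standard)
```

In the model, α and β are learned per head rather than fixed scalars.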
+
+ **MLP**
+
+ - **Learnable Multipliers** ([arXiv:2601.04890](https://arxiv.org/abs/2601.04890)) — Adds
+   per-row and per-column learnable scalar parameters to each linear layer.
+ - **SimpleGPT** ([arXiv:2602.01212](https://arxiv.org/abs/2602.01212)) — A normalization
+   strategy derived from second-order geometry analysis, applied inside MLP projections to
+   improve optimization stability.
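
Per-row/per-column multipliers amount to wrapping a plain linear map in two learned diagonal scalings. A NumPy sketch (names, shapes, and initialization are assumptions, not the repository's code):

```python
import numpy as np

def scaled_linear(x, W, row_scale, col_scale):
    """Linear layer with learnable per-column (input-feature) and per-row
    (output-feature) scalar multipliers: y = row_scale * (W @ (col_scale * x))."""
    return row_scale * (W @ (col_scale * x))

rng = np.random.default_rng(1)
W = rng.standard_normal((1536, 512))   # intermediate_size x hidden_size
x = rng.standard_normal(512)
r = np.ones(1536)   # per-row multipliers, learned during training
c = np.ones(512)    # per-column multipliers, learned during training

# Initialized to 1, the multipliers leave the layer equal to a plain linear map.
assert np.allclose(scaled_linear(x, W, r, c), W @ x)
```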
+
+ ---
+
+ ## Training
+
+ | Setting | Value |
+ |---|---|
+ | Dataset | FineWeb-Edu (sample-10BT) |
+ | Tokens seen | ~0.51B (15,625 steps × batch 64 × length 512) |
+ | Precision | FP8 native (E4M3 weights/activations, E5M2 gradients) + BF16 fallback |
+ | Optimizer | Conda (Column-Normalized Adam) + GPA |
+ | Learning rate | 6e-04 with linear warmup (10% of steps) |
+ | Weight decay | 0.1 |
+ | Training time | ~1h 22m |
+ | Hardware | NVIDIA RTX 5090 (single GPU) |
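
The token count in the table follows directly from the run configuration (pure arithmetic, no model code):

```python
# Tokens seen = optimizer steps x global batch size x sequence length.
steps, batch, seq_len = 15_625, 64, 512
tokens = steps * batch * seq_len
print(f"{tokens:,}")  # 512,000,000  (~0.51B)
```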

+ ### Training curve
+
+ | Step | Train Loss | Val Loss |
+ |---|---|---|
+ | 5,000 | 4.380 | 4.317 |
+ | 10,000 | 4.028 | 3.970 |
+ | 15,000 | 3.863 | 3.797 |
+ | 15,625 | — | 3.787 |

+ ---
+
+ ## Limitations
+
+ - **Token budget** — Only ~0.5 B tokens seen, well below the estimated optimum for this
+   size; knowledge-intensive tasks will improve with more training.
+ - **Gradient spike at step 40k** — A gradient spike reorganized the attention pattern in
+   layer 9 that previously captured long-range token correlations. A checkpoint from
+   ~step 38k is expected to have better aggregate benchmark scores.
+ - **PolyNorm redundancy** — The quadratic branch has become partially redundant with the
+   linear branch; this will be corrected in the next training run.
+ - **Base model only** — Not instruction-tuned or aligned; purely a next-token-prediction
+   base model.
+
+ ---
+
+ ## References
+
+ All papers whose techniques are integrated into NeoLLM's architecture:
+
+ | Technique | Paper title | arXiv |
+ |---|---|---|
+ | SeeDNorm | Self-Rescaled Dynamic Normalization | [2510.22777](https://arxiv.org/abs/2510.22777) |
+ | MEA | Explicit Multi-head Attention | [2601.19611](https://arxiv.org/abs/2601.19611) |
+ | Learnable Multipliers | Freeing the Scale of Language Model Matrix Layers | [2601.04890](https://arxiv.org/abs/2601.04890) |
+ | Directional Routing | Directional Routing in Transformers | [2603.14923](https://arxiv.org/abs/2603.14923) |
+ | XSA | Exclusive Self Attention | [2603.09078](https://arxiv.org/abs/2603.09078) |
+ | Gated Attention | Gated Attention for LLMs | [2505.06708](https://arxiv.org/abs/2505.06708) |
+ | Affine-Scaled Attention | Affine-Scaled Attention | [2602.23057](https://arxiv.org/abs/2602.23057) |
+ | LNS | The Curse of Depth in LLMs | [2502.05795](https://arxiv.org/abs/2502.05795) |
+ | LUCID | Attention with Preconditioned Representations | [2602.10410](https://arxiv.org/abs/2602.10410) |
+ | FAN | Fourier Analysis Networks | [2502.21309](https://arxiv.org/abs/2502.21309) |
+ | SimpleGPT | SimpleGPT | [2602.01212](https://arxiv.org/abs/2602.01212) |
+ | GPAS | Gradient-Preserving Activation Scaling | [2506.22049](https://arxiv.org/abs/2506.22049) |
+ | PolyNorm | PolyNorm / PolyCom | [2602.04902](https://arxiv.org/abs/2602.04902) |
+ | Momentum Attention | Momentum Attention | [2411.03884](https://arxiv.org/abs/2411.03884) |
+ | TWEO (analysis ref.) | Transformers Without Extreme Outliers | [2511.23225](https://arxiv.org/abs/2511.23225) |

+ ---
+
+ ## Citation
+
+ ```bibtex
+ @misc{neollm2026,
+   title  = {NeoLLM: A Research Language Model Integrating Recent Attention and Normalization Techniques},
+   author = {KitsuVp},
+   year   = {2026},
+   url    = {https://huggingface.co/KitsuVp/NeoLLM}
+ }
+ ```

+ ---
+
+ ## Author
+
+ [@Kyokopom](https://x.com/Kyokopom) on X
+
+ ---
+
+ ## License
+
+ Apache 2.0