Aurelius-Llama-v2.5-10M-Large

Part of The Aurelius TinyStories Collection, a specialized series of highly optimized, sub-10M parameter models trained exclusively on the TinyStories dataset. This specific model represents the "Large" scale of the collection, maximizing expressive density, syntactic accuracy, and contextual coherence within a strictly bounded 9.9M parameter footprint.

📝 Technical Log & Development Blog: Expanding Width, Securing Isometry

Following the development of the smaller Nano-scale prototype, the objective of the v2.5-Large run was to expand the hidden dimension and layer depth as far as our local hardware budget allowed, targeting a dense 9.9M-parameter configuration.

📋 Model Configuration

The architectural parameters for this model run are configured as follows:

Parameter	Value	Description
`model_type`	`llama`	Underlying transformer architecture
`num_hidden_layers`	`6`	Number of transformer decoder layers (depth)
`hidden_size`	`384`	Hidden dimension size (d_model)
`intermediate_size`	`1008`	MLP gate/up projection dimension (a multiple of 16: 63 × 16)
`num_attention_heads`	`6`	Number of query attention heads
`num_key_value_heads`	`2`	Key-value heads (enables 3:1 ratio GQA)
`head_dim`	`64`	Vector dimension per attention head
`max_position_embeddings`	`416`	Context window size
`vocab_size`	`1536`	Compressed target vocabulary size
`hidden_act`	`silu`	SiLU activation (gates the SwiGLU MLP)
`tie_word_embeddings`	`true`	Shared input/output embedding representations
`rope_theta`	`812.5`	Custom rotary positional embedding base frequency
`attention_bias` / `mlp_bias`	`false`	Linear layer bias configuration
`bos_token_id` / `eos_token_id`	`0` / `1`	Special token mappings

🔬 Deep-Dive: Mitigating the Softmax Bottleneck at 384-Width

When a model's hidden dimension d_model is small (here, under 512), the output projection becomes a limiting factor known as the Softmax Bottleneck.

As investigated in language model saturation studies (see References), when the hidden dimension is much smaller than the rank of the contextual probability distribution of the target language, the output projection saturates. Stacking more layers cannot compensate for the narrow width, and the model's token representations degrade.

Vocabulary Rank Alignment

Our model operates at a hidden dimension of d_model = 384. To prevent representation saturation, we compressed our vocabulary down to a dense, customized set of 1,536 tokens (4 × 384). Shrinking the vocabulary toward the width of the hidden dimension narrows the gap between the rank our prediction matrix can express (at most 384) and the lower-rank contextual probability distribution of the TinyStories corpus. This avoided the degenerate-representation trap and allowed our 6-layer model to train stably to completion.
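
The rank argument can be written out briefly. This is a sketch in our own notation, not taken from the cited study: H (N × d) stacks the final hidden states of N contexts, W (V × d) is the tied embedding matrix, d = 384 and V = 1536.

```latex
\operatorname{rank}\left(H W^{\top}\right) \le \min(d, V) = \min(384,\ 1536) = 384,
\qquad
\frac{V}{d} = \frac{1536}{384} = 4
```

The logit matrix can never exceed rank 384, whatever the vocabulary size. The N × V matrix of target log-probabilities has rank at most V, so a smaller vocabulary caps how far the target can exceed that bound.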

The Attention Layout: Grouped-Query Attention

Our attention architecture uses Grouped-Query Attention (GQA) with 6 query heads and 2 key-value heads (a 3:1 head ratio), so each group of three query heads shares one key-value head. This layout forces the model to compress its keys and values into a shared latent subspace. In our runs this acts as a regularizer, preventing individual attention heads from developing isolated representations that lead to early overfitting on the pre-training corpus.
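
The head sharing can be made concrete with a few lines of arithmetic. This is an illustrative sketch that assumes the usual contiguous grouping, where consecutive query heads share a key-value head; the variable names are ours.

```python
NUM_QUERY_HEADS = 6
NUM_KV_HEADS = 2
HEAD_DIM = 64

# Each KV head serves a contiguous group of query heads.
group_size = NUM_QUERY_HEADS // NUM_KV_HEADS

# kv_head_for_query[i] is the KV head that query head i attends with.
kv_head_for_query = [q // group_size for q in range(NUM_QUERY_HEADS)]

# Output width of the key (or value) projection, with and without sharing.
kv_width_gqa = NUM_KV_HEADS * HEAD_DIM
kv_width_mha = NUM_QUERY_HEADS * HEAD_DIM

print(kv_head_for_query)
print(kv_width_gqa, kv_width_mha)
```

Query heads 0 to 2 read from KV head 0 and heads 3 to 5 from KV head 1, so the key and value projections are 128 wide instead of the 384 that full multi-head attention would need.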

📚 The Data & Training Volume: High-Density Saturation

To train the 9.9M configuration, we loaded a pre-compiled dataset of 500,000 unique stories from the TinyStoriesV2 dataset, creating a base corpus of approximately 117 million unique tokens.

Tokens per Step: 64 sequences * 416 tokens/sequence = 26,624 tokens/step
Total Training Steps: 30,252 steps
Total Tokens Processed: 30,252 * 26,624 = 805,429,248 tokens

This represents approximately 6.9 full epochs over the 500,000-story corpus. This overtraining density, averaging over 81 tokens processed per parameter, is what allowed the model to reach its final validation loss of 1.2232 (perplexity 3.40) and converge to stable grammar.
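
The figures above can be re-derived in a few lines. This sketch uses the rounded corpus and parameter sizes quoted in the text and assumes the validation loss is a mean cross-entropy in nats, so that perplexity is its exponential.

```python
import math

SEQS_PER_STEP = 64
SEQ_LEN = 416
TOTAL_STEPS = 30_252
CORPUS_TOKENS = 117_000_000  # approximate, as stated above
PARAMS = 9_900_000           # approximate, as stated above
VAL_LOSS = 1.2232            # assumed to be mean cross-entropy in nats

tokens_per_step = SEQS_PER_STEP * SEQ_LEN
total_tokens = TOTAL_STEPS * tokens_per_step
epochs = total_tokens / CORPUS_TOKENS
tokens_per_param = total_tokens / PARAMS
perplexity = math.exp(VAL_LOSS)

print(f"{tokens_per_step:,} tokens/step, {total_tokens:,} tokens total")
print(f"{epochs:.1f} epochs, {tokens_per_param:.1f} tokens/param, perplexity {perplexity:.2f}")
```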

⚓ The Attention Sink Anchor

To prevent long-context attention degradation during inference, we engineered a dedicated Attention-Sink Anchor (<sink>) into our dataset pipeline.

During data packaging, the <sink> token (ID 3) is force-prepended at position 0 of every sequence block. This gives the attention heads a permanent position to which they can assign attention mass they would otherwise have to spread over content tokens, which keeps the attention maps stable as the sequence length grows.
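
A minimal sketch of what such a packaging step could look like is shown here. The actual pipeline is not published in this card, so `pack_with_sink` is a hypothetical helper, and two details are assumptions: the sink occupies one of the 416 positions in each block, and leftover tokens that cannot fill a block are dropped.

```python
SINK_ID = 3       # <sink> token ID, from the tokenizer config
BLOCK_SIZE = 416  # max_position_embeddings


def pack_with_sink(token_stream, block_size=BLOCK_SIZE, sink_id=SINK_ID):
    """Split a flat list of token IDs into fixed-size blocks that start with the sink.

    Hypothetical helper: trailing tokens that cannot fill a block are dropped.
    """
    payload = block_size - 1  # one slot per block is reserved for the sink
    blocks = []
    for start in range(0, len(token_stream) - payload + 1, payload):
        blocks.append([sink_id] + token_stream[start:start + payload])
    return blocks


# Toy stream of 1,000 placeholder token IDs (values are arbitrary).
stream = [10 + (i % 50) for i in range(1000)]
blocks = pack_with_sink(stream)
print(len(blocks), len(blocks[0]), blocks[0][0])
```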

Automatically Injecting the Sink via Tokenizer Post-Processing

To guarantee that the <sink> token is consistently assigned to position 0 during both pre-training and downstream user inference, we configured a native "post_processor" directly in the tokenizer.json file.

Using Hugging Face's TemplateProcessing, the tokenizer automatically prepends the <sink> token, followed by the standard start-of-sequence token <s> (ID 0), to any input string. This eliminates the need for manual prompt modification, ensuring the attention heads always have a dedicated, permanent coordinate at step 0.

Here is the exact post-processing configuration embedded in the tokenizer:

  "post_processor": {
    "type": "TemplateProcessing",
    "single": [
      {
        "SpecialToken": {
          "id": "<sink>",
          "type_id": 0
        }
      },
      {
        "SpecialToken": {
          "id": "<s>",
          "type_id": 0
        }
      },
      {
        "Sequence": {
          "id": "A",
          "type_id": 0
        }
      }
    ],
    "pair": [
      {
        "SpecialToken": {
          "id": "<sink>",
          "type_id": 0
        }
      },
      {
        "SpecialToken": {
          "id": "<s>",
          "type_id": 0
        }
      },
      {
        "Sequence": {
          "id": "A",
          "type_id": 0
        }
      },
      {
        "SpecialToken": {
          "id": "</s>",
          "type_id": 0
        }
      },
      {
        "Sequence": {
          "id": "B",
          "type_id": 1
        }
      }
    ],
    "special_tokens": {
      "<s>": {
        "id": "<s>",
        "ids": [0],
        "tokens": ["<s>"]
      },
      "<sink>": {
        "id": "<sink>",
        "ids": [3],
        "tokens": ["<sink>"]
      }
    }
  }

By embedding this processing natively, any call to tokenizer(prompt) automatically maps the inputs to a <sink> <s> [Prompt] structure. The result is highly stable long-context attention maps, allowing the model to write full, 120+ token stories cleanly.
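
The effect of the two templates on token IDs can be emulated in plain Python. This is an illustration rather than the tokenizers library itself: the helper names are ours, and the ID 1 for </s> is an assumption taken from `eos_token_id` in the configuration table, since the `special_tokens` map shown above does not list it.

```python
SINK_ID, BOS_ID, EOS_ID = 3, 0, 1  # <sink>, <s>, and (assumed) </s>


def apply_single(a_ids):
    """Emulate the "single" template: <sink> <s> $A."""
    ids = [SINK_ID, BOS_ID] + list(a_ids)
    type_ids = [0] * len(ids)
    return ids, type_ids


def apply_pair(a_ids, b_ids):
    """Emulate the "pair" template: <sink> <s> $A </s> $B (B has type_id 1)."""
    ids = [SINK_ID, BOS_ID] + list(a_ids) + [EOS_ID] + list(b_ids)
    type_ids = [0] * (len(a_ids) + 3) + [1] * len(b_ids)
    return ids, type_ids


# Placeholder prompt IDs; real values come from the BPE model.
print(apply_single([42, 43]))
print(apply_pair([42, 43], [44]))
```

With the real tokenizer, the same property can be checked by confirming that the first two entries of `tokenizer(prompt)["input_ids"]` are 3 and 0.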

🛠️ The Suffix-Space Tokenizer (SST) & Trailing Normalization

To prevent first-step generation degradation caused by trailing spaces in user prompts, we adopted two core tokenizer philosophies:

Suffix-Space BPE (SST): Our tokenizer is built on a Suffix-Space philosophy. Words are tokenized with their trailing space attached (e.g., `once `, `upon `, `a `). This mirrors the way punctuation marks attach directly to the preceding word, so the model does not have to spend capacity learning separate spacing rules.
Built-In Trailing-Space Normalization: We baked a custom trailing-space normalizer directly into the tokenizer.json pre-processing pipeline. We replaced the standard normalizer block with a two-step boundary replacement sequence:

  "normalizer": {
    "type": "Sequence",
    "normalizers": [
      {
        "type": "Replace",
        "pattern": {
          "Regex": "$"
        },
        "content": " "
      },
      {
        "type": "Replace",
        "pattern": {
          "Regex": "\\s\\s+$"
        },
        "content": " "
      }
    ]
  }

The first rule appends a single space at the very end of the prompt string; the second collapses any run of two or more trailing whitespace characters into one space. Together they standardize any trailing whitespace (or lack thereof) to exactly one space. This ensures the final word of your prompt is always parsed as a complete, cleanly spaced token, keeping the model's generation starting state within its training distribution.
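
The behavior of the two rules can be tried out with Python's `re` module. This is an emulation under a stated assumption: Python's regex engine stands in for the one used by the tokenizers library, and the two may differ on edge cases such as prompts ending in a newline. The function name is ours.

```python
import re


def normalize_trailing_space(text):
    """Emulate the two Replace rules from the normalizer block.

    Assumption: Python's regex engine approximates the tokenizer's engine.
    """
    text = re.sub(r"$", " ", text)        # rule 1: append one space at the end
    text = re.sub(r"\s\s+$", " ", text)   # rule 2: collapse 2+ trailing whitespace chars
    return text


for prompt in ["Paco", "Paco ", "Paco   "]:
    print(repr(normalize_trailing_space(prompt)))
```

All three prompts normalize to the same string, `'Paco '`, so the model sees the same final token whether or not the user typed trailing spaces.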

🛡️ Important Disclaimer & Liability Limitation

This model is provided strictly "as is" and "with all faults," without warranty of any kind, express or implied.

Experimental & Research Nature Only: This model is an experimental, micro-scale prototype developed strictly for educational, scientific, and academic benchmarking purposes. Any stated "improvements" or "capabilities" are relative only to other micro-scale baselines and do not indicate suitability for production environments, commercial applications, or consumer-facing products.
No Safety Alignment: While the pre-training dataset (TinyStoriesV2) is conceptually designed around simple, child-like narratives, this model has not undergone safety tuning or toxic content filtering. It can output unpredictable, nonsensical, or potentially inappropriate text. Consequently, under no circumstances should this model or its outputs be deemed safe, verified, or appropriate for children or general public interaction.
User Assumption of Risk: Any output generated by this model is the result of statistical text completion and does not represent the views, opinions, or endorsements of the developers or hosting entities. The end-user assumes all liabilities and risks associated with running, testing, or utilizing the model or any downstream text generated by it.
Architectural and Trademark Clarification: The use of "Llama" in the model name refers solely to the underlying open-source mathematical architecture used to structure the network layers (such as RMSNorm, SwiGLU, and RoPE). This model is trained from scratch and is not affiliated with, endorsed by, or associated with Meta Platforms, Inc. or any of its affiliates.

🛠️ Usage & Integration

Because the model is exported with standard, native Llama layers, it loads with the stock Hugging Face transformers library; no custom configuration or remote-code execution flag (trust_remote_code) is needed.

We recommend using a moderate temperature paired with Min-P sampling for the highest-fidelity outputs:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MultivexAI/Aurelius-Llama-v2.5-10M-Large"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Note: You can omit the trailing space; the tokenizer normalizer will automatically handle it!
prompt = "Once upon a time, a little boy named Paco"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample with a moderate temperature and Min-P filtering.
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    temperature=0.50,
    min_p=0.15,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,  # reuse EOS as the padding token
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
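
For readers unfamiliar with Min-P, the filtering rule behind the `min_p` argument can be sketched in pure Python: a token survives only if its probability is at least `min_p` times that of the most likely token. This is an illustration of the rule on a toy distribution, not the transformers implementation, which works on logit tensors; the function name is ours.

```python
def min_p_filter(probs, min_p=0.15):
    """Keep tokens with probability >= min_p * max(probs), then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


# Toy next-token distribution over four tokens (values are illustrative).
probs = [0.60, 0.30, 0.08, 0.02]
filtered = min_p_filter(probs)
print([round(p, 3) for p in filtered])
```

Here the threshold is 0.15 × 0.60 = 0.09, so the two low-probability tokens are dropped and the remaining mass is split two thirds to one third. Because the cutoff scales with the top probability, the filter tightens when the model is confident and loosens when it is not.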

📚 References

Godey, N., de la Clergerie, É., & Sagot, B. (2024). "Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck." arXiv preprint arXiv:2404.07647.
Hugging Face (2025). "The Optimal Architecture for Small Language Models."
Xiao, G., et al. (2023). "Efficient Streaming Language Models with Attention Sinks." arXiv preprint arXiv:2309.17453.
