Aurelius-Llama-v2.0-1.5M-Nano
Part of The Aurelius TinyStories Collection, a specialized series of highly optimized, sub-10M parameter models trained exclusively on the TinyStories dataset. This collection explores the boundaries of grammatical coherence, narrative depth, and reasoning capacity at micro-scale.
Technical Log & Development Blog: Shrinking the Footprint, Keeping the Coherence
Hey everyone! Here is the technical breakdown of what I've been working on.
If you followed my previous runs, you know I had a lot of fun building Aurelius v0.5 (3.2M) and Aurelius v1.0 (4.0M). They were great proofs-of-concept, but I couldn't stop staring at the parameter counts. I kept asking myself how much of that capacity was actually doing work, and how much was just structural bloat.
The Aurelius Sandbox: What I'm Doing Here
The Aurelius collection is my personal playground. I use it to stress-test my pre-training code, push my local GPU hardware to its limit, experiment with custom layers, and critically analyze my outputs to see where the physical limits of language break down on micro-scale parameters.
After releasing the earlier checkpoints, I wanted to see if I could build a heavily compressed, high-density version, cutting active parameters to under 1.5M (practically half the size of v0.5) while targeting story-writing quality similar to the 4.0M v1.0 model.
Fitting inside the L2 Cache: The 1.49M Footprint
At 1.49M parameters, the weights file is under 6 megabytes on disk.
- It occupies less space than a standard photograph taken on a modern smartphone.
- At this size, the full set of weights can fit inside the last-level (L2 or L3) cache of many modern CPUs, so inference does not have to be bottlenecked by system RAM bandwidth.
- In principle, the model is small enough to run on a microcontroller-class device, such as an offline IoT sensor or a basic smart device, provided it has a few megabytes of memory to spare.
Model Configuration
The complete architectural parameters for this model run are structured as follows:
| Parameter | Value | Description |
|---|---|---|
| `model_type` | `llama` | Underlying transformer architecture |
| `num_hidden_layers` | 4 | Number of transformer decoder layers (depth) |
| `hidden_size` | 176 | Hidden dimension size ($d_{model}$) |
| `intermediate_size` | 432 | MLP gate/up projection dimension |
| `num_attention_heads` | 4 | Number of query attention heads |
| `num_key_value_heads` | 1 | Key-value heads (enables 4:1 ratio GQA) |
| `head_dim` | 44 | Vector dimension per attention head |
| `max_position_embeddings` | 320 | Context window size |
| `vocab_size` | 1,536 | Compressed target vocabulary size |
| `hidden_act` | `silu` | SwiGLU activation function |
| `tie_word_embeddings` | `true` | Shared input/output embedding representations |
| `rope_theta` | 600.0 | Custom rotary positional embedding base frequency |
| `attention_bias` / `mlp_bias` | `false` | Linear layer bias configuration |
| `bos_token_id` / `eos_token_id` | 0 / 1 | Special token mappings |
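As a sanity check on the 1.49M figure, the parameter count can be recomputed from the table. This is a minimal sketch that assumes the usual Llama-style layout (bias-free q/k/v/o and gate/up/down projections, two RMSNorm weights per layer, one final RMSNorm, and an LM head tied to the embedding matrix). The helper name `count_parameters` is hypothetical, and fp32 storage is an assumption used only for the size estimate.

```python
# Values copied from the configuration table above.
cfg = {
    "vocab_size": 1536,
    "hidden_size": 176,
    "intermediate_size": 432,
    "num_hidden_layers": 4,
    "num_attention_heads": 4,
    "num_key_value_heads": 1,
    "head_dim": 44,
}


def count_parameters(c):
    """Count weights for an assumed Llama-style decoder with tied embeddings."""
    d = c["hidden_size"]
    q_dim = c["num_attention_heads"] * c["head_dim"]    # width of q_proj / o_proj
    kv_dim = c["num_key_value_heads"] * c["head_dim"]   # width of k_proj / v_proj
    embed = c["vocab_size"] * d                         # shared with the LM head
    attn = d * q_dim + 2 * d * kv_dim + q_dim * d       # q, k, v, o (no bias)
    mlp = 3 * d * c["intermediate_size"]                # gate, up, down (no bias)
    norms = 2 * d                                       # two RMSNorm weights per layer
    per_layer = attn + mlp + norms
    return embed + c["num_hidden_layers"] * per_layer + d  # + final RMSNorm


total = count_parameters(cfg)
print(total)                               # 1494064
print(f"{total * 4 / 1e6:.2f} MB (fp32)")  # 5.98 MB (fp32)
```

Under these assumptions the total lands at about 1.49M parameters and just under 6 MB in fp32, which matches the figures quoted in this post.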
Deep-Dive: Overcoming the 512-Dimension Softmax Bottleneck
After releasing Aurelius v0.5 and v1.0, I got curious about the theoretical scaling limits of micro-architectures. I wanted to understand why language models with extremely small hidden dimensions often underperform, even when trained on clean, simplified datasets like TinyStories.
During my research, I found an excellent paper by Godey et al. (2024) on Hugging Face: "Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck".
The paper investigates why small models underperform relative to their parameter scaling trends. The authors identify a fundamental architectural trap: the Softmax Bottleneck. When a model's hidden dimension d_model is too small (specifically under roughly 1000 dimensions, and most severely at or below 512 dimensions), there is a severe mathematical mismatch between the low rank of the hidden space and the high rank of the target contextual probability distribution of natural language.
Furthermore, a late-2025 study on Hugging Face, "The Optimal Architecture for Small Language Models", reported a bimodal performance distribution of small models (clustering into ~38% vs ~32% accuracy tiers). In its experiments, hidden_size >= 512 behaved like a hard threshold: below 512 dimensions, the model falls off a cliff. Crucially, the study found that even stacking the model to an extremely deep 32 or 64 layers did not compensate for a narrow width.
Applying the Insights: Vocabulary Rank Reduction
Our model operates at a super-narrow hidden dimension of d_model = 176, putting us deep inside that critical 512-dimension danger zone. Making the model wider (d_model > 512) would blow my parameter budget.
Instead of focusing only on model width, I resized the vocabulary space of the language itself to match the actual physical rank of our 176-wide hidden dimension.
The softmax bottleneck is caused by projecting a narrow hidden vector (d = 176) into a massive vocabulary space (like LLaMA's standard 32,000 or GPT-2's 50,257) via the LM head. By shrinking our vocabulary down to a highly dense, customized 1,536 tokens, the vocabulary-to-width ratio drops from roughly 182:1 (32,000 / 176) to under 9:1 (1,536 / 176). This narrows the gap between the rank the LM head can express (at most 176) and the size of the output distribution it has to model, which is already lower-rank for the TinyStories corpus. In my runs, this avoided the degenerate latent representation trap, allowing our 4-layer, 176-wide model to train stably to completion without saturation.
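The rank limit behind the bottleneck can be checked on a toy example. The sketch below is only an illustration with made-up sizes (`d = 4`, 20 "tokens", 12 "contexts"), not the model's real matrices. It builds a logit matrix as hidden states times LM-head rows and confirms that its rank never exceeds the hidden width. The `matrix_rank` helper is hypothetical and uses exact rational arithmetic so the result does not depend on floating-point tolerances.

```python
import random
from fractions import Fraction


def matrix_rank(rows):
    """Exact rank via Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            if factor != 0:
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == len(m):
            break
    return rank


random.seed(0)
d, vocab, contexts = 4, 20, 12  # toy sizes, chosen only for illustration

H = [[random.randint(-3, 3) for _ in range(d)] for _ in range(contexts)]  # hidden states
W = [[random.randint(-3, 3) for _ in range(d)] for _ in range(vocab)]     # LM head rows

# logits[i][j] is the dot product of hidden state i with the output row of token j.
logits = [[sum(h * w for h, w in zip(h_row, w_row)) for w_row in W] for h_row in H]

print(matrix_rank(logits) <= d)  # True: the 12 x 20 logit matrix has rank at most 4
```

However many contexts or vocabulary entries are added, the logit matrix stays confined to rank `d`. That is why a smaller vocabulary leaves less for a 176-wide model to approximate.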
The Attention Layout: Grouped-Query Attention
Our attention architecture uses Grouped-Query Attention (GQA) with 4 query heads and 1 key-value head (a 4:1 head ratio). In a 1.5M parameter model, memory bandwidth is not the bottleneck that GQA is usually meant to relieve, but the layout still forces the model to compress its keys and values into a single shared subspace. This acts as a regularizer, preventing individual attention heads from developing isolated representations that lead to early overfitting on the pre-training corpus.
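To make the 4:1 layout concrete, here is a small sketch of how query heads map onto the shared key-value head, and how many projection weights the layout saves compared with giving every query head its own key-value head. The integer-division mapping is the common GQA convention and is assumed here rather than taken from the training code. The helper name `kv_proj_params` is hypothetical.

```python
num_q_heads, num_kv_heads = 4, 1
head_dim, hidden_size, num_layers = 44, 176, 4

# Each group of `group_size` consecutive query heads reads the same KV head.
group_size = num_q_heads // num_kv_heads
kv_head_for_query = [q // group_size for q in range(num_q_heads)]
print(kv_head_for_query)  # [0, 0, 0, 0]


def kv_proj_params(n_kv_heads):
    """Weights in k_proj + v_proj for one layer (no bias)."""
    return 2 * hidden_size * n_kv_heads * head_dim


saved = num_layers * (kv_proj_params(num_q_heads) - kv_proj_params(num_kv_heads))
print(saved)  # 185856 weights saved across 4 layers
```

At this scale the saving is roughly 12% of the whole model, so the single key-value head is a meaningful part of how the budget stays under 1.5M.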
The Data & Training Volume: High-Density Saturation
To get this level of coherence out of 1.49M parameters, I had to ensure the pre-training data was highly dense. I streamed and packed roughly 300,000 unique stories from the TinyStoriesV2 dataset, creating a clean base corpus of about 54 million unique tokens.
While the full training run was scheduled for 113,400 steps, I found the optimal weights at Step 101,000.
At this release point, with an effective batch size of 64 and a context window of 320 tokens, the model had processed exactly 2,068,480,000 tokens (~2.07 Billion tokens).
This means the model cycled through the 300,000-story corpus for approximately 38 full epochs. This massive overtraining density, averaging over 1,300 tokens processed for every single parameter in the model, is what allowed the Nano version to achieve its final 1.5177 validation loss (4.56 Perplexity) and lock in its stable grammatical footing.
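The figures in this section follow from simple arithmetic, reproduced below as a quick check. The corpus size (54 million tokens) and parameter count (1.49M) are the rounded values quoted in this post, so the derived numbers are approximate. The variable names are illustrative only.

```python
import math

steps, batch_size, context_len = 101_000, 64, 320
corpus_tokens = 54_000_000   # "about 54 million unique tokens" (rounded)
num_params = 1_490_000       # "1.49M parameters" (rounded)
val_loss = 1.5177

tokens_seen = steps * batch_size * context_len
epochs = tokens_seen / corpus_tokens
tokens_per_param = tokens_seen / num_params
perplexity = math.exp(val_loss)  # perplexity = exp(cross-entropy loss in nats)

print(tokens_seen)                 # 2068480000
print(round(epochs, 1))            # 38.3
print(round(tokens_per_param))     # 1388
print(round(perplexity, 2))        # 4.56
```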
The Attention Sink Anchor
Another major issue was long-context attention degradation. During generation, small models often start outputting repetitive loops once the sequence length exceeds 100 tokens.
I turned to the breakthrough paper on Attention Sinks by Xiao et al., "Efficient Streaming Language Models with Attention Sinks".
The researchers found that autoregressive LLMs allocate a disproportionately large share of attention scores to the very first tokens of a sequence, regardless of their semantic meaning. These initial tokens act as an attention sink, absorbing attention that the heads do not need to place elsewhere, which keeps the softmax distribution stable. If you use a sliding window and evict those initial tokens, the attention distribution is disrupted and generation quality collapses.
Automatically Injecting the Sink via Tokenizer Post-Processing
To guarantee that the <sink> token (ID 3) is consistently assigned to Position 0 during both pre-training packaging and downstream user inference, I configured a native "post_processor" directly in the tokenizer.json file.
Using Hugging Face's TemplateProcessing, the tokenizer automatically prepends the <sink> token, followed by the standard start-of-sequence token <s> (ID 0) to any input string. This eliminates the need for manual prompt modification, ensuring the attention heads always have a dedicated, permanent coordinate at step 0 to dump their unused activation energy.
Here is the exact post-processing configuration embedded in the tokenizer:
"post_processor": {
  "type": "TemplateProcessing",
  "single": [
    { "SpecialToken": { "id": "<sink>", "type_id": 0 } },
    { "SpecialToken": { "id": "<s>", "type_id": 0 } },
    { "Sequence": { "id": "A", "type_id": 0 } }
  ],
  "pair": [
    { "SpecialToken": { "id": "<sink>", "type_id": 0 } },
    { "SpecialToken": { "id": "<s>", "type_id": 0 } },
    { "Sequence": { "id": "A", "type_id": 0 } },
    { "SpecialToken": { "id": "</s>", "type_id": 0 } },
    { "Sequence": { "id": "B", "type_id": 1 } }
  ],
  "special_tokens": {
    "<s>": { "id": "<s>", "ids": [0], "tokens": ["<s>"] },
    "</s>": { "id": "</s>", "ids": [1], "tokens": ["</s>"] },
    "<sink>": { "id": "<sink>", "ids": [3], "tokens": ["<sink>"] }
  }
}
By embedding this processing natively, any call to tokenizer(prompt) automatically maps the inputs to a <sink> <s> [Prompt] structure. The result is highly stable long-context attention maps, allowing our tiny 1.49M model to write full, 120+ token stories cleanly.
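As a rough illustration of what the two templates produce, the following pure-Python sketch mimics them without loading the tokenizer. The function name `apply_template` and the example prompt IDs are hypothetical. Only the special-token IDs (0, 1, 3) and the type IDs come from the configuration above.

```python
SINK_ID, BOS_ID, EOS_ID = 3, 0, 1  # <sink>, <s>, </s> as listed in "special_tokens"


def apply_template(ids_a, ids_b=None):
    """Mimic the "single" and "pair" templates; returns (input_ids, type_ids)."""
    input_ids = [SINK_ID, BOS_ID] + list(ids_a)
    type_ids = [0] * len(input_ids)
    if ids_b is not None:
        # Pair template: <sink> <s> A </s> B, with B marked as segment 1.
        input_ids += [EOS_ID] + list(ids_b)
        type_ids += [0] + [1] * len(ids_b)
    return input_ids, type_ids


prompt_ids = [101, 102, 103]  # made-up token IDs standing in for a tokenized prompt
print(apply_template(prompt_ids))
# ([3, 0, 101, 102, 103], [0, 0, 0, 0, 0])
```

Whatever the prompt is, position 0 always holds ID 3, which is the property the sink relies on.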
Implementing the Dedicated Sink Token
To prevent the collapse described above, I explicitly engineered a dedicated Attention-Sink Anchor (<sink>) into our dataset pipeline.
During data packaging, I force-prepended the <sink> token (ID 3) to Position 0 of every single 320-token block, so the model sees the same anchor at step 0 during training that it sees at inference. This gives the attention heads a dedicated, permanent coordinate at step 0 for attention they do not need to place elsewhere. The result is stable long-context attention maps, allowing our tiny 1.49M model to write full, 120+ token stories cleanly.
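The packaging step could look roughly like the sketch below. It is not the actual pipeline: the function name `pack_with_sink`, the use of `</s>` as a story separator, and dropping the final partial block are all assumptions made for illustration. The only details taken from this post are the block length (320) and the `<sink>` ID (3) at position 0.

```python
SINK_ID, EOS_ID, BLOCK_SIZE = 3, 1, 320


def pack_with_sink(tokenized_stories, block_size=BLOCK_SIZE):
    """Concatenate stories and cut them into blocks that each begin with <sink>."""
    stream = []
    for ids in tokenized_stories:
        stream.extend(ids)
        stream.append(EOS_ID)  # assumption: stories are separated by </s>

    payload = block_size - 1   # one slot per block is reserved for <sink>
    blocks = []
    # Assumption: a trailing partial block is dropped instead of padded.
    for start in range(0, len(stream) - payload + 1, payload):
        blocks.append([SINK_ID] + stream[start:start + payload])
    return blocks


# Two fake "stories" made of placeholder token IDs.
blocks = pack_with_sink([[10] * 500, [11] * 400])
print(len(blocks), len(blocks[0]), blocks[0][0])  # 2 320 3
```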
The Suffix-Space Tokenizer (SST) & Trailing Normalization
If you've played with raw Byte-Level BPE tokenizers, you know how sensitive they are to trailing spaces in prompts.
In standard tokenizers, words are often represented as space-prefixed (e.g., _once). If a user inputs a prompt and omits the trailing space (e.g., typing "Once upon a time, a little boy named Paco" instead of "Once upon a time, a little boy named Paco "), the tokenizer is forced to split the final word into character-level fragments (like P + ac + o). This pushes the first step of generation completely out-of-distribution, causing the model to output immediate gibberish.
To solve this, I adopted two core tokenizer philosophies:
- Suffix-Space BPE (SST): Our tokenizer is built on a Suffix-Space philosophy. Words are tokenized with their trailing spaces attached (e.g., `once `, `upon `, `a `). This matches the natural way punctuation marks (like commas and periods) attach directly to the preceding word without needing extra embedding dimensions to learn spacing rules.
- Built-In Trailing-Space Normalization: I baked a custom trailing-space normalizer directly into the `tokenizer.json` pre-processing pipeline. We replaced the `null` normalizer block with a regex-based `Replace` sequence:
"normalizer": {
  "type": "Sequence",
  "normalizers": [
    {
      "type": "Replace",
      "pattern": { "Regex": "(?:\\s)*$" },
      "content": " "
    }
  ]
}
This regex matches any trailing whitespace (or lack thereof) at the very end of the prompt string and standardizes it to exactly one space. This ensures the final word of your prompt is always parsed as a complete, cleanly spaced token, keeping the model's generation starting state within its training distribution.
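The effect of the normalizer can be approximated in plain Python with the same pattern. This is a sketch of the intended behavior, not the tokenizer's own regex engine. The function name `normalize_trailing_space` is hypothetical. The `count=1` argument is needed because Python's `re.sub` would otherwise also replace the empty match left at the end of the string after the whitespace run, producing two spaces.

```python
import re

# Same pattern as in the normalizer block above.
TRAILING_WS = re.compile(r"(?:\s)*$")


def normalize_trailing_space(text):
    """Collapse any trailing whitespace (or none) into exactly one space."""
    # count=1 stops after the first match so the empty end-of-string match
    # is not replaced a second time.
    return TRAILING_WS.sub(" ", text, count=1)


for prompt in ["Paco", "Paco ", "Paco   \n"]:
    print(repr(normalize_trailing_space(prompt)))
# 'Paco '
# 'Paco '
# 'Paco '
```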
The Blooper Reel: Early-Run Failure Logs
During the early stages of the main pre-training run, before the weights had fully converged, the model produced some classic micro-model failures. Here are a few actual snippets I saved during development:
Blooper 1 (Step 11K): The Scared, Climbing Tree
At Step 11K, the model developed a major semantic role-confusion bug, attributing the character's fear to the tree itself:
Prompt Context: play. When they got to the park, they found the tree and started to climb. One little boy, named Tim, was scared to climb the tree. He said to his friends, "I don't want to climb. I am scared." His friends told him, "It's okay, Tim. You don't
Generated: play. When they got to the park, they found the tree and started to climb. One little boy, named Tim, was scared to climb the tree. He said to his friends, "I don't want to climb. I am scared." His friends told him, "It's okay, Tim. You don't. You can do it again."
Tim was sad and said, "I will help you, friends. I will help you." He climbed the tree and got the tree. The tree was happy and said, "Thank you, Tim!" Tim was happy to help his friends. They all worked together to get the tree. The tree was not scared anymore.
From that day on, Tim and his friends were the best of friends. They played together every day. The tree was not scared anymore. The tree was not scared anymore. It was happy and could help its friends. And they all lived happily ever after.
Blooper 2 (Step 20K): Lily Wishes "Forgive the Bird"
At Step 20K, we hit a token-boundary parsing bug where "wished for" was completed as "wished forgive", causing a sudden, nonsensical moral resolution:
Prompt Context: ries. The bird ate them all up and felt much better. Then, something unexpected happened. The bird started to grow bigger and bigger! It turned out that the bird was a magic bird! The bird thanked Lily for saving it and gave her a wish. Lily wished for
Generated: ries. The bird ate them all up and felt much better. Then, something unexpected happened. The bird started to grow bigger and bigger! It turned out that the bird was a magic bird! The bird thanked Lily for saving it and gave her a wish. Lily wished forgive the bird and they all became friends. From that day on, the bird and Lily were the best of friends.
Blooper 3 (Step 80K): The "and, and" Conjunction Stutter
At Step 80K, the model struggled to close its terminal punctuation during a high-temperature generation, getting stuck in an infinite double-conjunction loop:
Prompt Context: . One day, the jeep saw an angry dog. The dog did not want to play. The jeep was sad. The jeep went to find its friends. They all played and had fun. The angry dog saw them and wanted to play too. They all played together and
Generated: . One day, the jeep saw an angry dog. The dog did not want to play. The jeep was sad. The jeep went to find its friends. They all played and had fun. The angry dog saw them and wanted to play too. They all played together and, and they had lots of fun.
Honest Limitations (Where Does It Break?)
A 1.49M parameter model is a specialist. To be completely transparent, here is where the model runs out of capacity:
- Character Budget: The model can reliably track 2 characters (e.g., Tim and a bird). If you introduce a 3rd or 4th character, its coreference attention weights will collapse, and characters will start morphing names or actions.
- Complex Plot Twists: It struggles with complex plot shifts. If you prompt it with a mystery, it will usually resolve it in the very next sentence by having a character "find" the solution on the floor.
- Long-Range State Consistency: While it is stable for 120 tokens, past 200 tokens it will begin to experience narrative drift (e.g., a character entering a kitchen that was previously described as being inside a cardboard box).
Side-by-Side Generations (Evaluating under temp=0.35, min_p=0.10)
Below is a direct comparison of our 1.5M (v2.0 Nano) model and the larger 4.0M (v1.0) model, evaluated under identical sampling parameters.
Seed Prompt: "Once"
Aurelius-Llama-4.0M-v1.0
Once upon a time, there was a little girl named Lily. She loved to play with her toys and have fun. One day, she found a big box in her room. She was very happy and wanted to see what was inside. Lily opened the box and found a pretty dress. She put on the dress and went outside to play. She saw her friend, Tom, and said, "Look, Tom! I found a dress!" Tom looked at the dress and smiled. He said, "Wow! That's a nice dress!" Lily and Tom played with the dress all day. They took turns wearing it and pretending to be kings and queens. They had so much fun together. At the end of the day, they put the dress back in the box and said, "We had a great day!"
Aurelius-Llama-v2.0-1.5M-Nano (Our Model)
Once upon a time, there was a little boy named Tim. Tim loved to play with his toy car. One day, he found a big box in his room. He was very curious about the box. Tim opened the box and found a toy car. The car was big and red. Tim was very happy. He wanted to play with the car. But when he triedto pull the car, it did not move. Tim was sad. Then, Tim had an idea. He took the car to his mom. She said, "Let's pull the car out of the box." They pushed and pulled. The car started to move! It was not a car at all. It was a magic car! The car could make the car go fast again. Tim and his mom were very happy. They played with the magic car all day.
Seed Prompt: "Once upon a time, a little boy named Paco"
Aurelius-Llama-4.0M-v1.0
Once upon a time, a little boy named Paco was very excited. He wanted to go to the park with his mom. He put on his shoes and ran outside. When Paco got to the park, he saw a big slide. He ran up to it and started to slide down. He slid down fast and laughed as he went faster and faster. He felt so happy and excited. When he got to the park, he saw a big, green frog. The frog was very friendly and said, "Hi, Paco! Do you want to slide with me?" Paco smiled and said, "Yes, let's slide together!" So, Paco and the frog slid down the slide together, laughing and having fun.
Aurelius-Llama-v2.0-1.5M-Nano (Our Model)
Once upon a time, a little boy named Paco was playing with his toy car. He was very excited to see what he could do. He ran to his mom and said, "Mom, I want to play with my car!" His mom smiled and said, "Okay, Paco and your car, but be careful." Paco and his mom played with the car, making it go fast. They had so much fun with the car. But then, something unexpected happened. The car's car started to move! It was not a car, but a big turtle named Tina. Tina was very surprised! "Hello, Paco and Mom!" said Tina. Paco was so surprised that he dropped the car. The turtle said, "Thank you for finding my car! I was stuck as a turtle! Let's play together again!" Paco and Tina were happy to have a new friend.
Prompt: "Once upon a time, a little boy named Paco went to the beach. He"
Aurelius-Llama-4.0M-v1.0
Once upon a time, a little boy named Paco went to the beach. He loved to play in the sand and swim in the water. One day, he saw a big, red ball in the sand. He wanted to play with it, so he ran to get it. As Paco got close to the ball, he heard a voice. "Hey, that's my ball!" said the voice. Paco looked around and saw a little girl named Lily. She was holding the ball in her hand. "Hi, Lily!" said Paco. "I found this ball. It was my favorite toy." Lily smiled and said, "Thank you, Paco! I found it in the sand. I found it in the sand." Paco and Lily played with the ball together, and they became good friends.
Aurelius-Llama-v2.0-1.5M-Nano (Our Model)
Once upon a time, a little boy named Paco went to the beach. He was very excited to play in the sand and swim in the water. He saw a big crab and wanted to play with it. Paco was very excited and jumped in the water. The crab jumped in the water and started to swim with Paco. Paco was having so much fun. But then, something unexpected happened. Paco was not a crab at all! It was a big, friendly crab who lived in the ocean. The crab was not a crab anymore. He was a real crab who lived in the ocean. Pacan was so happy to have a new friend, and he played with the crab all day long.
Seed Prompt: "" (Empty String - Cold Start!)
Aurelius-Llama-4.0M-v1.0
[None] (Note: The 4.0M model cannot initiate generation from an empty string without a custom prompt handler, as its tokenization structures require a text prefix to align the special token masks).
Aurelius-Llama-v2.0-1.5M-Nano (Our Model)
Once upon a time, there was a little girl named Sue. Sue loved to play with her toys and make music. One day, she found a big box in her room. The box was very old and had many colors. Sue wanted to see what was inside. Sue opened the box and saw a lot of colors. She thought it would be fun to play with the box. So, she took the box and started to drawon the floor. She drew a big sun, a house, and a happy family. Then, she drew a big sun, a green leaf, and a green leaf. Sue was very happy with her new toy. She played with it all day long. But then, she heard a loud noise. It was a big bear! The bear was hungry and wanted to eat Sue. Sue was scared, but she knew she had to be brave. She ran to the bear and said, "Don't worry, Mr. Bear.I will help you find your toy." The bear was very happy and thanked Sue for her help. They played with the toy and had lots of fun. Sue learned that it is good to help others when they need it. And they all lived happily ever after.
Important Disclaimer & Liability Limitation
This model is provided strictly "as is" and "with all faults," without warranty of any kind, express or implied.
- Experimental & Research Nature Only: This model is an experimental, micro-scale prototype developed strictly for educational, scientific, and academic benchmarking purposes. Any stated "improvements" or "capabilities" are relative only to other micro-scale baselines and do not indicate suitability for production environments, commercial applications, or consumer-facing products.
- No Safety Alignment: While the pre-training dataset (`TinyStoriesV2`) is conceptually designed around simple, child-like narratives, this model has not undergone safety tuning or toxic content filtering. It can output unpredictable, nonsensical, or potentially inappropriate text. Consequently, under no circumstances should this model or its outputs be deemed safe, verified, or appropriate for children or general public interaction.
- User Assumption of Risk: Any output generated by this model is the result of statistical text completion and does not represent the views, opinions, or endorsements of the developers or hosting entities. The end-user assumes all liabilities and risks associated with running, testing, or utilizing the model or any downstream text generated by it.
- Architectural and Trademark Clarification: The use of "Llama" in the model name refers solely to the underlying open-source mathematical architecture used to structure the network layers (such as RMSNorm, SwiGLU, and RoPE). This model is trained from scratch and is not affiliated with, endorsed by, or associated with Meta Platforms, Inc. or any of its affiliates.
Usage & Integration
Because this model has been compiled into standard, native layers, you can load it using the standard Hugging Face transformers library with zero custom configurations or remote execution flags.
I recommend using a moderate temperature paired with Min-P sampling for the highest-fidelity outputs:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MultivexAI/Aurelius-Llama-v2.0-1.5M-Nano"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Note: You can omit the trailing space; the tokenizer normalizer will automatically handle it!
prompt = "Once upon a time, a little boy named Paco"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    temperature=0.60,
    min_p=0.15,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
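For readers unfamiliar with Min-P, the sketch below shows the general idea in plain Python: after temperature scaling, any token whose probability falls below `min_p` times the top token's probability is discarded, and the rest are renormalized. This is a simplified stand-in, not the `transformers` implementation. The function names are hypothetical, and applying temperature before the filter is an assumption about ordering.

```python
import math
import random


def min_p_filter(logits, temperature=0.35, min_p=0.10):
    """Return renormalized probabilities after temperature scaling and Min-P."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    threshold = min_p * max(probs)               # cutoff scales with the top token
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]


def sample_token(logits, rng=random, **kwargs):
    """Draw one token index from the filtered distribution."""
    probs = min_p_filter(logits, **kwargs)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, -3.0]  # made-up scores for a three-token vocabulary
print(min_p_filter(toy_logits, temperature=1.0, min_p=0.10))
print(sample_token(toy_logits, rng=random.Random(0)))
```

Because the cutoff follows the top token's probability, the filter is strict when the model is confident and more permissive when the distribution is flat. That is likely why it pairs well with a moderate temperature on a model this small.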
References
- Godey, N., de la Clergerie, É., & Sagot, B. (2024). "Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck."
- Hugging Face (2025). "The Optimal Architecture for Small Language Models."
- Xiao, G., et al. (2023). "Efficient Streaming Language Models with Attention Sinks." arXiv preprint arXiv:2309.17453.