AtomixS2-5M-v1.0-GGUF

A Quick Note on the GGUF Files

Since this is the GGUF version of the model, we wanted to share a quick heads-up about which files you should actually download.

We highly recommend sticking to the unquantized F32 or F16 files.

At 5.98 million parameters, the network has very little redundancy to absorb quantization error. Large models usually tolerate being squeezed down to 4-bit or even 3-bit weights, but a network this small degrades very quickly.

If you try running the quantized files (like Q8_0 or Q4_K_M), the rounding errors overwhelm the layers. The model will lose its grip on basic English grammar, misspell words, and get stuck in loops. We uploaded those quantized files purely as experimental artifacts for you to play around with, but if you want to see the model actually follow grammar and reasoning, stick to the F16 or F32 files.


At AtomixLabs, our research often focuses on the physical constraints of neural architectures. With AtomixS2-5M-v1.0, we wanted to explore the absolute floor of language acquisition: What happens when you restrict a model to just under 6 million parameters?

At this extreme scale, a model does not have the capacity to memorize the internet, store vast encyclopedias of trivia, or comprehend deep physical-world mechanics. Instead, it becomes a pure engine of syntax and structure. This model is our attempt to build a highly active, fluent, and structurally sound micro-model that runs easily on almost any hardware—from older consumer GPUs to standard laptop processors. It is an exploration of parameter density and careful data curation over sheer scale.

Architectural Constraints and Vocabulary Design

AtomixS2-5M is a standard decoder-only transformer, built with a very tight structural configuration:

  • Parameter Count: 5.98 Million
  • Layers: 4
  • Attention Heads: 8
  • Context Window: 512 Tokens

When working with a parameter budget this constrained, traditional methods of tokenization become a major liability. Standard vocabularies often leak capacity by assigning precious embedding parameters to unused symbols, broken formatting fragments, or redundant uppercase and lowercase variations of the exact same word.

To resolve this, we engineered a custom, highly dense 3,584-token vocabulary. We used custom token mapping so that the model doesn't waste parameters relearning basic capitalization patterns or storing empty placeholder slots. Every row of the embedding matrix is meant to back a token the model actually uses. By structuring the vocabulary this way, the model can direct more of its limited capacity toward grammatical rules and logical transitions.
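
The effect of vocabulary size on the parameter budget can be sketched with simple arithmetic. The hidden size below is a hypothetical value chosen only for illustration (the card does not state it), and 32,000 tokens is used as a comparison point because it is a common size for general-purpose tokenizers.

```python
TOTAL_PARAMS = 5_980_000   # parameter count stated in this card
D_MODEL = 256              # HYPOTHETICAL hidden size, not stated in this card


def embedding_params(vocab_size, d_model):
    """Number of weights in a token embedding table."""
    return vocab_size * d_model


for label, vocab_size in [("custom vocabulary", 3_584),
                          ("32,000-token vocabulary", 32_000)]:
    params = embedding_params(vocab_size, D_MODEL)
    share = params / TOTAL_PARAMS
    print(f"{label}: {params:,} embedding weights "
          f"({share:.0%} of the total budget)")
```

Under this assumed width, a 32,000-token embedding table would by itself be larger than the whole 5.98M budget, which is the pressure the compact vocabulary is meant to relieve.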

The Training Mixture: Syntax over Trivia

To get a model this small to output cohesive text, the training diet has to be incredibly deliberate. If you feed a micro-model nothing but raw, unfiltered web data, it tends to become noisy and chaotic. We trained AtomixS2 over a heavily curated, multi-domain mixture designed specifically to teach structure, narrative flow, and procedural formatting rather than just isolated facts.

Our final pre-training corpus was constructed using carefully balanced subsets from the following open-source datasets:

  • openbmb/Ultra-FineWeb-L3 (~54%)
  • HuggingFaceTB/smollm-corpus (~22%)
  • Aarushhh/finemath-refined (~14%)
  • openbmb/UltraData-Math (~10%)

By blending foundational web text with conversational narratives and a heavy dose of structured mathematics, we gave the model a strong structural anchor. The math data, for instance, isn't there to teach a 6M-parameter model advanced calculus. Instead, it forces the model's limited attention heads to learn how to track state, format markdown lists, close LaTeX brackets, and follow a strict sequential chain of thought, habits that carry over to its general English syntax.
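
One way to read the proportions listed above is as sampling weights over the source datasets. The sketch below draws source names with those weights; it is an illustration, not the actual data pipeline, and it assumes the approximate percentages can be treated as exact per-example probabilities (the card does not say whether the shares are measured in tokens or documents).

```python
import random
from collections import Counter

# Approximate shares from the list above, treated as sampling probabilities.
MIXTURE = {
    "openbmb/Ultra-FineWeb-L3": 0.54,
    "HuggingFaceTB/smollm-corpus": 0.22,
    "Aarushhh/finemath-refined": 0.14,
    "openbmb/UltraData-Math": 0.10,
}


def sample_sources(n, seed=0):
    """Draw n dataset names according to the mixture weights."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = list(MIXTURE.values())
    return rng.choices(names, weights=weights, k=n)


N_DRAWS = 10_000
counts = Counter(sample_sources(N_DRAWS))
for name, share in MIXTURE.items():
    print(f"{name}: target {share:.0%}, sampled {counts[name] / N_DRAWS:.1%}")
```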

Recommended Generation Settings

Because AtomixS2-5M has a narrow parameter space, its probability distributions can behave differently than those of massive models. To get the cleanest, most coherent text generation, we strongly recommend the following sampling parameters:

  • temperature: 0.5 — A slightly lower temperature helps keep the model grounded. It prevents the network from wandering into the noisy "tail" of its vocabulary.
  • min_p: 0.1 — This dynamically truncates the lowest-probability tokens. It acts as a great filter against sudden hallucinations or spelling breaks.
  • repetition_penalty: 1.05 to 1.1 — Small models can occasionally get caught in structural formatting loops (like generating continuous markdown tables). A light repetition penalty gently nudges the model to keep moving forward without destroying its natural vocabulary flow.
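
To show what these three settings do to a next-token distribution, here is a minimal pure-Python sketch. The function names and toy logits are illustrative assumptions, and real inference engines may apply the steps in a different order or with extra details.

```python
import math


def apply_repetition_penalty(logits, previous_tokens, penalty):
    """Make already-generated tokens less likely (divide positive logits,
    multiply negative ones)."""
    out = list(logits)
    for token in set(previous_tokens):
        out[token] = out[token] / penalty if out[token] > 0 else out[token] * penalty
    return out


def sample_distribution(logits, temperature=0.5, min_p=0.1):
    """Temperature-scaled softmax followed by min_p truncation."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]   # subtract peak for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    # min_p keeps tokens whose probability is at least min_p * (top probability).
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]


logits = [2.0, 1.0, -3.0]                          # toy three-token vocabulary
penalized = apply_repetition_penalty(logits, previous_tokens=[0], penalty=1.1)
probs = sample_distribution(penalized, temperature=0.5, min_p=0.1)
print([round(p, 4) for p in probs])
```

In this toy case the unlikely third token is removed entirely by min_p, while the penalty slightly lowers the score of the token that was already generated.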

Benchmark Performance

We evaluated AtomixS2-5M-v1.0 using the EleutherAI LM Evaluation Harness. The scores reported below are length-normalized accuracies (acc_norm): each candidate answer's log-likelihood is divided by its length before the best answer is picked, which reduces the bias toward shorter answers.

| Benchmark | Score | What this means at the 5M Scale |
| --- | --- | --- |
| HellaSwag | 28.27% | Tests common-sense sentence completion. The model scores above the 25% chance level of this four-choice task, which we attribute to its stable grammar and syntactic transitions. |
| ArithMark 2.0 | 27.92% | Tests basic integer sequence prediction. The inclusion of procedural math data helps the model handle numeric formats and basic arithmetic. |
| ARC-Easy | 32.79% | Tests basic grade-school science logic. |
| ARC-Challenge | 21.08% | Tests advanced reasoning. Expectedly difficult for micro-models; the score is no better than chance. |
| PIQA | 53.70% | Tests physical commonsense (e.g., how water reacts with a sponge). This is a two-choice task, so the score sits close to the 50% chance level. Because a 5.98M-parameter model lacks the capacity to store much real-world physical knowledge, this reflects our trade-off of dedicating limited parameters to syntax rather than memorization. |
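
The difference between plain accuracy and the length-normalized variant used for these scores can be sketched as follows. The log-likelihood values are made up for illustration, and dividing by UTF-8 byte length is an assumption about how the normalization is done; the harness documentation is the authority on the exact rule.

```python
def pick_answer(choices, loglikelihoods, normalize):
    """Return the index of the highest-scoring choice.

    With normalize=True, each log-likelihood is divided by the choice's
    length in bytes, so long answers are not penalized just for being long.
    """
    scores = []
    for text, loglik in zip(choices, loglikelihoods):
        length = len(text.encode("utf-8"))
        scores.append(loglik / length if normalize else loglik)
    return max(range(len(choices)), key=scores.__getitem__)


choices = ["Yes.", "She dries the plate with a towel."]
logliks = [-6.0, -18.0]   # MADE-UP values for illustration

print("acc picks:     ", choices[pick_answer(choices, logliks, normalize=False)])
print("acc_norm picks:", choices[pick_answer(choices, logliks, normalize=True)])
```

Here the raw score favors the short answer simply because it has fewer tokens to be wrong about, while the normalized score favors the longer one.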

What It's Like to Use

Interacting with AtomixS2-5M is a unique experience. It leans heavily into a formal, academic, and highly structured tone. It spells words with high accuracy and reliably outputs clean punctuation, markdown steps, and logical clauses.

However, users should keep in mind that its factual grounding is incredibly thin. It will confidently hallucinate an entirely fabricated historical event or scientific theory, but it will do so using flawless grammar and excellent paragraph structure. It is a model that has mastered the shape of human language, but relies entirely on you to provide the factual context. It serves as an excellent, lightweight foundation for local research, educational testing, or syntax-parsing experiments.


Safety Disclaimer, Ethical Considerations & Limitations

Limitations and Operational Warnings

AtomixS2-5M-v1.0 is an experimental, research-oriented micro-model. Due to its extremely limited parameter scale, it fundamentally lacks the complex contextual grounding, broad world knowledge, and safety-alignment mechanisms integrated into larger, commercial language models. Please read the following operational guidelines and limitations carefully before deploying or interacting with this model.

1. Severe Factual Hallucination

This model is highly susceptible to generating plausible-sounding but entirely fabricated information. It has learned the grammatical rules of how facts are stated, but it does not have the parameter capacity to store the facts themselves.

  • No Critical Use: Under no circumstances should this model be used to generate or verify medical, legal, financial, or safety-critical information.
  • Verification Required: Any factual claims, citations, or mathematical calculations produced by the model must be independently verified by a human expert.

2. Absence of Safety Alignment (RLHF)

AtomixS2-5M-v1.0 is a foundational pre-trained model. It has not undergone Reinforcement Learning from Human Feedback (RLHF), Constitutional AI training, or any other adversarial safety-tuning process.

  • Unfiltered Outputs: The model may generate outputs that are biased, offensive, explicit, or otherwise inappropriate. It will generally comply with malicious or unethical prompts without triggering any refusal mechanisms.
  • Inherited Biases: The model was trained on subsets of public web data and synthetic datasets. It inevitably reflects and may amplify the historical, societal, and cultural biases present in that training data.

3. Contextual and Logical Drift

While the model demonstrates strong syntactic stability over short generations, its 512-token context window and small hidden dimensions limit its long-range focus. On longer generations, the model may drift off-topic, repeat structural formats, or devolve into logical contradictions. It is best utilized for short-form generation, syntax parsing, and foundational research rather than extended multi-turn conversations.
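
In practice, the 512-token window means the prompt and the generated text have to share one budget. A minimal sketch of keeping a prompt within that budget is shown below; the helper name and the keep-the-most-recent-tokens policy are assumptions for illustration, not something the model requires.

```python
CONTEXT_WINDOW = 512   # context window stated in this card


def fit_prompt(token_ids, max_new_tokens, context_window=CONTEXT_WINDOW):
    """Keep only the most recent prompt tokens so that the prompt plus
    max_new_tokens fits inside the context window."""
    if not 0 < max_new_tokens < context_window:
        raise ValueError("max_new_tokens must be between 1 and context_window - 1")
    budget = context_window - max_new_tokens
    return token_ids[-budget:]


long_prompt = list(range(1000))            # stand-in for real token ids
trimmed = fit_prompt(long_prompt, max_new_tokens=128)
print(len(trimmed), trimmed[0], trimmed[-1])
```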

4. Not for Public or Unsupervised Deployment

Because the model lacks content filters and safety guardrails, it is highly unsuitable for unsupervised deployment in public-facing applications, customer service bots, or environments where it might interact with minors. Any integration into a software pipeline should include robust external filtering and safety moderation layers.

5. Liability and "As-Is" Provision

AtomixS2-5M-v1.0 is provided by AtomixLabs strictly "as is" for the purposes of academic research, efficiency testing, and local machine learning education. AtomixLabs makes no warranties regarding the safety, accuracy, or reliability of the model. AtomixLabs and its contributors take no responsibility and bear no liability for the outputs generated by the model, or for any downstream applications, damages, or consequences resulting from its use. Users assume full responsibility for how they deploy, fine-tune, and interact with the model weights.

Acknowledgements & Licensing

This model was trained on the open-source datasets listed under "The Training Mixture: Syntax over Trivia" above.
