SLM-10M
A 9.97M-parameter causal language model trained from scratch, targeting the Open SLM Leaderboard <10M tier.
Intended Use
This is a research model optimised for NLU benchmarking tasks, not open-ended generation. It is best suited for:
| Task | Examples |
|---|---|
| Multiple-choice QA | ARC, HellaSwag, PIQA, ArithMark: score each candidate and pick the highest |
| Log-likelihood ranking | Rank candidate continuations or document relevance by perplexity |
| SLM research | Ablations, architecture studies, efficiency benchmarks at the <10M scale |
| Perplexity evaluation | Measuring language model fit on held-out text corpora |
It is not suited for open-ended text generation, chat, or instruction following: with roughly 10M parameters and an 8,192-token vocabulary, it lacks the capacity for fluent free-form output.
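To make the perplexity-based ranking mentioned in the table concrete, here is a minimal sketch in plain Python. It assumes you already have per-token natural-log probabilities for each candidate (for example from a scoring pass over the model); the function names and the numbers in the example are hypothetical and purely illustrative.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def rank_by_perplexity(candidates):
    """candidates maps a label to its per-token log-probs; lowest perplexity first."""
    return sorted(candidates, key=lambda label: perplexity(candidates[label]))

# Hypothetical log-probs, not produced by SLM-10M.
example = {
    "doc_a": [-1.2, -0.8, -2.0],
    "doc_b": [-3.1, -2.7, -2.9, -3.3],
}
print(rank_by_perplexity(example))
```

Because perplexity averages over tokens, candidates of different lengths are compared on a per-token basis rather than by total log-likelihood.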
Model Details
| Property | Value |
|---|---|
| Parameters | 9,968,640 (~10M) |
| Architecture | Causal Transformer |
| Vocabulary | 8,192 tokens |
| Context length | 1,024 tokens |
| Training tokens | 25B |
| Precision | bfloat16 |
Architecture
| Component | Config |
|---|---|
| Hidden size | 256 |
| Layers | 12 |
| Q heads / KV heads | 8 / 2 (GQA) |
| Head dim | 32 |
| FFN intermediate | 640 |
| Positional encoding | RoPE (θ=100k) |
| Normalization | RMSNorm (fp32 upcast) |
| Activation | SwiGLU |
| Attention | GQA + QK-Norm |
| Weight tying | Embed ↔ LM head |
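As a sanity check, the reported parameter count can be reproduced from the configuration in the table. The sketch below assumes bias-free linear layers, two RMSNorm weight vectors per layer plus a final one, and QK-Norm weights of size head_dim for Q and K in each layer; these layout details are assumptions, not taken from the model code.

```python
# Values from the architecture table.
vocab, hidden, layers = 8_192, 256, 12
q_heads, kv_heads, head_dim, ffn = 8, 2, 32, 640

embed = vocab * hidden                     # shared with the LM head (weight tying)
attn = (
    hidden * q_heads * head_dim            # Q projection
    + 2 * hidden * kv_heads * head_dim     # K and V projections (GQA: 2 KV heads)
    + q_heads * head_dim * hidden          # output projection
)
mlp = 3 * hidden * ffn                     # SwiGLU: gate, up and down projections
norms = 2 * hidden                         # assumed: pre-attention and pre-FFN RMSNorm
qk_norm = 2 * head_dim                     # assumed: one weight vector each for Q and K

per_layer = attn + mlp + norms + qk_norm
total = embed + layers * per_layer + hidden  # + final RMSNorm (assumed)
print(f"{total:,}")
# → 9,968,640
```

Under these assumptions the total matches the 9,968,640 figure listed under Model Details.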
The design follows state-of-the-art SLM recipes (GPT-X2, Qwen3, Gemma2): QK-Norm prevents attention logit explosion, Z-loss stabilises early training (disabled after 31B tokens), and scaled residual init keeps the residual-stream variance bounded.
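For readers unfamiliar with Z-loss, the auxiliary term is commonly defined as the squared log-partition function of the output logits, scaled by a small coefficient. The sketch below uses plain Python for a single position; the function name and the coefficient `1e-4` are assumptions for illustration, not settings reported for this model.

```python
import math

def z_loss(logits, coeff=1e-4):
    """Penalty coeff * log(Z)^2, where Z = sum(exp(logits)) for one position."""
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return coeff * log_z ** 2

print(z_loss([0.0, 0.0, 0.0, 0.0]))   # small logits: log Z = log 4, tiny penalty
print(z_loss([40.0, 0.0, 0.0, 0.0]))  # one large logit: log Z is about 40, penalty grows
```

The penalty pushes log Z towards zero, which discourages the logits from drifting to large magnitudes.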
Training
Data mix (25B tokens total):
| Source | Weight |
|---|---|
| FineWeb-Edu | 55% |
| Cosmopedia-v2 | 25% |
| FineWeb-HQ | 10% |
| FineMath | 10% |
Optimizer: AdamW (fused), lr=3e-3, min_lr=3e-4, β=(0.9, 0.95), wd=0.1, grad_clip=1.0
LR schedule: Warmup (1k steps) → stable → cosine decay tail (last 15% of steps)
Batch: 512K tokens/step (micro-batch 32 × grad_accum 16 × seq_len 1024)
Hardware: NVIDIA GB10, bfloat16, torch.compile
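The batch arithmetic and learning-rate schedule above can be sketched as follows. This is an illustrative reconstruction, not the training code: it assumes linear warmup from zero, exactly 25B training tokens, and a half-cosine decay from the peak rate to `min_lr` over the final 15% of steps. The function name `lr_at` is hypothetical.

```python
import math

MICRO_BATCH, GRAD_ACCUM, SEQ_LEN = 32, 16, 1024
TOKENS_PER_STEP = MICRO_BATCH * GRAD_ACCUM * SEQ_LEN  # 524,288 = 512 * 1024
TOTAL_STEPS = 25_000_000_000 // TOKENS_PER_STEP       # assumes exactly 25B tokens

def lr_at(step, total_steps=TOTAL_STEPS, peak_lr=3e-3, min_lr=3e-4,
          warmup_steps=1_000, decay_frac=0.15):
    """Warmup, then constant, then cosine decay over the last decay_frac of steps."""
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps     # assumed linear warmup
    if step < decay_start:
        return peak_lr                                 # stable phase
    progress = min(1.0, (step - decay_start) / max(1, total_steps - decay_start))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(TOKENS_PER_STEP, TOTAL_STEPS)
for s in (0, 999, 20_000, TOTAL_STEPS):
    print(s, f"{lr_at(s):.2e}")
```

With these numbers a run covers roughly 47.7k optimizer steps, so the decay tail spans about the last 7k of them.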
Evaluation
Zero-shot evaluation on the Open SLM Leaderboard benchmarks:
| Benchmark | Score |
|---|---|
| HellaSwag (acc_norm) | 26.53% |
| ARC-Easy (acc_norm) | 30.47% |
| ARC-Challenge (acc_norm) | 25.00% |
| PIQA (acc_norm) | 50.92% |
| ArithMark-2.0 | 24.32% |
| Avg | 32.38% |
Avg = (HellaSwag + (ARC-Easy + ARC-Challenge) / 2 + PIQA + ArithMark) / 4
Evaluated using lm-evaluation-harness and the ArithMark-2.0 custom benchmark script.
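The leaderboard average can be reproduced from the per-benchmark scores in the table. The snippet below simply applies the stated formula, in which the two ARC variants are averaged first so that ARC counts as a single benchmark; the dictionary keys and function name are illustrative.

```python
scores = {
    "hellaswag": 26.53,
    "arc_easy": 30.47,
    "arc_challenge": 25.00,
    "piqa": 50.92,
    "arithmark": 24.32,
}

def leaderboard_avg(s):
    """Average of four benchmark groups, with ARC-Easy and ARC-Challenge merged."""
    arc = (s["arc_easy"] + s["arc_challenge"]) / 2
    return (s["hellaswag"] + arc + s["piqa"] + s["arithmark"]) / 4

print(f"{leaderboard_avg(scores):.2f}%")
# → 32.38%
```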
Usage
This model is a research artifact for benchmarking, not a chat or generation model. At 10M parameters it is better suited to log-likelihood ranking tasks (multiple-choice benchmarks) than to free-text generation.
Scoring / ranking (recommended)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import torch.nn.functional as F

model = AutoModelForCausalLM.from_pretrained(
    "liodon-ai/slm-10m",
    trust_remote_code=True,
    dtype=torch.bfloat16,
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("liodon-ai/slm-10m", trust_remote_code=True)

def score(context, completion):
    """Mean per-token log-likelihood of `completion` given `context` (higher is better)."""
    full = tokenizer.encode(context + completion, return_tensors="pt").to("cuda")
    # Assumes the tokenizer adds no special tokens, so ctx_len indexes `full` directly.
    ctx_len = len(tokenizer.encode(context, add_special_tokens=False))
    with torch.no_grad():
        logits = model(full).logits[0]
    # logits[t] predicts token t + 1, so shift by one to align with the completion tokens.
    return -F.cross_entropy(logits[ctx_len - 1:-1], full[0, ctx_len:]).item()

context = "Which is an example of a renewable energy resource? Answer:"
choices = [" biomass", " coal", " gas", " oil"]
scores = [score(context, c) for c in choices]
best = choices[scores.index(max(scores))]
print(f"Best answer: {best.strip()}")
# → Best answer: biomass
```
Citation
```bibtex
@software{liodonai2026slm10m,
  author = {{Liodon AI}},
  title  = {SLM-10M},
  year   = {2026},
  url    = {https://huggingface.co/liodon-ai/slm-10m}
}
```
License
Apache 2.0