SLM-10M

A 9.97M-parameter causal language model trained from scratch, targeting the <10M tier of the Open SLM Leaderboard.

Intended Use

This is a research model optimised for NLU benchmarking tasks, not open-ended generation. It is best suited for:

Task	Examples
Multiple-choice QA	ARC, HellaSwag, PIQA, ArithMark — score each candidate and pick the highest
Log-likelihood ranking	Rank candidate continuations or document relevance by perplexity
SLM research	Ablations, architecture studies, efficiency benchmarks at the <10M scale
Perplexity evaluation	Measuring language model fit on held-out text corpora

It is not suited for open-ended text generation, chat, or instruction following — at 10M parameters the vocabulary (8,192 tokens) and capacity are too limited for fluent free-form output.
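
Ranking by perplexity, as listed in the table above, comes down to exponentiating the negative mean token log-probability. The following is a minimal sketch, assuming per-token natural-log probabilities have already been obtained from the model; the `perplexity` helper and the sample values are hypothetical and only illustrate the arithmetic.

```python
import math

def perplexity(token_logprobs):
    """exp of the negative mean log-probability (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Illustrative values only, not outputs of this model.
candidates = {
    "doc_a": [math.log(0.5)] * 4,   # every token assigned probability 0.5
    "doc_b": [math.log(0.25)] * 4,  # every token assigned probability 0.25
}

# Lower perplexity means a better fit, so sort ascending.
ranked = sorted(candidates, key=lambda name: perplexity(candidates[name]))
print(ranked)
```

Because perplexity is averaged per token, candidates of different lengths can be compared on the same scale.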

Model Details

Property	Value
Parameters	9,968,640 (~10M)
Architecture	Causal Transformer
Vocabulary	8,192 tokens
Context length	1,024 tokens
Training tokens	25B
Precision	bfloat16

Architecture

Component	Config
Hidden size	256
Layers	12
Q heads / KV heads	8 / 2 (GQA)
Head dim	32
FFN intermediate	640
Positional encoding	RoPE (θ=100k)
Normalization	RMSNorm (fp32 upcast)
Activation	SwiGLU
Attention	GQA + QK-Norm
Weight tying	Embed ↔ LM head

The design follows state-of-the-art SLM recipes (GPT-X2, Qwen3, Gemma2): QK-Norm prevents attention logit explosion, Z-loss stabilises early training and is disabled afterwards, and scaled residual init keeps the residual-stream variance bounded.
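
The parameter count in Model Details can be reproduced from the architecture table. The sketch below assumes bias-free linear layers, two RMSNorm weights per layer, one QK-Norm weight vector of size head_dim each for Q and K per layer, and a final RMSNorm; these layout details are assumptions rather than something the card states, and `count_params` is a hypothetical helper.

```python
def count_params(vocab=8192, hidden=256, layers=12, q_heads=8,
                 kv_heads=2, head_dim=32, ffn=640, tied=True):
    embed = vocab * hidden                        # token embedding (shared with LM head when tied)
    attn = hidden * q_heads * head_dim            # Q projection
    attn += 2 * hidden * kv_heads * head_dim      # K and V projections (GQA: fewer KV heads)
    attn += q_heads * head_dim * hidden           # output projection
    mlp = 3 * hidden * ffn                        # SwiGLU: gate, up, and down projections
    norms = 2 * hidden + 2 * head_dim             # assumed: 2 RMSNorms + Q/K norm weights
    total = embed + layers * (attn + mlp + norms) + hidden  # + assumed final RMSNorm
    if not tied:
        total += vocab * hidden                   # separate LM head
    return total

print(count_params())  # 9968640
```

Under these assumptions the total matches the 9,968,640 figure exactly, with the tied embedding accounting for about 2.1M of it.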

Training

Data mix (25B tokens total):

Source	Weight
FineWeb-Edu	55%
Cosmopedia-v2	25%
FineWeb-HQ	10%
FineMath	10%

Optimizer: AdamW (fused) — lr=3e-3, min_lr=3e-4, β=(0.9, 0.95), wd=0.1, grad_clip=1.0

LR schedule: Warmup (1k steps) → stable → cosine decay tail (last 15% of steps)
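
The schedule can be sketched as a function of the step index. This assumes a linear warmup from zero and a cosine decay from the peak rate down to min_lr; the card only names the three phases, so the exact shapes, the `lr_at` helper, and its `total_steps` argument are illustrative.

```python
import math

def lr_at(step, total_steps, peak_lr=3e-3, min_lr=3e-4,
          warmup_steps=1000, decay_frac=0.15):
    decay_steps = round(total_steps * decay_frac)  # cosine tail: last 15% of steps
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * step / warmup_steps       # assumed linear warmup from 0
    if step < decay_start:
        return peak_lr                             # stable phase
    progress = min(1.0, (step - decay_start) / max(1, decay_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Example with a hypothetical run length of 10,000 steps.
for s in (0, 500, 1000, 8500, 9250, 10000):
    print(s, lr_at(s, total_steps=10000))
```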

Batch: 512K tokens/step (micro-batch 32 × grad_accum 16 × seq_len 1024)
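
The batch arithmetic works out as follows. The step count is an estimate that assumes "25B" means exactly 25e9 tokens and that every step consumes a full batch; neither is stated in the card.

```python
import math

micro_batch, grad_accum, seq_len = 32, 16, 1024
tokens_per_step = micro_batch * grad_accum * seq_len
print(tokens_per_step)  # 524288, i.e. 512 * 1024

total_tokens = 25_000_000_000  # assumption: 25B taken as exactly 25e9
approx_steps = math.ceil(total_tokens / tokens_per_step)
print(approx_steps)  # 47684
```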

Hardware: NVIDIA GB10, bfloat16, torch.compile

Evaluation

Zero-shot evaluation on the Open SLM Leaderboard benchmarks:

Benchmark	Score
HellaSwag (acc_norm)	26.53%
ARC-Easy (acc_norm)	30.47%
ARC-Challenge (acc_norm)	25.00%
PIQA (acc_norm)	50.92%
ArithMark-2.0	24.32%
Avg	32.38%

Avg = (HellaSwag + (ARC-Easy + ARC-Challenge) / 2 + PIQA + ArithMark) / 4
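
As a check, the formula applied to the scores in the table gives the reported average. The dictionary keys below are illustrative names, not identifiers from the leaderboard.

```python
scores = {
    "hellaswag": 26.53,
    "arc_easy": 30.47,
    "arc_challenge": 25.00,
    "piqa": 50.92,
    "arithmark": 24.32,
}

# The two ARC variants are averaged first so that ARC counts as one benchmark.
arc = (scores["arc_easy"] + scores["arc_challenge"]) / 2
avg = (scores["hellaswag"] + arc + scores["piqa"] + scores["arithmark"]) / 4
print(f"{avg:.2f}%")  # 32.38%
```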

Evaluated using lm-evaluation-harness and the ArithMark-2.0 custom benchmark script.

Usage

This model is a research artifact for benchmarking, not a chat or generation model. At 10M parameters it is better suited to log-likelihood ranking tasks (multiple-choice benchmarks) than to free-text generation.

Scoring / ranking (recommended)

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import torch.nn.functional as F

model = AutoModelForCausalLM.from_pretrained(
    "liodon-ai/slm-10m",
    trust_remote_code=True,
    dtype=torch.bfloat16,
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("liodon-ai/slm-10m", trust_remote_code=True)

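def score(context, completion):
    """Mean log-probability per completion token given the context (higher is better)."""
    # Encode both strings without special tokens so ctx_len indexes `full` correctly.
    full = tokenizer.encode(context + completion, add_special_tokens=False, return_tensors="pt").to("cuda")
    ctx_len = len(tokenizer.encode(context, add_special_tokens=False))
    with torch.no_grad():
        logits = model(full).logits[0].float()  # upcast bf16 logits before computing the loss
    # logits[i] predicts token i + 1, so shift by one to line up with the completion tokens.
    return -F.cross_entropy(logits[ctx_len - 1:-1], full[0, ctx_len:]).item()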

context = "Which is an example of a renewable energy resource? Answer:"
choices = [" biomass", " coal", " gas", " oil"]
scores  = [score(context, c) for c in choices]
best    = choices[scores.index(max(scores))]
print(f"Best answer: {best.strip()}")
# → Best answer: biomass

Citation

@software{liodonai2026slm10m,
  author = {{Liodon AI}},
  title = {SLM-10M},
  year = {2026},
  url = {https://huggingface.co/liodon-ai/slm-10m}
}

License

Apache 2.0
