SLM-10M

A 9.97M-parameter causal language model trained from scratch, targeting the <10M tier of the Open SLM Leaderboard.

Intended Use

This is a research model optimised for NLU benchmarking tasks, not open-ended generation. It is best suited for:

Task                     Examples
Multiple-choice QA       ARC, HellaSwag, PIQA, ArithMark: score each candidate and pick the highest
Log-likelihood ranking   Rank candidate continuations or document relevance by perplexity
SLM research             Ablations, architecture studies, efficiency benchmarks at the <10M scale
Perplexity evaluation    Measuring language model fit on held-out text corpora

It is not suited for open-ended text generation, chat, or instruction following: at 10M parameters, the vocabulary (8,192 tokens) and model capacity are too limited for fluent free-form output.

Model Details

Property          Value
Parameters        9,968,640 (~10M)
Architecture      Causal Transformer
Vocabulary        8,192 tokens
Context length    1,024 tokens
Training tokens   25B
Precision         bfloat16

Architecture

Component              Config
Hidden size            256
Layers                 12
Q heads / KV heads     8 / 2 (GQA)
Head dim               32
FFN intermediate       640
Positional encoding    RoPE (θ=100k)
Normalization          RMSNorm (fp32 upcast)
Activation             SwiGLU
Attention              GQA + QK-Norm
Weight tying           Embed ↔ LM head
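
As a sanity check on the table above, the sketch below recomputes the parameter count from the listed dimensions. The layout it assumes (bias-free projections, two RMSNorms per block, one QK-Norm gain vector each for Q and K shared across heads, and a final RMSNorm) is not stated in this card; it is simply a layout that reproduces the reported 9,968,640.

```python
# Dimensions copied from the Model Details and Architecture tables.
VOCAB = 8192
HIDDEN = 256
LAYERS = 12
Q_HEADS = 8
KV_HEADS = 2
HEAD_DIM = 32
FFN = 640


def count_parameters() -> int:
    """Count weights, assuming bias-free projections and tied embeddings."""
    embed = VOCAB * HIDDEN  # shared with the LM head via weight tying
    q_proj = HIDDEN * Q_HEADS * HEAD_DIM
    kv_proj = 2 * HIDDEN * KV_HEADS * HEAD_DIM  # K and V projections (GQA)
    o_proj = Q_HEADS * HEAD_DIM * HIDDEN
    attention = q_proj + kv_proj + o_proj
    mlp = 3 * HIDDEN * FFN  # SwiGLU: gate, up, and down projections
    block_norms = 2 * HIDDEN  # assumed: RMSNorm before attention and before the MLP
    qk_norms = 2 * HEAD_DIM  # assumed: one gain vector each for Q and K
    per_layer = attention + mlp + block_norms + qk_norms
    final_norm = HIDDEN  # assumed: one RMSNorm before the LM head
    return embed + LAYERS * per_layer + final_norm


print(f"{count_parameters():,}")  # 9,968,640
```

Under these assumptions the embedding matrix accounts for about 2.1M of the total, and each of the 12 blocks for roughly 0.66M.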

The design follows state-of-the-art SLM recipes (GPT-X2, Qwen3, Gemma2): QK-Norm prevents attention logit explosion, Z-loss stabilises early training and is disabled afterwards, and scaled residual init keeps the residual stream variance bounded.
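
For readers unfamiliar with Z-loss, here is a minimal sketch of the common formulation, an auxiliary penalty on the squared log-partition function of the output logits. This card gives neither the exact formula nor the coefficient, so both the form and the `coeff` default below are assumptions for illustration.

```python
import math


def z_loss(logits: list[float], coeff: float = 1e-4) -> float:
    """Auxiliary penalty coeff * log(Z)^2, where Z = sum(exp(logits)).

    coeff is a hypothetical value; the card does not state the one used.
    """
    peak = max(logits)
    # Numerically stable log-sum-exp over the vocabulary dimension.
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return coeff * log_z ** 2


small = z_loss([0.1, -0.2, 0.05, 0.0])    # logits near zero: small penalty
large = z_loss([10.0, -8.0, 12.0, 9.0])   # drifting logits: larger penalty
print(small < large)
```

The penalty grows as the logits drift away from a normalised scale, which is why it is typically added to the cross-entropy loss while training is least stable.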

Training

Data mix (25B tokens total):

Source          Weight
FineWeb-Edu     55%
Cosmopedia-v2   25%
FineWeb-HQ      10%
FineMath        10%

Optimizer: AdamW (fused), lr=3e-3, min_lr=3e-4, β=(0.9, 0.95), wd=0.1, grad_clip=1.0

LR schedule: Warmup (1k steps) → stable → cosine decay tail (last 15% of steps)
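
The schedule above can be sketched as a plain function of the step index. The linear warmup shape and the `total_steps` argument are assumptions, since the card states only the warmup length, the peak and minimum learning rates, and the size of the decay tail.

```python
import math

PEAK_LR = 3e-3        # lr from the optimizer settings
MIN_LR = 3e-4         # min_lr from the optimizer settings
WARMUP_STEPS = 1_000
DECAY_FRACTION = 0.15


def learning_rate(step: int, total_steps: int) -> float:
    """Warmup, constant plateau, then a cosine tail down to MIN_LR."""
    decay_steps = round(total_steps * DECAY_FRACTION)
    decay_start = total_steps - decay_steps
    if step < WARMUP_STEPS:
        # Assumed: linear warmup from near zero to the peak rate.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    if step < decay_start:
        return PEAK_LR
    # Cosine interpolation from PEAK_LR to MIN_LR over the final steps.
    progress = min((step - decay_start) / max(1, decay_steps), 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))


# Hypothetical run length, chosen only to make the three phases easy to see.
for step in (0, 500, 5_000, 9_000, 10_000):
    print(step, f"{learning_rate(step, total_steps=10_000):.2e}")
```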

Batch: 512K tokens/step (micro-batch 32 × grad_accum 16 × seq_len 1024)
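
The batch arithmetic can be checked in a few lines. The sketch assumes "25B" means exactly 25e9 tokens and that the whole global batch is built from the stated micro-batch and accumulation factors, so the resulting step counts are approximate.

```python
import math

MICRO_BATCH = 32
GRAD_ACCUM = 16
SEQ_LEN = 1024
TOTAL_TOKENS = 25_000_000_000  # assumption: "25B" taken as exactly 25e9

tokens_per_step = MICRO_BATCH * GRAD_ACCUM * SEQ_LEN
total_steps = math.ceil(TOTAL_TOKENS / tokens_per_step)
decay_steps = round(total_steps * 0.15)  # cosine tail covers the last 15% of steps

print(tokens_per_step)  # 524288, i.e. 512 * 1024
print(total_steps)      # approximate optimizer steps for the full run
print(decay_steps)      # approximate length of the decay tail
```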

Hardware: NVIDIA GB10, bfloat16, torch.compile

Evaluation

Zero-shot evaluation on the Open SLM Leaderboard benchmarks:

Benchmark                   Score
HellaSwag (acc_norm)        26.53%
ARC-Easy (acc_norm)         30.47%
ARC-Challenge (acc_norm)    25.00%
PIQA (acc_norm)             50.92%
ArithMark-2.0               24.32%
Avg                         32.38%

Avg = (HellaSwag + (ARC-Easy + ARC-Challenge) / 2 + PIQA + ArithMark) / 4
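
The averaging rule can be reproduced from the reported scores. The dictionary keys and function name below are illustrative, not leaderboard identifiers.

```python
# Scores copied from the evaluation table (percent).
scores = {
    "hellaswag": 26.53,
    "arc_easy": 30.47,
    "arc_challenge": 25.00,
    "piqa": 50.92,
    "arithmark": 24.32,
}


def leaderboard_average(s: dict[str, float]) -> float:
    """Average four task groups, with the two ARC splits pooled into one."""
    arc = (s["arc_easy"] + s["arc_challenge"]) / 2
    return (s["hellaswag"] + arc + s["piqa"] + s["arithmark"]) / 4


print(f"{leaderboard_average(scores):.2f}")  # 32.38
```

Pooling the two ARC splits first means ARC as a whole carries the same weight as each of the other three benchmarks.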

Evaluated using lm-evaluation-harness and the ArithMark-2.0 custom benchmark script.

Usage

This model is a research artifact for benchmarking, not a chat or generation model. At 10M parameters it is better suited to log-likelihood ranking tasks (multiple-choice benchmarks) than to free-text generation.

Scoring / ranking (recommended)

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import torch.nn.functional as F

model = AutoModelForCausalLM.from_pretrained(
    "liodon-ai/slm-10m",
    trust_remote_code=True,
    dtype=torch.bfloat16,
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("liodon-ai/slm-10m", trust_remote_code=True)

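def score(context, completion):
    # Mean log-likelihood per completion token, conditioned on the context.
    # Both strings are tokenized without special tokens so ctx_len indexes the same sequence.
    ctx_len = len(tokenizer.encode(context, add_special_tokens=False))
    full = tokenizer.encode(context + completion, add_special_tokens=False, return_tensors="pt").to("cuda")
    with torch.no_grad():
        logits = model(full).logits[0].float()  # upcast bfloat16 logits before the loss
    # Logits at position i predict token i + 1, hence the one-step shift.
    return -F.cross_entropy(logits[ctx_len - 1:-1], full[0, ctx_len:]).item()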

context = "Which is an example of a renewable energy resource? Answer:"
choices = [" biomass", " coal", " gas", " oil"]
scores  = [score(context, c) for c in choices]
best    = choices[scores.index(max(scores))]
print(f"Best answer: {best.strip()}")
# → Best answer: biomass
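
Because score() returns a mean per-token log-likelihood in nats, it converts directly into perplexity for the ranking and perplexity-evaluation uses listed under Intended Use. The helper name and the example scores below are hypothetical; real values would come from score().

```python
import math


def perplexity(mean_log_likelihood: float) -> float:
    """Convert a mean per-token log-likelihood (in nats) into perplexity."""
    return math.exp(-mean_log_likelihood)


# Hypothetical scores for three candidates, not real model outputs.
example_scores = [-2.1, -3.4, -2.9]
ranked = sorted(example_scores, key=perplexity)  # lowest perplexity first
print(ranked[0])  # -2.1

# Reference point: uniform guessing over the 8,192-token vocabulary
# corresponds to a perplexity of about 8192.
print(round(perplexity(-math.log(8192))))
```

Ranking by lowest perplexity and ranking by highest mean log-likelihood give the same order, so either can be used to pick a candidate.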

Citation

@software{liodonai2026slm10m,
  author = {{Liodon AI}},
  title = {SLM-10M},
  year = {2026},
  url = {https://huggingface.co/liodon-ai/slm-10m}
}

License

Apache 2.0
