GPT-S-5M

GPT-S-5M is a first-generation model in the GPT-S small-model family. It has 5M parameters, 9 layers, a custom 4K-vocabulary tokenizer, and an all-new Exclusive Grouped-query Attention (XGQA) mechanism, and it was trained from scratch on 25B tokens from a 3-source corpus.

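At 5M parameters, GPT-S-5M achieves the highest average score among the small open models evaluated in this benchmark set, including pythia-31m, a model roughly 6x larger. It does not lead on every task: pythia-31m scores higher on ARC (easy) and Spark-5M-Base-v4 scores higher on Arithmark.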

Benchmarks

All models were evaluated in bf16 with an internal harness modeled on EleutherAI/lm-eval-harness. Scores are zero-shot, and normalized accuracy is reported where available.

Company	Model	HellaSwag	ARC (easy)	PIQA	Arithmark	BLiMP	Average
Axiomic Labs	GPT-S-5M	27.39%	33.16%	57.13%	31.50%	72.21%	44.28%
EleutherAI	pythia-31m	27.14%	33.88%	56.26%	29.44%	67.78%	42.90%
EleutherAI	pythia-14m	26.20%	32.28%	55.88%	28.06%	66.75%	41.83%
LH-Tech-AI	Spark-5M-Base-v4	27.03%	33.21%	53.43%	32.70%	62.17%	41.71%
SupraLabs	Supra-Mini-v5-8M	26.38%	33.33%	54.03%	27.30%	63.83%	40.97%
SupraLabs	Supra-Mini-v4-2M	25.52%	30.98%	51.90%	29.72%	60.57%	39.74%
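
The Average column is consistent with an unweighted mean of the five task scores. The short check below recomputes it from the table; the variable and function names are illustrative and not part of the evaluation harness.

```python
# Per-task scores (%) copied from the table above, in column order:
# HellaSwag, ARC (easy), PIQA, Arithmark, BLiMP, followed by the reported Average.
rows = {
    "GPT-S-5M": ([27.39, 33.16, 57.13, 31.50, 72.21], 44.28),
    "pythia-31m": ([27.14, 33.88, 56.26, 29.44, 67.78], 42.90),
    "pythia-14m": ([26.20, 32.28, 55.88, 28.06, 66.75], 41.83),
    "Spark-5M-Base-v4": ([27.03, 33.21, 53.43, 32.70, 62.17], 41.71),
    "Supra-Mini-v5-8M": ([26.38, 33.33, 54.03, 27.30, 63.83], 40.97),
    "Supra-Mini-v4-2M": ([25.52, 30.98, 51.90, 29.72, 60.57], 39.74),
}


def mean_score(scores):
    """Unweighted mean across tasks (assumed definition of 'Average')."""
    return sum(scores) / len(scores)


for name, (scores, reported) in rows.items():
    print(f"{name:18s} mean={mean_score(scores):.2f} reported={reported:.2f}")
```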

Architecture

Component	Details
Position encoding	RoPE, theta=2,500
Normalization	RMSNorm
Feed-forward	SwiGLU
Attention	Exclusive Grouped-query Attention (XGQA), 6 query heads / 2 KV heads
Embeddings	Weight tied
Context length	512 tokens
Parameters	5,158,464
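
To illustrate the 6 query heads / 2 KV heads layout, here is a minimal sketch of standard grouped-query head sharing. It does not implement the "Exclusive" part of XGQA, which this card does not describe. The grouping of consecutive query heads onto one KV head is an assumption borrowed from common GQA implementations, and the function names are hypothetical.

```python
NUM_HEADS = 6       # query heads (from the table above)
NUM_KV_HEADS = 2    # key/value heads (from the table above)
HEAD_DIM = 32       # head_dim from the Config section
BLOCK_SIZE = 512    # context length in tokens


def kv_head_for(query_head: int) -> int:
    """Standard GQA grouping (assumed): consecutive query heads share one KV head."""
    group_size = NUM_HEADS // NUM_KV_HEADS
    return query_head // group_size


def kv_cache_values_per_layer(num_kv_heads: int) -> int:
    """Number of cached key and value scalars per layer for a full context."""
    return 2 * num_kv_heads * HEAD_DIM * BLOCK_SIZE


mapping = {q: kv_head_for(q) for q in range(NUM_HEADS)}
print(mapping)

gqa_cache = kv_cache_values_per_layer(NUM_KV_HEADS)
mha_cache = kv_cache_values_per_layer(NUM_HEADS)
print(f"KV cache per layer: {gqa_cache} values with 2 KV heads, "
      f"{mha_cache} with 6 (ratio {mha_cache // gqa_cache}x)")
```

Under these assumptions, each KV head serves three query heads, so the per-layer KV cache is one third the size it would be with one KV head per query head.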

Config

vocab_size       = 4096
hidden_size      = 192
num_layers       = 9
num_heads        = 6
num_kv_heads     = 2
head_dim         = 32
intermediate     = 672
block_size       = 512
rope_theta       = 2500
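
The reported parameter count can be reproduced from this config. The sketch below assumes bias-free linear layers, two RMSNorm weights per block plus one final RMSNorm, a three-matrix SwiGLU feed-forward, and no extra parameters from XGQA; these are assumptions, not confirmed details of the implementation, though the total they give matches the figure in the Architecture table.

```python
# Values from the Config section.
vocab_size, hidden_size, num_layers = 4096, 192, 9
num_heads, num_kv_heads, head_dim = 6, 2, 32
intermediate = 672

# Token embedding; the output head reuses it because weights are tied.
embedding = vocab_size * hidden_size

# Attention projections (assumed bias-free).
q_proj = hidden_size * num_heads * head_dim
k_proj = hidden_size * num_kv_heads * head_dim
v_proj = hidden_size * num_kv_heads * head_dim
o_proj = num_heads * head_dim * hidden_size
attention = q_proj + k_proj + v_proj + o_proj

# SwiGLU feed-forward: gate, up, and down matrices (assumed bias-free).
mlp = 3 * hidden_size * intermediate

# Two RMSNorm weight vectors per block (assumed: pre-attention and pre-MLP).
norms = 2 * hidden_size

per_layer = attention + mlp + norms
final_norm = hidden_size

total = embedding + num_layers * per_layer + final_norm
print(f"{total:,}")  # 5,158,464
```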

Training

GPT-S-5M was trained on 25B tokens drawn from a mixture of educational web text, synthetic textbook-style material, and higher-quality general web text.

Source	Dataset	Mix	Purpose
FineWeb-Edu	HuggingFaceFW/fineweb-edu	55%	Primary educational web text
Cosmopedia v2	HuggingFaceTB/smollm-corpus	30%	Synthetic textbook-style coverage
FineWeb-HQ	epfml/FineWeb-HQ	15%	Higher-quality general web text
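
As a rough illustration of what the mix implies, the sketch below converts the percentages into token budgets and draws sources with matching probabilities. It assumes the mix is measured in tokens; the sampler and its names are hypothetical and do not describe the actual data loader.

```python
import random

TOTAL_TOKENS = 25_000_000_000

# Mix percentages from the table above (assumed to be shares of training tokens).
mix_percent = {"FineWeb-Edu": 55, "Cosmopedia v2": 30, "FineWeb-HQ": 15}

# Integer arithmetic avoids floating-point rounding in the budgets.
tokens_per_source = {
    name: TOTAL_TOKENS * pct // 100 for name, pct in mix_percent.items()
}
for name, tokens in tokens_per_source.items():
    print(f"{name:14s} {tokens / 1e9:.2f}B tokens")


def pick_source(rng: random.Random) -> str:
    """Hypothetical sampler: choose the source of the next document by mix weight."""
    names = list(mix_percent)
    weights = list(mix_percent.values())
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)
sample = [pick_source(rng) for _ in range(10)]
print(sample)
```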

Hyperparameters

Hyperparameter	Value
Optimizer	AdamW
Adam betas	0.9 / 0.95
Weight decay	0.01
Peak learning rate	2.5e-3
Minimum learning rate	0
LR schedule	Warmup-stable-decay
Warmup steps	1,500
Decay start	70% of training
Training tokens	25B
Total batch size	262,144 tokens
Microbatch	128 x 512 tokens
Gradient accumulation steps	4
Gradient clipping	1.0
Precision	bfloat16 autocast
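
The batch-size figures and the warmup-stable-decay schedule can be sketched as follows. The step count is derived from the values above; the linear warmup and linear decay shapes are assumptions, since the table does not specify them, and `lr_at` is a hypothetical helper rather than the training code. The batch arithmetic also assumes a single data-parallel rank, which is what makes 128 x 512 x 4 equal the stated total batch size.

```python
TOTAL_TOKENS = 25_000_000_000
MICROBATCH_SEQS, BLOCK_SIZE, GRAD_ACCUM = 128, 512, 4

PEAK_LR = 2.5e-3
MIN_LR = 0.0
WARMUP_STEPS = 1500
DECAY_START_PERCENT = 70

# 128 sequences x 512 tokens x 4 accumulation steps = 262,144 tokens per update.
tokens_per_step = MICROBATCH_SEQS * BLOCK_SIZE * GRAD_ACCUM
total_steps = TOTAL_TOKENS // tokens_per_step
decay_start = total_steps * DECAY_START_PERCENT // 100


def lr_at(step: int) -> float:
    """Warmup-stable-decay learning rate (linear warmup and decay are assumed)."""
    if step < WARMUP_STEPS:
        # Warmup: ramp linearly up to the peak learning rate.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return PEAK_LR
    # Decay phase: interpolate linearly from peak to minimum.
    progress = min((step - decay_start) / (total_steps - decay_start), 1.0)
    return PEAK_LR + (MIN_LR - PEAK_LR) * progress


print(tokens_per_step, total_steps, decay_start)
for step in (0, WARMUP_STEPS - 1, decay_start, total_steps):
    print(step, lr_at(step))
```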

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "AxiomicLabs/GPT-S-5M"

# trust_remote_code=True lets transformers run the custom code shipped with the repository.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# This is a base model, so give it text to continue rather than an instruction.
prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a continuation; prompt plus new tokens should stay within the 512-token context.
with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=120,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
        no_repeat_ngram_size=4,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))

Limitations

This is a small base language model. It is not instruction-tuned, has limited factual capacity, and uses a 512-token context window.
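
Because of the 512-token window, long prompts may need trimming before generation. The helper below is a hypothetical sketch that keeps only the most recent prompt tokens so that the prompt plus the requested new tokens fits; how the model's own code handles longer inputs is not documented here.

```python
BLOCK_SIZE = 512  # context length from the Architecture table


def truncate_prompt_ids(token_ids, max_new_tokens, block_size=BLOCK_SIZE):
    """Keep the last tokens of a prompt so prompt + generation fits the context."""
    budget = block_size - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than block_size")
    # Negative slicing returns the whole list when it is already short enough.
    return token_ids[-budget:]


long_prompt = list(range(1000))  # stand-in for tokenizer output
trimmed = truncate_prompt_ids(long_prompt, max_new_tokens=120)
print(len(trimmed))  # 392
```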
