
GPT-S-5M

GPT-S-5M is a first-generation model in the GPT-S small-model family. It has 5M parameters, 9 layers, a custom 4K-entry tokenizer, and the all-new Exclusive Grouped-query Attention (XGQA), and was trained from scratch on 25B tokens from a three-source corpus.

At 5M parameters, GPT-S-5M achieves the highest average score among the small open models evaluated in this benchmark set, ahead of pythia-31m, a model about 6x larger.

Benchmarks

All models were evaluated in bf16 with an internal harness modeled on EleutherAI/lm-evaluation-harness. Scores are zero-shot; normalized accuracy is reported where available.
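For readers unfamiliar with the metric, the sketch below shows how normalized accuracy is commonly computed for multiple-choice tasks in harnesses such as EleutherAI's: each candidate answer's summed log-likelihood is divided by its byte length before the best one is picked. The function name and the example numbers are hypothetical, and the internal harness used here may differ in detail.

```python
def pick_choice(loglikelihoods, choices, normalize=True):
    """Return the index of the highest-scoring answer choice.

    loglikelihoods: summed log-probability of each choice given the context.
    choices: the answer strings, used only for byte-length normalization.
    """
    if normalize:
        scores = [ll / len(c.encode("utf-8")) for ll, c in zip(loglikelihoods, choices)]
    else:
        scores = list(loglikelihoods)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical scores: the longer answer has a lower total log-likelihood
# but a better per-byte score.
choices = ["no", "yes, because heat rises"]
lls = [-4.0, -9.2]
raw = pick_choice(lls, choices, normalize=False)
norm = pick_choice(lls, choices, normalize=True)
print(raw, norm)
```

Without normalization, short answers tend to win simply because they contain fewer tokens; dividing by length removes that bias.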

| Company | Model | HellaSwag | ARC (easy) | PIQA | Arithmark | BLiMP | Average |
|---|---|---|---|---|---|---|---|
| Axiomic Labs | GPT-S-5M | 27.39% | 33.16% | 57.13% | 31.50% | 72.21% | 44.28% |
| EleutherAI | pythia-31m | 27.14% | 33.88% | 56.26% | 29.44% | 67.78% | 42.90% |
| EleutherAI | pythia-14m | 26.20% | 32.28% | 55.88% | 28.06% | 66.75% | 41.83% |
| LH-Tech-AI | Spark-5M-Base-v4 | 27.03% | 33.21% | 53.43% | 32.70% | 62.17% | 41.71% |
| SupraLabs | Supra-Mini-v5-8M | 26.38% | 33.33% | 54.03% | 27.30% | 63.83% | 40.97% |
| SupraLabs | Supra-Mini-v4-2M | 25.52% | 30.98% | 51.90% | 29.72% | 60.57% | 39.74% |
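
The Average column appears to be the unweighted mean of the five task scores. A quick check, using the GPT-S-5M and pythia-31m rows from the table above:

```python
# Task scores in table order: HellaSwag, ARC (easy), PIQA, Arithmark, BLiMP.
scores = {
    "GPT-S-5M": [27.39, 33.16, 57.13, 31.50, 72.21],
    "pythia-31m": [27.14, 33.88, 56.26, 29.44, 67.78],
}

# Unweighted mean, rounded to two decimals like the table.
averages = {name: round(sum(vals) / len(vals), 2) for name, vals in scores.items()}

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}%")
```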

Architecture

| Component | Details |
|---|---|
| Position encoding | RoPE, theta = 2,500 |
| Normalization | RMSNorm |
| Feed-forward | SwiGLU |
| Attention | Exclusive Grouped-query Attention (XGQA), 6 query heads / 2 KV heads |
| Embeddings | Weight tied |
| Context length | 512 tokens |
| Parameters | 5,158,464 |
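
To make the attention settings concrete, the sketch below derives two quantities from these values: the RoPE rotation frequencies for theta = 2,500 with the head dimension of 32 listed in the config, and the query-to-KV head grouping used by standard grouped-query attention. The "exclusive" part of XGQA is not described in this card, so the mapping shown is plain GQA and should be read as an assumption, not as the model's actual implementation.

```python
ROPE_THETA = 2_500
HEAD_DIM = 32
NUM_HEADS = 6
NUM_KV_HEADS = 2

# Standard RoPE: one rotation frequency per pair of channels,
# inv_freq[i] = theta ** (-2i / head_dim).
inv_freq = [ROPE_THETA ** (-2 * i / HEAD_DIM) for i in range(HEAD_DIM // 2)]

# Standard GQA (assumed): consecutive query heads share one KV head.
group_size = NUM_HEADS // NUM_KV_HEADS
kv_head_for_query = [q // group_size for q in range(NUM_HEADS)]

print(len(inv_freq), inv_freq[0], inv_freq[-1])
print(group_size, kv_head_for_query)
```

With 6 query heads and 2 KV heads, each KV head serves a group of 3 query heads, which cuts the key/value projection weights and the KV cache to a third of what full multi-head attention would need.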

Config

vocab_size       = 4,096
hidden_size      = 192
num_layers       = 9
num_heads        = 6
num_kv_heads     = 2
head_dim         = 32
intermediate     = 672
block_size       = 512
rope_theta       = 2,500
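
These values are consistent with the stated parameter count. The sketch below tallies the weights under a standard decoder layout that the card does not spell out: bias-free linear projections, two RMSNorm layers per block plus a final norm, a three-matrix SwiGLU feed-forward, and no extra parameters from XGQA.

```python
vocab_size, hidden_size, num_layers = 4096, 192, 9
num_heads, num_kv_heads, head_dim = 6, 2, 32
intermediate = 672

# Token embedding, shared with the output head (weight tied).
embedding = vocab_size * hidden_size

# Attention projections (assumed bias-free).
q_proj = hidden_size * num_heads * head_dim
kv_proj = 2 * hidden_size * num_kv_heads * head_dim  # key and value
o_proj = num_heads * head_dim * hidden_size
attention = q_proj + kv_proj + o_proj

# SwiGLU feed-forward: gate, up, and down matrices.
feed_forward = 3 * hidden_size * intermediate

# Two RMSNorm weight vectors per block (assumed pre-attention and pre-MLP).
norms = 2 * hidden_size

per_layer = attention + feed_forward + norms
total = embedding + num_layers * per_layer + hidden_size  # plus final RMSNorm

print(f"{total:,}")  # 5,158,464
```

Under these assumptions the embedding table accounts for 786,432 of the weights, about 15% of the model, which is why the small 4,096-entry vocabulary matters at this scale.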

Training

GPT-S-5M was trained on 25B tokens drawn from a mixture of educational web text, synthetic textbook-style material, and higher-quality general web text.

| Source | Dataset | Mix | Purpose |
|---|---|---|---|
| FineWeb-Edu | HuggingFaceFW/fineweb-edu | 55% | Primary educational web text |
| Cosmopedia v2 | HuggingFaceTB/smollm-corpus | 30% | Synthetic textbook-style coverage |
| FineWeb-HQ | epfml/FineWeb-HQ | 15% | Higher-quality general web text |
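
Applied to the 25B-token budget, the mix percentages translate into approximate per-source token counts. This assumes the percentages are measured in tokens; the card does not say whether they refer to tokens or documents.

```python
TOTAL_TOKENS = 25_000_000_000
mix_percent = {"FineWeb-Edu": 55, "Cosmopedia v2": 30, "FineWeb-HQ": 15}

# The three shares should cover the whole budget.
assert sum(mix_percent.values()) == 100

tokens_per_source = {
    name: TOTAL_TOKENS * pct // 100 for name, pct in mix_percent.items()
}

for name, tokens in tokens_per_source.items():
    print(f"{name}: {tokens / 1e9:.2f}B tokens")
```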

Hyperparameters

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Adam betas | 0.9 / 0.95 |
| Weight decay | 0.01 |
| Peak learning rate | 2.5e-3 |
| Minimum learning rate | 0 |
| LR schedule | Warmup-stable-decay |
| Warmup steps | 1,500 |
| Decay start | 70% of training |
| Training tokens | 25B |
| Total batch size | 262,144 tokens |
| Microbatch | 128 x 512 tokens |
| Gradient accumulation steps | 4 |
| Gradient clipping | 1.0 |
| Precision | bfloat16 autocast |
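
The batch settings and the schedule can be tied together with a little arithmetic. The sketch below derives the optimizer step count from the values above and implements one possible warmup-stable-decay schedule. Linear warmup and linear decay are assumptions: the card gives the decay start and the minimum learning rate but not the shape of either ramp. `lr_at` is a hypothetical helper name, and the step count assumes a single data-parallel worker, since 128 x 512 x 4 already equals the total batch size.

```python
MICROBATCH_SEQS, SEQ_LEN, GRAD_ACCUM = 128, 512, 4
TOKENS_PER_STEP = MICROBATCH_SEQS * SEQ_LEN * GRAD_ACCUM  # 262,144
TOTAL_STEPS = 25_000_000_000 // TOKENS_PER_STEP           # 95,367

PEAK_LR, MIN_LR = 2.5e-3, 0.0
WARMUP_STEPS = 1_500
DECAY_START = int(0.70 * TOTAL_STEPS)


def lr_at(step):
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Linear warmup (assumed shape).
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    if step < DECAY_START:
        # Stable phase at the peak learning rate.
        return PEAK_LR
    # Linear decay to MIN_LR (assumed shape).
    progress = (step - DECAY_START) / (TOTAL_STEPS - DECAY_START)
    return PEAK_LR + (MIN_LR - PEAK_LR) * min(progress, 1.0)


for step in (0, WARMUP_STEPS, DECAY_START, TOTAL_STEPS):
    print(step, lr_at(step))
```

Warmup covers roughly the first 1.6% of training under these numbers, and the final 30% of steps are spent decaying toward zero.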

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "AxiomicLabs/GPT-S-5M"

# trust_remote_code=True lets transformers run the custom model code
# shipped in the repository.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires the accelerate package
)

prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Prompt tokens plus max_new_tokens must fit in the 512-token context.
with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=120,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
        no_repeat_ngram_size=4,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))

Limitations

This is a small base language model. It is not instruction-tuned, has limited factual capacity, and uses a 512-token context window.
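
In practice the 512-token window has to hold both the prompt and the generated text. A minimal sketch of that budget, using the `max_new_tokens=120` value from the usage example. `fit_prompt` is a hypothetical helper, and keeping the most recent tokens is one reasonable truncation policy, not something the card prescribes.

```python
CONTEXT_LENGTH = 512


def fit_prompt(token_ids, max_new_tokens, context_length=CONTEXT_LENGTH):
    """Drop the oldest prompt tokens so prompt + generation fits the window."""
    budget = context_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for a prompt")
    return token_ids[-budget:]


prompt_ids = list(range(600))  # stand-in for a 600-token prompt
trimmed = fit_prompt(prompt_ids, max_new_tokens=120)
print(len(trimmed))  # 392
```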
