TinyStories GPT (19M)

A small (~19.2M-parameter) decoder-only GPT trained from scratch on TinyStories. It writes simple, coherent children's stories and serves as a compact, hackable reference for modern LLM architecture and optimization techniques. It trains end-to-end in about 8 minutes on a single consumer GPU (RTX 2060 Super, 8 GB).

This checkpoint uses the full modded-nanoGPT-style recipe: the Muon optimizer, QK-Norm, a squared-ReLU MLP, logit soft-capping, and zero-init projections. Each technique was A/B-measured on the RTX 2060 Super; together they lower validation loss from 2.65 (plain AdamW/SwiGLU baseline) to 2.40 at the same 3,000 steps.
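Two of the listed techniques are small enough to show in a few lines. The sketch below is a dependency-free illustration, not the code in model.py: the function names are made up for this example, and the cap value of 15 is the one listed in the architecture table.

```python
import math


def squared_relu(x: float) -> float:
    """Squared-ReLU activation: max(0, x) squared."""
    return max(0.0, x) ** 2


def soft_cap(logit: float, cap: float = 15.0) -> float:
    """Smoothly bound a logit to (-cap, cap) via cap * tanh(logit / cap)."""
    return cap * math.tanh(logit / cap)


if __name__ == "__main__":
    for x in (-2.0, 0.5, 3.0):
        print(f"squared_relu({x}) = {squared_relu(x)}")
    for z in (1.0, 15.0, 100.0):
        print(f"soft_cap({z}) = {soft_cap(z):.3f}")
```

Small logits pass through almost unchanged, while large ones saturate just below the cap, which keeps the output distribution from becoming arbitrarily peaked.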

Sample output

Once upon a time, there was a little girl named Lily. She loved to play with her toys and her favorite toy, a toy truck. One day, Lily's mommy made her a yummy chocolate cake to make her happy. Lily's friend, Timmy, came over to play...

Lily and Tom went to the park and saw a big dog... "Mom, mom, the dog is coming!" Lily cried. "The dog is not mean. It was friendly and friendly. It wants to play with us."

Architecture

A LLaMA-/modded-nanoGPT-style decoder-only transformer:

Component             Choice
Layers / heads / dim  8 layers, 6 heads, n_embd 384
Context length        256 tokens
Vocabulary            16,384 (ByteLevel BPE)
Position encoding     RoPE
Attention             Grouped-Query Attention (2 KV heads) + QK-Norm
MLP                   squared-ReLU (ungated)
Normalization         RMSNorm
Init                  zero-init block output projections (muP-like)
Logits                soft-capped at 15 (cap·tanh(logits/cap))
Extra heads           Multi-Token Prediction (2 auxiliary heads)
Weight tying          token embedding ↔ output head (and MTP heads)
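As a sanity check on the model size, the table can be turned into a rough parameter count. This is only an estimate under assumptions the card does not state: an MLP hidden size of 4 × n_embd, no bias terms, two RMSNorm weight vectors per block plus a final norm, and no separate parameters for the QK-Norm or MTP heads.

```python
# Rough parameter count from the architecture table.
# Assumptions (NOT stated in the model card): MLP hidden size is 4 * n_embd,
# linear layers have no bias, each block has two RMSNorm weight vectors,
# and QK-Norm / MTP heads add no parameters of their own.
n_layer, n_head, n_kv_head, n_embd, vocab = 8, 6, 2, 384, 16_384
mlp_hidden = 4 * n_embd  # assumed expansion factor

head_dim = n_embd // n_head      # 64 dimensions per attention head
kv_dim = n_kv_head * head_dim    # 128: width of K and V under GQA

embedding = vocab * n_embd       # counted once, since it is tied to the output head
attention = 2 * n_embd * n_embd + 2 * n_embd * kv_dim  # Q and O, plus K and V
mlp = 2 * n_embd * mlp_hidden    # up and down projections (ungated)
norms = 2 * n_embd               # two RMSNorm weight vectors per block

per_block = attention + mlp + norms
total = embedding + n_layer * per_block + n_embd  # last term: final norm

print(f"embedding: {embedding:,}")
print(f"per block: {per_block:,}")
print(f"total:     {total:,}")
```

Under these assumptions the estimate comes to about 18.9M, slightly below the reported ~19.2M; the difference would have to come from parts not modeled here, presumably any parameters specific to the MTP heads.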

Training

Dataset          TinyStories (~2.1M stories)
Steps            3,000
Batch            40 × 256 tokens
Optimizer        Muon (2D weights) + AdamW (embeddings/norms), peak LR 3e-3, cosine schedule
Precision        fp16 mixed precision, torch.compile
Hardware         1× RTX 2060 Super (8 GB), ~8 minutes
Train loss       2.47 (combined next-token + MTP auxiliary)
Validation loss  2.40 (perplexity ~11.0)
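The numbers above can be related with a little arithmetic. The sketch below assumes the loss is a mean cross-entropy in nats (so perplexity is its exponential) and that each step processes exactly one 40 × 256 batch with no gradient accumulation; the card states neither explicitly.

```python
import math

steps, batch_size, context = 3_000, 40, 256

# Tokens seen per optimizer step and over the whole run
# (assumes no gradient accumulation).
tokens_per_step = batch_size * context   # 10,240
total_tokens = steps * tokens_per_step   # 30,720,000


def perplexity(loss_nats: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy in nats."""
    return math.exp(loss_nats)


print(f"tokens per step: {tokens_per_step:,}")
print(f"total tokens:    {total_tokens:,}")
print(f"val perplexity:      {perplexity(2.40):.1f}")
print(f"baseline perplexity: {perplexity(2.65):.1f}")
```

The exponential of 2.40 rounds to the ~11.0 perplexity quoted in the table, which is consistent with the loss being reported in nats.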

Usage

This is a custom architecture, so you need model.py from this repo (small, dependency-light). Download it next to your script, then:

import torch
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer
from model import GPT  # model.py downloaded from this repo

repo = "epoyraz/tinystories-25m"

# Load the checkpoint on CPU; it holds the model config and the state dict.
ckpt = torch.load(
    hf_hub_download(repo, "tinystories-25m.pt"),
    map_location="cpu", weights_only=True,
)
model = GPT(ckpt["config"]).eval()
model.load_state_dict(ckpt["model"])

# Encode a prompt, sample a continuation, and decode it back to text.
tok = Tokenizer.from_file(hf_hub_download(repo, "tokenizer.json"))
ids = tok.encode("Once upon a time,").ids
with torch.no_grad():  # inference only, no gradients needed
    out = model.generate(
        torch.tensor([ids]), max_new_tokens=120, temperature=0.7, top_k=40,
    )
print(tok.decode(out[0].tolist()))

Dependencies: pip install torch tokenizers huggingface_hub
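The temperature and top_k arguments passed to generate control how each next token is drawn. The sketch below shows the usual meaning of those two knobs in plain Python; it is an illustration only, `sample_top_k` is a hypothetical helper, and the actual logic inside model.py may differ.

```python
import math
import random


def sample_top_k(logits, temperature=0.7, top_k=40, rng=random):
    """Sample one token id using temperature scaling and top-k filtering.

    Hypothetical helper for illustration; requires temperature > 0.
    """
    # Keep the indices of the k largest logits.
    k = min(top_k, len(logits))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature-scaled softmax weights (max-subtracted for stability).
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(top, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [0.1, 2.0, -1.0, 5.0, 0.3]
    rng = random.Random(0)
    print([sample_top_k(toy_logits, top_k=2, rng=rng) for _ in range(10)])
```

Lower temperatures concentrate probability on the highest-scoring tokens, and a smaller top_k removes the long tail entirely; with top_k=1 the sampler reduces to greedy decoding.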

Files

  • tinystories-25m.pt — checkpoint (config + model state dict)
  • model.py — model definition (GPT, all techniques)
  • config.json — the model config, for reference
  • tokenizer.json — ByteLevel BPE tokenizer (16K vocab)

Limitations

  • Trained only on TinyStories — simple children's-story English, not a general assistant.
  • Small and lightly trained: occasional repetition, name swaps, or drift.
  • 256-token context.
