TinyStories GPT (19M)
A small (~19.2M parameter) decoder-only GPT trained from scratch on TinyStories. It writes simple, coherent children's stories and serves as a compact, hackable reference for modern LLM architecture and optimization techniques. It trains end-to-end in about 8 minutes on a single consumer GPU (RTX 2060 Super, 8 GB).
This checkpoint uses the full modded-nanoGPT-style recipe: the Muon optimizer plus QK-Norm, a squared-ReLU MLP, logit soft-capping, and zero-init projections. Each technique was A/B-tested on the RTX 2060 Super; together they lower validation loss from 2.65 (plain AdamW/SwiGLU baseline) to 2.40 at the same 3,000 steps.
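Two of the techniques named above are small enough to write out in a few lines. The sketch below is framework-free scalar Python for illustration only, not the code in model.py: `squared_relu` and `soft_cap` are hypothetical helper names, and the cap of 15 is the value listed in the Architecture table.

```python
import math

LOGIT_CAP = 15.0  # cap value listed in the Architecture table


def squared_relu(x: float) -> float:
    """Squared-ReLU activation: max(0, x) squared."""
    return max(0.0, x) ** 2


def soft_cap(logit: float, cap: float = LOGIT_CAP) -> float:
    """Smoothly squash a logit into [-cap, cap] via cap * tanh(logit / cap)."""
    return cap * math.tanh(logit / cap)


# Small logits pass through almost unchanged; large ones saturate near the cap.
for z in (1.0, 15.0, 100.0):
    print(f"soft_cap({z}) = {soft_cap(z):.3f}")
```

Because tanh is close to the identity near zero, soft-capping barely changes typical logits and only limits extreme ones.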
Sample output
Once upon a time, there was a little girl named Lily. She loved to play with her toys and her favorite toy, a toy truck. One day, Lily's mommy made her a yummy chocolate cake to make her happy. Lily's friend, Timmy, came over to play...
Lily and Tom went to the park and saw a big dog... "Mom, mom, the dog is coming!" Lily cried. "The dog is not mean. It was friendly and friendly. It wants to play with us."
Architecture
A LLaMA-/modded-nanoGPT-style decoder-only transformer:
| Component | Choice |
|---|---|
| Layers / heads / dim | 8 layers, 6 heads, n_embd 384 |
| Context length | 256 tokens |
| Vocabulary | 16,384 (ByteLevel BPE) |
| Position encoding | RoPE |
| Attention | Grouped-Query Attention (2 KV heads) + QK-Norm |
| MLP | squared-ReLU (ungated) |
| Normalization | RMSNorm |
| Init | zero-init block output projections (muP-like) |
| Logits | soft-capped at 15 (cap·tanh(logits/cap)) |
| Extra heads | Multi-Token Prediction (2 auxiliary heads) |
| Weight tying | token embedding ↔ output head (and MTP heads) |
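As a rough sanity check on the ~19.2M figure, the table above is enough for a back-of-the-envelope parameter count. This is an estimate under assumptions that this card does not state: a 4x MLP expansion and no bias terms. It also ignores norm gains and any parameters specific to the MTP heads.

```python
# Values taken from the Architecture table.
N_LAYER, N_HEAD, N_KV_HEAD, N_EMBD, VOCAB = 8, 6, 2, 384, 16_384

# ASSUMPTIONS (not stated in this card): 4x MLP expansion, no bias terms.
MLP_MULT = 4

head_dim = N_EMBD // N_HEAD    # 64 dims per head
kv_dim = N_KV_HEAD * head_dim  # 128: the K and V projections are narrower under GQA


def estimate_params() -> dict:
    """Rough weight count; norm gains and MTP-specific weights are left out."""
    embedding = VOCAB * N_EMBD  # counted once, since it is tied to the output head
    attn = 2 * N_EMBD * N_EMBD + 2 * N_EMBD * kv_dim  # Q and O full width, K and V at kv_dim
    mlp = 2 * N_EMBD * (MLP_MULT * N_EMBD)            # ungated: one up and one down projection
    blocks = N_LAYER * (attn + mlp)
    return {"embedding": embedding, "blocks": blocks, "total": embedding + blocks}


counts = estimate_params()
print({name: f"{value / 1e6:.2f}M" for name, value in counts.items()})
```

Under these assumptions the estimate comes to about 18.9M, in the same range as the stated ~19.2M; the remainder would presumably come from the pieces the estimate leaves out.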
Training
| Setting | Value |
|---|---|
| Dataset | TinyStories (~2.1M stories) |
| Steps | 3,000 |
| Batch | 40 × 256 tokens |
| Optimizer | Muon (2D weights) + AdamW (embeddings/norms), peak LR 3e-3, cosine schedule |
| Precision | fp16 mixed precision, torch.compile |
| Hardware | 1× RTX 2060 Super (8 GB), ~8 minutes |
| Train loss | 2.47 (combined next-token + MTP auxiliary) |
| Validation loss | 2.40 (perplexity ~11.0) |
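The numbers in the table above can be related with a little arithmetic. The sketch assumes the batch of 40 × 256 tokens is the full per-step batch (no gradient accumulation) and that the loss is a mean cross-entropy in nats; neither is stated explicitly here.

```python
import math

# Values taken from the Training table.
STEPS, BATCH_SEQS, SEQ_LEN = 3_000, 40, 256
VAL_LOSS = 2.40

# ASSUMPTION: 40 x 256 is the whole batch per optimizer step.
tokens_per_step = BATCH_SEQS * SEQ_LEN  # 10,240 tokens
total_tokens = STEPS * tokens_per_step  # 30,720,000 tokens seen in training

# ASSUMPTION: loss is mean cross-entropy in nats, so perplexity = exp(loss).
perplexity = math.exp(VAL_LOSS)

print(f"tokens per step: {tokens_per_step:,}")
print(f"total tokens:    {total_tokens:,}")
print(f"perplexity:      {perplexity:.2f}")
```

With these assumptions the run sees roughly 30.7M tokens, and exp(2.40) ≈ 11.0 matches the perplexity in the table.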
Usage
This is a custom architecture, so you need model.py from this repo (small,
dependency-light). Download it next to your script, then:
import torch
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

from model import GPT  # model.py downloaded from this repo

repo = "epoyraz/tinystories-25m"

# Load the checkpoint (config + model state dict) on CPU.
ckpt = torch.load(
    hf_hub_download(repo, "tinystories-25m.pt"),
    map_location="cpu",
    weights_only=True,
)
model = GPT(ckpt["config"]).eval()
model.load_state_dict(ckpt["model"])

# Tokenize a prompt, generate a continuation, and decode it.
tok = Tokenizer.from_file(hf_hub_download(repo, "tokenizer.json"))
ids = tok.encode("Once upon a time,").ids
out = model.generate(
    torch.tensor([ids]), max_new_tokens=120, temperature=0.7, top_k=40,
)
print(tok.decode(out[0].tolist()))
pip install torch tokenizers huggingface_hub
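The `temperature` and `top_k` arguments in the call above control how the next token is drawn. The internals of `model.generate` are not shown in this card, so the following is only a sketch of what these two knobs conventionally mean, written in plain Python; `sample_top_k` is a hypothetical helper, not a function from model.py.

```python
import math
import random


def sample_top_k(logits, temperature=0.7, top_k=40, rng=random):
    """Draw one token id from raw logits using temperature and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the ids of the top_k highest logits.
    k = min(top_k, len(logits))
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Divide by temperature, then exponentiate (subtracting the max for stability).
    scaled = [logits[i] / temperature for i in top_ids]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalizes the weights, so this is a softmax sample.
    return rng.choices(top_ids, weights=weights, k=1)[0]


toy_logits = [0.0, 5.0, 1.0, 4.0]  # made-up scores for a 4-token vocabulary
print(sample_top_k(toy_logits, top_k=2))  # only ids 1 and 3 can be drawn
```

Lower temperatures sharpen the distribution toward the highest-scoring token, and a smaller `top_k` removes the long tail of unlikely tokens.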
Files
- tinystories-25m.pt: checkpoint (config + model state dict)
- model.py: model definition (GPT, all techniques)
- config.json: the model config, for reference
- tokenizer.json: ByteLevel BPE tokenizer (16K vocab)
Limitations
- Trained only on TinyStories — simple children's-story English, not a general assistant.
- Small and lightly trained: occasional repetition, name swaps, or drift.
- 256-token context.
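Given the 256-token context, one way to stay inside it is to trim long prompts before generating. Whether `model.generate` already crops its input is not stated here, so treat this as an optional precaution; `crop_prompt` is a hypothetical helper.

```python
CONTEXT_LEN = 256  # context length from the Architecture table


def crop_prompt(ids, max_new_tokens, context_len=CONTEXT_LEN):
    """Keep the most recent prompt tokens so prompt + generation fits the context."""
    budget = context_len - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than the context length")
    return ids[-budget:]


long_prompt = list(range(300))  # stand-in for token ids from the tokenizer
print(len(crop_prompt(long_prompt, max_new_tokens=120)))
```

Prompts that already fit are returned unchanged; longer ones keep only their final tokens, which drops the oldest part of the story.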