TinyStories GPT (19M)

A small (~19.2M parameter) decoder-only GPT trained from scratch on TinyStories. It writes simple, coherent children's stories and serves as a compact, hackable reference for modern LLM architecture and optimization techniques. It trains end-to-end in about 8 minutes on a single consumer GPU (RTX 2060 Super, 8 GB).

This checkpoint uses the full modded-nanoGPT-style recipe: the Muon optimizer plus QK-Norm, a squared-ReLU MLP, logit soft-capping, and zero-init projections. Each technique was A/B-tested on the RTX 2060 Super; together they lower validation loss from 2.65 (plain AdamW/SwiGLU baseline) to 2.40 at the same 3,000 steps.
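To make two of these techniques concrete, here is a minimal scalar sketch of the squared-ReLU activation and of logit soft-capping. It is illustrative only, not the code in model.py: the function names are hypothetical, the cap of 15 is taken from the Architecture table, and a real implementation would apply the same formulas elementwise to tensors.

```python
import math


def squared_relu(x: float) -> float:
    """Squared-ReLU activation: max(0, x) squared."""
    return max(0.0, x) ** 2


def soft_cap(logit: float, cap: float = 15.0) -> float:
    """Smoothly bound a logit to the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)


if __name__ == "__main__":
    for x in (-2.0, 0.5, 3.0):
        print(f"squared_relu({x}) = {squared_relu(x)}")
    for z in (1.0, 15.0, 100.0):
        print(f"soft_cap({z}) = {soft_cap(z):.3f}")
```

Small logits pass through almost unchanged, while large ones saturate just below the cap, which keeps the output distribution from becoming arbitrarily peaked.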

Sample output

Once upon a time, there was a little girl named Lily. She loved to play with her toys and her favorite toy, a toy truck. One day, Lily's mommy made her a yummy chocolate cake to make her happy. Lily's friend, Timmy, came over to play...

Lily and Tom went to the park and saw a big dog... "Mom, mom, the dog is coming!" Lily cried. "The dog is not mean. It was friendly and friendly. It wants to play with us."

Architecture

A LLaMA-/modded-nanoGPT-style decoder-only transformer:

Component	Choice
Layers / heads / dim	8 layers, 6 heads, `n_embd` 384
Context length	256 tokens
Vocabulary	16,384 (ByteLevel BPE)
Position encoding	RoPE
Attention	Grouped-Query Attention (2 KV heads) + QK-Norm
MLP	squared-ReLU (ungated)
Normalization	RMSNorm
Init	zero-init block output projections (muP-like)
Logits	soft-capped at 15 (`cap·tanh(logits/cap)`)
Extra heads	Multi-Token Prediction (2 auxiliary heads)
Weight tying	token embedding ↔ output head (and MTP heads)
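
As a rough sanity check on the stated size, the shapes in the table can be turned into a parameter estimate. This is a sketch under assumptions that the card does not state: bias-free linear layers and an MLP hidden width of 4 × `n_embd`. It ignores RMSNorm gains and any extra parameters in the MTP heads.

```python
# Shapes from the Architecture table.
n_layer, n_head, n_kv_head, n_embd, vocab = 8, 6, 2, 384, 16_384
head_dim = n_embd // n_head              # 64
mlp_hidden = 4 * n_embd                  # ASSUMPTION: 4x expansion, not stated in the card

# Token embedding, shared with the output head through weight tying.
embedding = vocab * n_embd

# Attention: Q and output projections are full width; K and V use only 2 KV heads (GQA).
attn = (
    n_embd * n_embd                      # Q projection
    + 2 * n_embd * n_kv_head * head_dim  # K and V projections
    + n_embd * n_embd                    # output projection
)

# Ungated MLP: one up projection and one down projection.
mlp = 2 * n_embd * mlp_hidden

total = embedding + n_layer * (attn + mlp)
print(f"embedding: {embedding:,}")
print(f"per layer: {attn + mlp:,}")
print(f"total:     {total:,}")
```

Under these assumptions the estimate comes to about 18.9M parameters, close to the stated ~19.2M; the remainder would have to come from the terms left out here.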

Training


Setting	Value
Dataset	TinyStories (~2.1M stories)
Steps	3,000
Batch	40 × 256 tokens
Optimizer	Muon (2D weights) + AdamW (embeddings/norms), peak LR 3e-3, cosine schedule
Precision	fp16 mixed precision, `torch.compile`
Hardware	1× RTX 2060 Super (8 GB), ~8 minutes
Train loss	2.47 (combined next-token + MTP auxiliary)
Validation loss	2.40 (perplexity ~11.0)
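
The perplexity and the token budget follow directly from the numbers in the table. The short calculation below assumes the loss is a mean cross-entropy in nats per token, that "40 × 256" means 40 sequences of 256 tokens per optimizer step, and that there is no gradient accumulation; none of these is stated explicitly in the card.

```python
import math

val_loss = 2.40                            # validation loss from the table
batch_size, context_len, steps = 40, 256, 3_000

perplexity = math.exp(val_loss)            # perplexity = exp(cross-entropy in nats)
tokens_per_step = batch_size * context_len
total_tokens = tokens_per_step * steps     # ASSUMPTION: no gradient accumulation

print(f"perplexity:      {perplexity:.1f}")
print(f"tokens per step: {tokens_per_step:,}")
print(f"tokens seen:     {total_tokens:,}")
```

Under these assumptions the run sees roughly 30.7M tokens in total.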

Usage

This is a custom architecture, so loading it requires `model.py` from this repo (small, with few dependencies). Place it next to your script, then run:

import torch
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer
from model import GPT  # model.py downloaded from this repo

repo = "epoyraz/tinystories-25m"
ckpt = torch.load(
    hf_hub_download(repo, "tinystories-25m.pt"),
    map_location="cpu", weights_only=True,
)
model = GPT(ckpt["config"]).eval()
model.load_state_dict(ckpt["model"])

tok = Tokenizer.from_file(hf_hub_download(repo, "tokenizer.json"))
ids = tok.encode("Once upon a time,").ids
out = model.generate(
    torch.tensor([ids]), max_new_tokens=120, temperature=0.7, top_k=40,
)
print(tok.decode(out[0].tolist()))

Requirements:

pip install torch tokenizers huggingface_hub

Files

tinystories-25m.pt — checkpoint (config + model state dict)
model.py — model definition (GPT, all techniques)
config.json — the model config, for reference
tokenizer.json — ByteLevel BPE tokenizer (16K vocab)

Limitations

Trained only on TinyStories — simple children's-story English, not a general assistant.
Small and lightly trained: occasional repetition, name swaps, or drift.
256-token context.
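
Because of the 256-token context, a long prompt plus the requested generation length can exceed what the model was trained on. A minimal sketch of one way to handle this is to keep only the most recent prompt tokens. The helper `fit_prompt` is hypothetical and not part of model.py; whether `model.generate` already crops its input is not stated here.

```python
def fit_prompt(ids: list[int], max_new_tokens: int, context_len: int = 256) -> list[int]:
    """Keep the most recent prompt tokens so prompt + generation fits the context."""
    budget = context_len - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than context_len")
    # Negative slicing keeps the tail; shorter prompts are returned unchanged.
    return ids[-budget:]


if __name__ == "__main__":
    long_prompt = list(range(300))         # stand-in for tokenizer output
    trimmed = fit_prompt(long_prompt, max_new_tokens=120)
    print(len(trimmed), trimmed[0], trimmed[-1])
```

With the tokenizer from the Usage section, this could be applied as `ids = fit_prompt(tok.encode(text).ids, max_new_tokens=120)` before building the input tensor.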

References


DeepSeek-V3 Technical Report. arXiv:2412.19437, 2024.

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv:2305.13245, 2023.

TinyStories: How Small Can Language Models Be and Still Speak Coherent English? arXiv:2305.07759, 2023.

RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864, 2021.