NanoGPT - Shakespeare

A character-level GPT transformer trained from scratch on the complete works of Shakespeare.

Built as part of a mini AI lab demonstrating the full lifecycle of an AI model: Train → Deploy → Chat → Evaluate.


What this is

This is a minimal transformer language model: the same architecture as GPT, just smaller. It reads text one character at a time and learns to predict the next character. After training on Shakespeare, it generates text that sounds like Shakespeare.

Not because it memorized lines. Because it learned the patterns.
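
To make "one character at a time" concrete, here is a minimal sketch of character-level encoding and next-character targets. The names `stoi`, `itos`, `encode`, and `decode` are illustrative assumptions, not necessarily what the repository's code uses, and the toy string stands in for the real dataset.

```python
# Toy stand-in for the Shakespeare text (assumption: any string works here).
text = "To be, or not to be"

# Vocabulary: every unique character, sorted for a stable ordering.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> character

def encode(s):
    """Turn a string into a list of integer ids."""
    return [stoi[ch] for ch in s]

def decode(ids):
    """Turn a list of integer ids back into a string."""
    return "".join(itos[i] for i in ids)

ids = encode(text)

# Each training example pairs a context with the character that follows it.
examples = [(text[:i], text[i]) for i in range(1, 6)]
for context, target in examples:
    print(repr(context), "->", repr(target))
```

On the real dataset the same construction would yield the 65-entry vocabulary listed under "Architecture decisions".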


Architecture decisions

| Setting | Value | Why |
|---|---|---|
| Embedding dim | 256 | Small enough to train on a laptop. Large enough to learn vocabulary and style. |
| Attention heads | 8 | Heads can specialize: some may track character names, others meter and rhythm. |
| Transformer layers | 6 | Depth gives compositional power. Early layers tend to learn characters, later layers structure. |
| Context window | 256 characters | Long enough for a full speech. Short enough to fit in memory. |
| Vocab size | 65 | Every unique character in the dataset. Character-level, with no subword tokenization. |
| Parameters | ~4.8M | Deliberately small. The goal is understanding, not scale. |
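
As a rough check on the parameter count, the settings above can be plugged into a back-of-the-envelope tally. This sketch assumes a GPT-2-style block (a 4x-wide MLP, an untied output projection) and ignores biases and LayerNorm weights; the actual `model.py` may differ in these details, so treat the result as an estimate.

```python
# Assumed GPT-2-style layout; the real model.py may differ in details.
n_embd, n_head, n_layer = 256, 8, 6
block_size, vocab_size = 256, 65

head_dim = n_embd // n_head                      # dimensions handled by each head

token_emb = vocab_size * n_embd                  # one vector per character
pos_emb = block_size * n_embd                    # one vector per position
attn = 3 * n_embd * n_embd + n_embd * n_embd     # Q, K, V projections + output projection
mlp = 2 * n_embd * (4 * n_embd)                  # expand to 4x width, project back
per_layer = attn + mlp
lm_head = n_embd * vocab_size                    # untied output projection (assumption)

total = token_emb + pos_emb + n_layer * per_layer + lm_head
print(f"{total:,} weights (~{total / 1e6:.2f}M)")
```

Under these assumptions the tally comes to 4,817,408 weights, which is consistent with the ~4.8M figure in the table. Almost all of it sits in the six transformer blocks.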

Training

| Setting | Value |
|---|---|
| Dataset | Complete works of Shakespeare (~1.1MB, ~1M characters) |
| Split | 90% train / 10% validation |
| Optimizer | AdamW (lr=3e-4) |
| Batch size | 32 sequences × 256 characters |
| Iterations | 5,000 |
| Final train loss | 1.1152 |
| Final val loss | 1.4818 |
| Hardware | Apple Silicon (MPS) |
| Time | ~25 minutes |

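The validation loss sits about 0.37 above the training loss, which points to some overfitting. Still, 1.48 is far below the ~4.2 of random guessing, so the model learned generalizable patterns, not just memorized lines.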
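
The split and batch shape in the table can be sketched in plain Python. The function names `split_train_val` and `get_batch` are hypothetical, and random window sampling is an assumption about how batches are drawn; the placeholder data stands in for the encoded corpus.

```python
import random

def split_train_val(data, train_frac=0.9):
    """First 90% for training, last 10% for validation."""
    n = int(len(data) * train_frac)
    return data[:n], data[n:]

def get_batch(data, batch_size=32, block_size=256, rng=random):
    """Sample random windows; targets are the inputs shifted by one character."""
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[s:s + block_size] for s in starts]
    y = [data[s + 1:s + block_size + 1] for s in starts]
    return x, y

# Placeholder data (assumption): 10,000 integer ids instead of the real corpus.
data = [i % 65 for i in range(10_000)]
train, val = split_train_val(data)
x, y = get_batch(train, rng=random.Random(0))

# Characters processed over the whole run, using the table's figures.
chars_per_iter = 32 * 256
total_chars = 5_000 * chars_per_iter
approx_train_chars = int(0.9 * 1_000_000)       # "~1M characters" is approximate
passes = total_chars / approx_train_chars
print(chars_per_iter, total_chars, round(passes, 1))
```

With these rough figures the run processes about 41 million characters, on the order of 45 passes over the training split.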


Loss curve

[Figure: training loss curve]

The curve shows the model going from random guessing (loss ~4.2) to genuine pattern recognition (loss ~1.1). Each step: predict the next character → measure how wrong → adjust all 4.82M numbers slightly → repeat.
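
The predict, measure, adjust, repeat cycle can be illustrated with a deliberately tiny stand-in. The sketch below is not a transformer: it is a context-free model with a single logit per character, trained by plain gradient descent on cross-entropy (the README's run uses AdamW). It also shows why the starting loss equals the log of the vocabulary size. All names here are assumptions made for the example.

```python
import math

text = "to be or not to be that is the question"
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
targets = [stoi[c] for c in text]
V = len(chars)

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_grad(logits, targets):
    """Mean cross-entropy and its gradient with respect to the logits."""
    probs = softmax(logits)                                   # predict
    loss = -sum(math.log(probs[t]) for t in targets) / len(targets)  # measure
    grad = list(probs)                                        # gradient = probs - target frequencies
    for t in targets:
        grad[t] -= 1.0 / len(targets)
    return loss, grad

logits = [0.0] * V            # all-equal logits: a uniform guess
losses = []
for step in range(200):       # repeat
    loss, grad = loss_and_grad(logits, targets)
    losses.append(loss)
    logits = [z - 1.0 * g for z, g in zip(logits, grad)]      # adjust

print(f"start {losses[0]:.3f} (ln V = {math.log(V):.3f}), end {losses[-1]:.3f}")
print(f"uniform guess over 65 characters: {math.log(65):.3f}")
```

The last line gives the natural log of 65, roughly 4.17, which matches the "~4.2" starting point of the curve for a 65-character vocabulary.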


Sample output

The way to be the gentleman king stones,
And then to be dull of the clouds of your mouth:
I'll give your daughter with you outrage,
When your slower to the cur o' the house, which shall
Follows on me.

FLORIZEL:
His good hence, I do call you thrice;
And will you not be proved with great forfeit corse,
No more to it? my noble papers lord, but he was gone.

LADY CAPULET:
And that propulate more, when many cheek,
Were it to call not foul would steal at it;
But I shall not know that I were dance

Not perfect English. But recognizably Shakespeare: character names, iambic rhythm, stage dialogue structure, period vocabulary. All learned from predicting one character at a time.
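
Text like the sample above is produced autoregressively: sample one character, append it, and feed the result back in. The sketch below shows that loop with a bigram count table standing in for the trained transformer, so its output will be far cruder than the sample. `fit_bigram_counts` and `generate` are hypothetical names, and the 256-character crop mirrors the context window from the architecture settings.

```python
import random
from collections import Counter, defaultdict

def fit_bigram_counts(text):
    """Count which character follows which (stand-in for a trained model)."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def generate(counts, start, n_chars, block_size=256, seed=0):
    """Sample one character at a time, feeding each result back as context."""
    rng = random.Random(seed)
    out = start
    for _ in range(n_chars):
        context = out[-block_size:]          # crop to the context window
        options = counts.get(context[-1])    # this stand-in only uses the last character
        if not options:
            break                            # no known continuation
        next_chars, weights = zip(*options.items())
        out += rng.choices(next_chars, weights=weights, k=1)[0]
    return out

corpus = "FLORIZEL:\nHis good hence, I do call you thrice;\n"
counts = fit_bigram_counts(corpus)
sample = generate(counts, start="F", n_chars=40)
print(sample)
```

A real transformer would condition on the whole cropped context, not only its last character, which is where the names, rhythm, and dialogue structure come from.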


The key insight

GPT-3 has 175 billion parameters. This model has 4.82 million. The core architecture is the same. The same attention mechanism. The same residual connections. The same training loop. The difference is almost entirely scale: larger embeddings, more layers, more data, more compute.

When someone says "scaling laws", this is roughly what they mean: loss falls predictably as parameters, data, and compute grow. The architecture doesn't change. Just the dials.
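
To put numbers on "just the dials", the sketch below applies a common rule of thumb, about 12 × layers × width² weights in the transformer blocks, to this model and to GPT-3's published settings (96 layers, width 12288). The helper `approx_block_params` is a hypothetical name, and the rule ignores embeddings, biases, and LayerNorm, so the results are approximations.

```python
def approx_block_params(n_layer, d_model):
    """Rough weight count of the transformer blocks: about 12 * d_model^2 per layer
    (4 * d^2 for attention, 8 * d^2 for a 4x-wide MLP); embeddings are ignored."""
    return 12 * n_layer * d_model ** 2

this_model = approx_block_params(n_layer=6, d_model=256)
gpt3 = approx_block_params(n_layer=96, d_model=12288)   # published GPT-3 175B settings

print(f"this model: ~{this_model / 1e6:.1f}M")
print(f"GPT-3:      ~{gpt3 / 1e9:.0f}B")
print(f"ratio:      ~{gpt3 / this_model:,.0f}x")
```

The same formula with two dials turned (16 times the layers, 48 times the width) accounts for a size difference of roughly 37,000x.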


Files

| File | Description |
|---|---|
| model.pt | Trained checkpoint: weights + config + vocab mappings |
| config.json | Architecture configuration |
| loss_curve.png | Training and validation loss over 5,000 steps |
| sample_output.txt | Text generated by the trained model |

Code

Full training code available on GitHub. Includes:

  • model.py: transformer architecture with detailed comments explaining every primitive
  • train.py: training loop with explanations of every concept
  • upload_to_hf.py: the upload script