Glassbox GPT: Shakespeare (char-level)
A small GPT built entirely from scratch (autograd engine, attention, the lot) and trained on character-level Tiny Shakespeare. It's tiny (10.8M parameters) and it only knows characters, yet it writes recognisable Shakespearean verse: speaker tags, archaic diction, the rhythm of a play.
This model is the first flagship of Glassbox, a project to learn modern ML in public by building small versions of everything from nothing and explaining each piece visually. The point isn't a useful model; it's a transparent one. Every line of it is understood.
What it can do (a real sample, temperature 0.8)
Good sir, what way, you will shame.
Servant:
For woman, is gone?
Clown:
He at the queen in the king is the properly
here person's earth, he would have never been
the good old of words; holding...
The journey that produced it
Each model below was built from scratch, in order; the loss is the proof that each idea helped.
| model | context it sees | val loss (↓) | what it writes |
|---|---|---|---|
| bigram (counting) | 1 character | 2.49 | gibberish |
| 1 self-attention block | 32 characters | 2.01 | word-shaped fragments |
| this model (6-layer GPT) | 256 characters | 1.48 | Shakespearean verse |
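The "bigram (counting)" row can be reproduced in spirit with a few lines of dependency-free Python. This is a minimal sketch, not the project's code: the names `fit_bigram` and `avg_nll`, the add-alpha smoothing, and the toy corpus are assumptions made for illustration. It counts which character follows which, then scores text by average negative log-likelihood in nats per character, the same kind of quantity as the loss column above.

```python
import math
from collections import defaultdict

def fit_bigram(text, alpha=1.0):
    """Count next-character frequencies and return smoothed probabilities.

    alpha is add-alpha smoothing (an assumption here) so unseen pairs
    still get a little probability mass.
    """
    vocab = sorted(set(text))
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    probs = {}
    for prev in vocab:
        total = sum(counts[prev].values()) + alpha * len(vocab)
        probs[prev] = {c: (counts[prev][c] + alpha) / total for c in vocab}
    return probs

def avg_nll(probs, text):
    """Average negative log-likelihood (nats per character) of text."""
    pairs = list(zip(text, text[1:]))
    return -sum(math.log(probs[p][n]) for p, n in pairs) / len(pairs)

# Toy corpus for illustration only; the table's 2.49 was measured on the
# Tiny Shakespeare validation split, so this number will differ.
corpus = "First Citizen:\nBefore we proceed any further, hear me speak.\n"
model = fit_bigram(corpus)
print(round(avg_nll(model, corpus), 2))
```

The model sees exactly one character of context, which is why the table lists its context as 1 character.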
The ablation: attention is the transformer
The same model was trained twice, identical in every way (same params, seed, data, steps) except for one switch: may attention look at other tokens, or only at itself?
| attention | val loss |
|---|---|
| on (causal) | 1.64 |
| off (self-only) | 2.49 (exactly the bigram baseline) |
Cripple attention and a 4-layer transformer is no better than counting letter pairs. The depth and parameters are worthless without the ability to look across tokens. That's the whole thesis of the transformer, measured.
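To make the switch concrete, here is a small sketch of the two masking modes in plain Python. It assumes the "off" setting is implemented as a diagonal-only mask; the model card does not show the actual code, and `attention_weights` and the score matrix are hypothetical. With the causal mask each position spreads its weight over itself and earlier positions; with the self-only mask the weight matrix collapses to the identity, so no information moves between tokens.

```python
import math

def attention_weights(scores, mode):
    """Row-wise softmax over the positions each query may see.

    mode="causal": position t sees positions 0..t.
    mode="self":   position t sees only itself (assumed form of the
                   ablation's "off" switch).
    """
    T = len(scores)
    weights = []
    for t in range(T):
        allowed = [s <= t if mode == "causal" else s == t for s in range(T)]
        row = [scores[t][s] if ok else float("-inf")
               for s, ok in enumerate(allowed)]
        m = max(row)  # finite, because the diagonal is always allowed
        exps = [math.exp(x - m) for x in row]  # exp(-inf) is 0.0
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights

# Made-up raw attention scores for a 3-token sequence.
scores = [[0.5, 1.0, -0.3],
          [0.2, 0.1, 0.9],
          [1.5, -0.7, 0.4]]

causal = attention_weights(scores, "causal")
self_only = attention_weights(scores, "self")

for row in self_only:
    print(row)  # each row is one-hot on its own position
```

With identity weights, the attention output at each position is just that position's own value vector. The next-character prediction can then depend only on the current character (and its position), which is consistent with the loss landing on the bigram baseline.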
Use it
Pure PyTorch, no transformers dependency. The architecture ships with the model (modeling_glassbox.py).
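```python
import sys

from huggingface_hub import snapshot_download

# Download the repo (weights + modeling_glassbox.py) and get its local path.
d = snapshot_download("uditmukherjee/glassbox-gpt-shakespeare")

# The model class lives in the downloaded repo, so make it importable.
sys.path.insert(0, d)
from modeling_glassbox import GlassboxGPT

model = GlassboxGPT.from_pretrained(d)
print(model.generate("ROMEO:\n", max_new_tokens=500, temperature=0.8))
```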
temperature controls wildness (~0.7 safe, ~1.2 unhinged). Upper-case names like ROMEO: or KING HENRY: make decent seeds.
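For intuition about what the temperature dial does, the sketch below shows the usual approach in nanoGPT-style code: divide the logits by the temperature, apply softmax, then sample. Whether `GlassboxGPT.generate` does exactly this is an assumption; `sample_with_temperature` and the logits are hypothetical and use only the standard library.

```python
import math
import random

def sample_with_temperature(logits, temperature=0.8, rng=random):
    """Divide logits by temperature, softmax, then draw one index."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return idx, probs

# Made-up scores for three candidate next characters.
logits = [2.0, 1.0, 0.1]

_, cool = sample_with_temperature(logits, temperature=0.7)
_, hot = sample_with_temperature(logits, temperature=1.2)

# Lower temperature concentrates probability on the top choice.
print(round(max(cool), 3), round(max(hot), 3))
```

A lower temperature sharpens the distribution toward the most likely character, which is why ~0.7 reads as safe; a higher one flattens it, so unlikely characters get picked more often.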
Specs & training
- Architecture: decoder-only transformer (nanoGPT-style) with token + learned positional embeddings, 6 layers × 6 heads, n_embd 384, context 256, pre-LayerNorm blocks, residual connections.
- Parameters: 10.8M · Vocab: 65 characters · Tokenizer: character-level.
- Data: Tiny Shakespeare (~1.1MB, 90/10 train/val split).
- Training: 4,000 steps, AdamW (lr 3e-4), batch 32, dropout 0.2, bf16 autocast, fixed seed 1337, on a single consumer GPU (RTX 4070 Ti SUPER, 16GB). Final train 1.12 / val 1.48; the gap is honest, visible overfitting.
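As a sanity check on the 10.8M figure, the specs above can be plugged into a back-of-the-envelope weight count. This is a rough sketch under stated assumptions, not a dump of the real model: it assumes a 4x-wide MLP, an untied output layer, and ignores biases and LayerNorm parameters. `approx_gpt_params` is a hypothetical helper.

```python
def approx_gpt_params(n_layer, n_embd, vocab_size, block_size):
    """Rough weight count for a nanoGPT-style decoder.

    Biases and LayerNorm parameters are ignored (assumption).
    """
    attn = 4 * n_embd * n_embd        # Q, K, V and output projections
    mlp = 2 * n_embd * (4 * n_embd)   # up/down projections, assumed 4x width
    blocks = n_layer * (attn + mlp)
    embeddings = vocab_size * n_embd + block_size * n_embd  # token + position
    head = n_embd * vocab_size        # assumed untied output layer
    return blocks + embeddings + head

total = approx_gpt_params(n_layer=6, n_embd=384, vocab_size=65, block_size=256)
print(f"{total / 1e6:.2f}M")  # prints 10.77M
```

Almost all of the weights sit in the six transformer blocks; with a 65-character vocabulary the embeddings and output layer are a rounding error.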
Limitations (read this)
- It's a toy, on purpose. Character-level, 10.8M params, trained only on one Shakespeare file.
- Not factual, not instructable, not safe-for-anything: it models the texture of Shakespeare, nothing more. It will produce non-words and nonsense.
- English / Early-Modern-Shakespeare only. No other domain, language, or task.
- Educational artifact. Don't deploy it for anything real.
Links
- Interactive demo: type a seed and watch it write Shakespeare live, with a temperature dial.
- Visual explainers: interactive walkthroughs of gradients, embeddings, attention, and this ablation, built alongside the model.
- Part of Glassbox, an in-public ML learning project.
License
MIT. Tiny Shakespeare is public-domain text. Built with nanoGPT as the north star.