Glassbox GPT — Shakespeare (char-level)

A small GPT built entirely from scratch — autograd engine, attention, the lot — and trained on character-level Tiny Shakespeare. It's tiny (10.8M parameters) and it only knows characters, yet it writes recognisable Shakespearean verse: speaker tags, archaic diction, the rhythm of a play.

This model is the first flagship of Glassbox — a project to learn modern ML in public by building small versions of everything from nothing and explaining each piece visually. The point isn't a useful model; it's a transparent one. Every line of it is understood.

What it can do (a real sample, temperature 0.8)

Good sir, what way, you will shame.

Servant:
For woman, is gone?

Clown:
He at the queen in the king is the properly
here person's earth, he would have never been
the good old of words; holding...

The journey that produced it

Each model below was built from scratch, in order; the loss is the proof that each idea helped.

model	context it sees	val loss (↓)	what it writes
bigram (counting)	1 character	2.49	gibberish
1 self-attention block	32 characters	2.01	word-shaped fragments
this model (6-layer GPT)	256 characters	1.48	Shakespearean verse
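For a feel of what the first row means, here is a minimal sketch of a counting bigram model and its loss. It is not the Glassbox code: the toy corpus, the add-one smoothing (`alpha`), and the helper names `train_bigram` and `bigram_loss` are all assumptions made for illustration. The loss is the mean negative log-likelihood in nats per character, the same quantity the table reports.

```python
import math
from collections import Counter, defaultdict

def train_bigram(text, vocab, alpha=1.0):
    """Count character pairs; return P(next | current) with add-alpha smoothing."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(text, text[1:]):
        counts[cur][nxt] += 1
    probs = {}
    for cur in vocab:
        total = sum(counts[cur].values()) + alpha * len(vocab)
        probs[cur] = {nxt: (counts[cur][nxt] + alpha) / total for nxt in vocab}
    return probs

def bigram_loss(probs, text):
    """Mean negative log-likelihood (nats per character) of text."""
    pairs = list(zip(text, text[1:]))
    return -sum(math.log(probs[c][n]) for c, n in pairs) / len(pairs)

# Toy stand-in corpus (an assumption); the real runs use Tiny Shakespeare.
corpus = "to be, or not to be, that is the question. " * 20
split = int(0.9 * len(corpus))  # 90/10 train/val split, as in the specs
train_text, val_text = corpus[:split], corpus[split:]
vocab = sorted(set(corpus))

probs = train_bigram(train_text, vocab)
val_loss = bigram_loss(probs, val_text)
uniform_loss = math.log(len(vocab))  # loss of guessing every character equally
print(f"bigram val loss {val_loss:.2f} vs uniform {uniform_loss:.2f}")
```

The uniform baseline is a useful reference: with 65 characters, guessing blindly costs ln(65) ≈ 4.17 nats, so 2.49 already reflects what one character of context buys.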

The ablation: attention is the transformer

Two training runs, identical in every way (same parameters, seed, data, steps) except for one switch: may attention look at earlier tokens, or only at the current one?

attention	val loss
on (causal)	1.64
off (self-only)	2.49 ← exactly the bigram baseline

Cripple attention and a 4-layer transformer is no better than counting letter pairs. The depth and parameters are worthless without the ability to look across tokens. That's the whole thesis of the transformer, measured.
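The switch can be pictured as a choice of attention mask. The sketch below is an assumption about how such a switch might look, not the Glassbox implementation; `attention_weights` and its `mode` argument are hypothetical names, and plain Python lists stand in for tensors. With the causal mask each position mixes information from everything before it; with the self-only mask the weight matrix collapses to the identity, so every position sees nothing but its own token.

```python
import math

def attention_weights(scores, mode):
    """Row-wise softmax over the positions the mask allows.

    mode="causal": position i may attend to positions 0..i.
    mode="self":   position i may attend only to itself.
    """
    T = len(scores)
    weights = []
    for i in range(T):
        if mode == "causal":
            allowed = [j <= i for j in range(T)]
        elif mode == "self":
            allowed = [j == i for j in range(T)]
        else:
            raise ValueError(f"unknown mode: {mode}")
        # Masked positions contribute zero weight.
        exps = [math.exp(scores[i][j]) if allowed[j] else 0.0 for j in range(T)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Arbitrary query-key scores for a 3-token sequence (illustrative values).
scores = [[0.1, 0.5, -0.3],
          [0.7, 0.2, 0.9],
          [-0.4, 0.6, 0.3]]

causal = attention_weights(scores, "causal")
self_only = attention_weights(scores, "self")

for row in causal:
    print([round(w, 3) for w in row])
for row in self_only:
    print(row)
```

Once the weights are the identity, no layer can ever move information between positions, so the prediction at each step depends on the current character alone. That is the same context a bigram model has, which is consistent with the matching 2.49.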

Use it

Pure PyTorch — no transformers dependency. The architecture ships with the model (modeling_glassbox.py).

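import sys

from huggingface_hub import snapshot_download

# Download the repo (weights plus modeling_glassbox.py) into the local cache.
d = snapshot_download("uditmukherjee/glassbox-gpt-shakespeare")

# The architecture ships with the model, so make the snapshot importable.
sys.path.insert(0, d)
from modeling_glassbox import GlassboxGPT

model = GlassboxGPT.from_pretrained(d)
print(model.generate("ROMEO:\n", max_new_tokens=500, temperature=0.8))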

temperature controls wildness (≈0.7 safe, ≈1.2 unhinged). Upper-case speaker names like ROMEO: or KING HENRY: make decent seeds.
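What temperature does to the next-character distribution can be sketched in a few lines. This is the standard technique of dividing logits by the temperature before the softmax; whether `generate` does exactly this internally is an assumption, and the function names and example logits here are hypothetical.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature, rng):
    """Draw one index from the temperature-adjusted distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Made-up logits for a 3-character vocabulary (illustrative only).
logits = [2.0, 1.0, 0.0]
cool = softmax_with_temperature(logits, 0.7)  # sharper: favours the top choice
hot = softmax_with_temperature(logits, 1.2)   # flatter: rare characters get a chance

rng = random.Random(1337)
picked = sample_next(logits, 0.8, rng)
print([round(p, 3) for p in cool], [round(p, 3) for p in hot], picked)
```

Low temperature concentrates probability on the likeliest character, which is why it reads as safe; high temperature spreads it out, which is where the non-words come from.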

Specs & training

Architecture: decoder-only transformer (nanoGPT-style) — token + learned positional embeddings, 6 layers × 6 heads, n_embd 384, context 256, pre-LayerNorm blocks, residual connections.
Parameters: 10.8M · Vocab: 65 characters · Tokenizer: character-level.
Data: Tiny Shakespeare (~1.1MB, 90/10 train/val split).
Training: 4,000 steps, AdamW (lr 3e-4), batch 32, dropout 0.2, bf16 autocast, fixed seed 1337, on a single consumer GPU (RTX 4070 Ti SUPER, 16GB). Final train 1.12 / val 1.48 — the gap is honest, visible overfitting.
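The 10.8M figure can be sanity-checked from the specs with a little arithmetic. The sketch below assumes details the card does not state: a 4× MLP expansion, bias terms on every linear layer and LayerNorm, and an output head that is not tied to the token embedding. Under those assumptions the count lands at about 10.8M; a different choice of biases or weight tying would shift it slightly.

```python
# Values from the specs above.
n_layer = 6
n_embd = 384
block_size = 256   # context length
vocab_size = 65

# Assumption: MLP hidden width is 4x the embedding width (nanoGPT-style).
mlp_hidden = 4 * n_embd

def linear(n_in, n_out, bias=True):
    """Parameter count of a dense layer."""
    return n_in * n_out + (n_out if bias else 0)

def layer_norm(n):
    """Scale and shift vectors (assumes LayerNorm with bias)."""
    return 2 * n

tok_emb = vocab_size * n_embd
pos_emb = block_size * n_embd

# Attention: fused query/key/value projection plus output projection.
# The number of heads only splits n_embd, so it does not change the count.
attn = linear(n_embd, 3 * n_embd) + linear(n_embd, n_embd)
mlp = linear(n_embd, mlp_hidden) + linear(mlp_hidden, n_embd)
block = attn + mlp + 2 * layer_norm(n_embd)

final_ln = layer_norm(n_embd)
lm_head = linear(n_embd, vocab_size)  # assumption: not tied to tok_emb

total = tok_emb + pos_emb + n_layer * block + final_ln + lm_head
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Almost all of the parameters sit in the six blocks; the embeddings and output head together are barely 1% of the total, a side effect of the 65-character vocabulary.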

Limitations (read this)

It's a toy, on purpose. Character-level, 10.8M params, trained only on one Shakespeare file.
Not factual, not instructable, not safe-for-anything — it models the texture of Shakespeare, nothing more. It will produce non-words and nonsense.
English / Early-Modern-Shakespeare only. No other domain, language, or task.
Educational artifact. Don't deploy it for anything real.

License

MIT. Tiny Shakespeare is public-domain text. Built with nanoGPT as the north star.

Safetensors · Model size: 10.8M params · Tensor type: F32
