Model Card β€” shakespeare-nanogpt-2 (v2)

A sup computer release β€” a small language model studio. Model page Β· monorepo (frozen code: projects/shakespeare/models/shakespeare-nanogpt-2/, tag shakespeare-nanogpt-2) Β· run it in your terminal with the repo's sup CLI.

Key takeaways

  • The improved model: full corpus + modern architecture (RoPE, RMSNorm, bias-free) + GPT-2 BPE, reaching held-out BPC 1.919 (βˆ’20% vs. the v1 baseline).
  • The best of Experiment 01's four rounds (Round 3, early-stopped BPE). The Round 4 "champion" that stacked more regularization regressed.
  • Still mimicry, and scores are single-seed β€” data is the ceiling the next version will have to raise.

The current best model in the shakespeare-nanogpt series: full corpus + modern architecture + BPE tokenizer, held-out BPC 2.395 β†’ 1.919. It is the winner of a four-round LLM-assisted research experiment in which Claude Opus 4.8 acted as the researcher β€” diagnosing, changing, retraining, and measuring β€” while this small model was the thing being improved. This is the precursor stage to recursive self-improvement, not RSI itself. v2 is Round 3 of that experiment.

Series note. Successor to shakespeare-nanogpt-1. Both versions and the full story are in MODELS.md; the write-up is Experiment 01 (report index: research-docs/reports/) and the scoreboard is leaderboard.md.

Model details

Version / git tag shakespeare-nanogpt-2
Origin LLM-assisted research experiment, Round 3 (projects/shakespeare/runs/r3-bpe)
Architecture modern β€” RoPE, RMSNorm, bias-free (core/nanogpt_core/model.py)
Size ~29.9M params
Tokenizer GPT-2 byte-pair encoding (~50k vocab)
Checkpoint models/shakespeare-nanogpt-2/ckpt.pt (weights not committed β€” rebuild below)
Built on nanoGPT by Andrej Karpathy (MIT)
Developed with Claude Opus 4.8 (Claude Code) as researcher, human oversight
License MIT

Intended use

Same as v1 β€” a learning project, here additionally demonstrating LLM-assisted model development (a large model improving a small one, measured honestly). Generated text is higher-quality Shakespeare-styled mimicry than v1, but still not coherent or factual.

Out of scope: real use of the text; any presentation of output as genuine Shakespeare or as fact. No instruction following, no safety tuning.

Training data

The Complete Works of Shakespeare (~5 MB), ~5Γ— the data of v1's Tiny Shakespeare. Prepared as GPT-2 BPE tokens (projects/shakespeare/data/shakespeare_full_bpe/). A fixed 250k-character held-out test set (projects/shakespeare/test.txt) that no model trains on is used for evaluation.

Training procedure

Trained with core/nanogpt_core/train.py on the same Apple Silicon MPS setup as v1 (~20 min). Note: the model overfits β€” validation loss bottomed around step ~1000 then rose; the save-best-val policy automatically kept the early, best checkpoint.

Evaluation

Scored on the fixed held-out test in bits-per-character (BPC) β€” a tokenizer-agnostic metric, so char-level and BPE models are directly comparable. Lower is better.

Three levers worked; the fourth backfired. The bars step down through Round 3, then tick back up at Round 4.

Test BPC across baseline and four research rounds

Round Change Test BPC Worked?
β€” 1MB control (data-starved baseline) 2.395 β€”
1 5Γ— more data (full Complete Works) 2.036 yes, βˆ’15%
2 Modern architecture (RoPE + RMSNorm + bias-free) 2.004 yes, βˆ’1.6%
3 GPT-2 BPE tokenizer 1.919 πŸ† yes, βˆ’4.3%
4 "Champion" (+ dropout 0.3 + 4000 iters) 1.947 no β€” regressed

End to end: BPC 2.395 β†’ 1.919, a 20% reduction. v2 is Round 3, which already combines all three productive levers.

Researcher efficiency

We also tracked Claude's own token cost per round, to ask how much intelligence each unit of improvement cost. Round 1 (fixing the data bottleneck) paid off hugely; later rounds returned far less for similar effort, and Round 4 spent the most and went backwards.

BPC reduction per 100K Claude tokens, by round

Charts generated by dataviz/.

Limitations

  • Data-bottlenecked. Round 4 added dropout + longer training to push past R3 and regressed (1.919 β†’ 1.947), samples collapsing into "Romeo, Romeo" loops. You cannot regularize away too little data β€” this is the most instructive result, and v2's ceiling is data.
  • Overfits despite the larger corpus (see Training procedure).
  • Single-seed measurements. Every score comes from one training run at a fixed seed (1337), with no variance estimate. The two smallest table deltas β€” the modern-architecture βˆ’1.6% and the champion's +1.5% regression β€” are within plausible run-to-run noise and should be read as directionally uncertain; the large gains (more data, BPE) are well outside any plausible noise. The seed is now a recorded --seed knob, and future versions report multiple seeds.
  • Still mimicry: more fluent than v1, but no meaning, knowledge, or factuality; no instruction following or safety tuning.

How to use

# self-contained v2 folder (weights are gitignored β€” rebuild them)
cd models/shakespeare-nanogpt-2
python prepare.py     # downloads the Complete Works, BPE-encodes it here
python train.py       # -> ./ckpt.pt
python eval.py        # score on the shared held-out test (expect BPC ~1.919)
python sample.py --start="ROMEO:"

Citation / credits

  • nanoGPT by Andrej Karpathy (MIT) β€” model + training code.
  • The Complete Works of Shakespeare (public domain).
  • LLM-assisted research experiment run with Claude Opus 4.8 (Claude Code) as researcher, human oversight.

Addendum β€” June 2026

Added in the site-standardization pass (ADR-0015). The card above was unchanged by that pass; this is a tracked addendum. (A later house-style pass, 2026-07-04, edited the card's prose β€” emphasis and sentence structure only, no facts.) Site-wide fixes β€” repo links now resolve to GitHub/site routes, code blocks render within the column β€” apply automatically.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support