Model Card: kenosha-kid-nanogpt-2 (v2)
A sup computer release (a small language model studio). Model page · monorepo (frozen code:
projects/kenosha-kid/models/kenosha-kid-nanogpt-2/, tag kenosha-kid-nanogpt-2) · runs in your browser at www.supcpu.com/model-player.
Key takeaways
- A 0.79M-param char-level model on the same six words as v1 ("You never did the Kenosha Kid.") but trained on a self-drifting corpus: the permutation tail carries a controlled per-letter misspelling channel (DRIFT_RATE=0.06) while Pynchon's nine anchors stay pristine.
- This decouples the two dream qualities. Fully converged (val ~0.65, 1100 iters) the model still reproduces all 9 anchors verbatim (9/9) and carries a near-miss in ~33% of lines: the crisp-anchors-AND-near-misses combination v1 structurally could not reach.
- DRIFT_RATE is the new dial: heavier drift buys more near-misses at the cost of a little garble and an anchor. v1 got near-misses only by undertraining, which also blurred the anchors.
A character-level GPT whose entire universe is six words: "you never did the kenosha kid", the telegram Tyrone Slothrop reconstrues under sodium amytal in Pynchon's Gravity's Rainbow (I.10), and the seed of Darius Kazemi's @YouNeverDidThe bot. Like v1 it orbits the phrase rather than enumerating it. But v2 answers the open question v1's report left behind: can a converged model dream? v1 could not. Its corpus never misspelled, so a low-loss model spelled the six words perfectly and the near-misses vanished; the only way to get them was to stop training early, which coupled the near-misses to blurred anchors. v2 moves the drift into the data and breaks that coupling.
A self-drifting corpus. generate.py bakes a per-letter misspelling channel (adjacent swap, doubling, drop, substitution) into the permutation tail only, at DRIFT_RATE=0.06. The nine Pynchon anchors are never drifted. Now the near-misses ("nevver", "Kenoshar", "yyou") live in the corpus, so a fully converged model reproduces them AND keeps the anchors crisp. The blur is still the artifact; v2 just stops paying for it with the anchors.
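The channel described above can be sketched roughly as follows. This is an illustration, not the code in generate.py: the function names, the uniform choice among the four edits, the substitution alphabet, and the two-entry ANCHORS set (taken from the samples in this card, not the full nine) are all assumptions.

```python
import random
import string

DRIFT_RATE = 0.06  # value quoted in this card

# Hypothetical subset of the anchors, used only to show the pass-through rule.
ANCHORS = {
    "You, Never? Did the Kenosha Kid?",
    "You never did the Kenosha Kid",
}


def drift_text(text: str, rate: float, rng: random.Random) -> str:
    """Edit each alphabetic character with probability `rate`."""
    if rate <= 0.0:
        return text  # early exit: no RNG draws at rate 0
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < rate:
            edit = rng.choice(("swap", "double", "drop", "substitute"))
            if edit == "swap" and i + 1 < len(chars) and chars[i + 1].isalpha():
                out.extend([chars[i + 1], c])  # adjacent swap inside a word
                i += 2
                continue
            if edit == "double":
                out.extend([c, c])
            elif edit == "drop":
                pass  # emit nothing for this letter
            elif edit == "substitute":
                out.append(rng.choice(string.ascii_lowercase))
            else:
                out.append(c)  # swap requested at a word boundary: keep the letter
        else:
            out.append(c)
        i += 1
    return "".join(out)


def drift_line(line: str, rate: float, rng: random.Random) -> str:
    """Anchors pass through untouched; tail lines go through the channel."""
    if line in ANCHORS:
        return line
    return drift_text(line, rate, rng)
```

Punctuation and spaces are never edited in this sketch, so the drift stays inside the words, which matches the near-misses quoted in this card.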
Model details
| | |
|---|---|
| Version / git tag | kenosha-kid-nanogpt-2 (research run drift-r1) |
| Architecture | modern char-level (RoPE, RMSNorm, bias-free) on the shared core engine; no vendored base engine (ADR-0012) |
| Size | 4 layers · 4 heads · 128 embedding dim · 128 context · dropout 0.2 · ~0.79M params |
| Tokenizer | character-level, 39-char vocabulary (vs v1's 27: the drift channel's substitutions introduce the full lowercase alphabet; direct char-to-int lookup via meta.pkl, no BPE) |
| Checkpoint | projects/kenosha-kid/models/kenosha-kid-nanogpt-2/ (weights not committed; regenerates deterministically, below) |
| Built on | the monorepo's shared core engine |
| Developed with | Claude (Claude Code) |
| License | MIT |
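As a sanity check on the ~0.79M figure in the Size row, the count can be rebuilt from the table under a few assumptions that this card does not confirm: a standard non-gated MLP with 4x expansion, an input embedding tied to the output head, and RoPE contributing no learned parameters. The core engine may differ in any of these.

```python
n_layer, d_model, vocab_size = 4, 128, 39  # from the table above

attn = 4 * d_model * d_model        # Q, K, V and output projections, bias-free
mlp = 2 * d_model * (4 * d_model)   # assumed 4x up- and down-projection
norms_per_block = 2 * d_model       # two RMSNorm gain vectors per block
block = attn + mlp + norms_per_block

embedding = vocab_size * d_model    # assumed tied with the output head
final_norm = d_model

total = n_layer * block + embedding + final_norm
print(f"{total:,} parameters (~{total / 1e6:.2f}M)")  # 792,576 parameters (~0.79M)
```

Almost all of the weight sits in the four transformer blocks; the 39-character embedding accounts for fewer than 5K parameters.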
Intended use
An exhibit / curio, not a capable language model. Specifically, a demonstration that the aesthetic objective here is inverted: dreaminess is the point, not low loss. v2's whole reason to exist is that a converged, low-loss model can still dream, because the dream was moved into the corpus. Sampled at temperature ~0.9 (the default "dream" setting) and given only a newline, it orbits the phrase: all nine anchors surface verbatim, the tail drifts through punctuated permutations, and near-misses leak in on roughly a third of lines.
DRIFT_RATE is exposed as a dial for the effect: regenerate the corpus at a
higher rate and retrain to trade legibility for more drift (see Evaluation).
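For readers unfamiliar with the temperature setting mentioned above, here is a minimal sketch of temperature sampling over next-character logits. It is generic, not the project's sample.py; the function names and the toy logits are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature, rng):
    """Draw one index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three characters
for t in (0.5, 0.9, 1.5):
    print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])
```

Temperatures below 1 sharpen the distribution toward the most likely character and temperatures above 1 flatten it, so 0.9 stays close to the learned distribution while slightly favoring likely characters.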
Out of scope. This is explicitly not a general-purpose language model. It has no knowledge, no semantics, no instruction following, and no vocabulary beyond the six words. The near-misses are the feature; do not read its output as information.
Training data
A synthetic, in-repo corpus generated by generate.py, a deterministic
reimplementation of Kazemi's bot. We own the generator rather than scraping the bot, so
the corpus is frozen and inspectable. More to the point, owning it lets us weight
the corpus and, as of v2, drift it. Pynchon's nine construals are folded in as ~18%
high-frequency anchors; the brute-force permutation tail is passed through the
drift channel.
- 24,000 lines / ~797K chars, seeded deterministically (SEED=1973; the drift stream uses an independent derived RNG, SEED+1000).
- The drift channel (DRIFT_RATE=0.06). A per-alphabetic-character probability of one of four edits: adjacent swap, doubling, drop, substitution. At 0.06 it perturbs ~74% of tail lines with at least one edit while keeping most words legible. The anchors are never touched: Pynchon's nine construals stay pristine and verbatim, which is exactly what lets crisp anchors and abundant near-misses coexist in one converged model.
- Deterministic and reversible. At DRIFT_RATE=0.0 the corpus regenerates byte-for-byte identical to v1's pristine corpus (drift consumes no RNG when the rate is 0), so the two rounds share a provenance and the dial is clean.
- Gravity's Rainbow is the anchor source, never training text. The novel is copyrighted; we train on permutations of a six-word phrase plus original construals, never Pynchon's prose (same posture as projects/gatsby/).
- The corpus is committed (vendored into the frozen folder as raw.txt): a research project records its data. Only derived artifacts (*.bin, *.pkl, *.pt) are gitignored.
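The ~74% figure above can be cross-checked with a back-of-envelope estimate. The sketch assumes each letter is edited independently at DRIFT_RATE and that a typical tail line carries the 24 letters of the canonical phrase; real tail lines repeat or omit words, and some edits may leave a line unchanged, so this is only a rough bound.

```python
def p_line_touched(rate: float, n_letters: int) -> float:
    """Probability that at least one of n_letters independent draws fires."""
    return 1.0 - (1.0 - rate) ** n_letters


CANON = "You never did the Kenosha Kid."
n_letters = sum(c.isalpha() for c in CANON)  # 24 letters

for rate in (0.0, 0.06, 0.14):
    print(f"rate={rate:.2f}  P(at least one edit) = {p_line_touched(rate, n_letters):.3f}")
```

Under these assumptions the estimate at 0.06 comes out at roughly 77%, in the same range as the reported ~74%; the sketch does not attempt to explain the gap.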
Training procedure
- Optimizer: AdamW, LR 1e-3 with cosine decay to 1e-4, 30 warmup iters, β₂ 0.99, batch size 64, dropout 0.2.
- Run: 1100 iterations (converged, past the ~700 plateau), best val loss ~0.65.
- On the higher val floor. v2's loss floor (~0.65) sits above v1's (0.43) on purpose: the injected drift is genuine entropy the model cannot fully fit, so "converged" here means plateaued on its own corpus, not low absolute loss. That is the point: the near-misses are learned structure, not undertraining.
- Hardware: Apple Silicon Mac (MPS / Metal backend), torch.compile disabled.
- Wall-clock: a few minutes (the corpus is small).
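The learning-rate schedule in the list above can be sketched as below. The peak, floor, and warmup length come from this card; the linear warmup shape and a decay horizon equal to the 1100-iteration run are assumptions, and the actual training config may differ.

```python
import math

MAX_LR, MIN_LR = 1e-3, 1e-4            # from this card
WARMUP_ITERS, DECAY_ITERS = 30, 1100   # decay horizon is an assumption


def get_lr(it: int) -> float:
    """Linear warmup, then cosine decay from MAX_LR down to MIN_LR."""
    if it < WARMUP_ITERS:
        return MAX_LR * (it + 1) / (WARMUP_ITERS + 1)
    if it >= DECAY_ITERS:
        return MIN_LR
    progress = (it - WARMUP_ITERS) / (DECAY_ITERS - WARMUP_ITERS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return MIN_LR + coeff * (MAX_LR - MIN_LR)


for it in (0, 29, 30, 565, 1099, 1100):
    print(it, f"{get_lr(it):.6f}")
```

With these numbers the rate peaks at iteration 30, passes the midpoint of its range halfway through the decay, and sits at the 1e-4 floor by the final iteration.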
Evaluation
The metric is the qualitative dream, now measured rather than eyeballed.
eval_dream.py samples the checkpoint warm (temperature 0.9, ~430 lines) and
reports two things at once: anchor recall (the fraction of lines that verbatim match one of
the nine anchors, and how many of the nine are covered) and a near-miss / garble
breakdown (per word, edit distance to the six canon words: 1-2 = near-miss, ≥3 =
garble). The comparison against v1's checkpoints is the whole story:
| run | corpus | iters | val | anchor_hit | anchors covered | near-miss lines | garble lines | reading |
|---|---|---|---|---|---|---|---|---|
| r1 (v1, converged) | pristine | 2000 | 0.43 | 0.225 | 9/9 | 0.000 | 0.000 | crisp anchors, no near-misses |
| r3-mid (v1 champion) | pristine | 350 | 0.48 | 0.042 | 3/9 | 0.037 | 0.012 | near-misses only by undertraining, which couples them to blurred anchors |
| drift-r1 (v2) | drift 0.06 | 1100 | 0.65 | 0.138 | 9/9 | 0.331 | 0.002 | crisp anchors AND abundant near-misses (the win) |
| drift-r2 | drift 0.14 | 1100 | 0.85 | 0.131 | 8/9 | 0.592 | 0.035 | heavier drift: more near-miss, some garble, one lost anchor |
The v2 release (drift-r1) covers all nine anchors verbatim while carrying a
near-miss on ~33% of lines with near-zero garble: ~9× the champion's
near-miss rate, plus full anchor coverage, which the champion (3/9) never had.
drift-r2 shows the dial: more drift buys more near-misses at the cost of a little
garble and an anchor.
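The per-word near-miss / garble rule from the evaluation description can be sketched as follows. This is not eval_dream.py. The sketch assumes the optimal-string-alignment variant of edit distance, which counts an adjacent swap as one edit (under plain Levenshtein a swap costs two), and it assumes a line takes the label of its worst word. Both choices are guesses about the real scorer.

```python
CANON = ("you", "never", "did", "the", "kenosha", "kid")


def osa_distance(a: str, b: str) -> int:
    """Edit distance with insert, delete, substitute and adjacent transposition."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def classify_word(word: str) -> str:
    """Label one token by its distance to the nearest canon word."""
    letters = "".join(c for c in word.lower() if c.isalpha())
    if not letters:
        return "empty"  # pure punctuation
    dist = min(osa_distance(letters, w) for w in CANON)
    if dist == 0:
        return "canon"
    return "near-miss" if dist <= 2 else "garble"


def classify_line(line: str) -> str:
    """A line is labelled by its worst word (assumed aggregation rule)."""
    labels = {classify_word(w) for w in line.split()}
    if "garble" in labels:
        return "garble"
    if "near-miss" in labels:
        return "near-miss"
    return "clean"


for sample in ("You never did the Kenosha Kid.", "Kneoshaa diid Kid the you. Neeer"):
    print(classify_line(sample), "<-", sample)
```

Case and punctuation are stripped before scoring, so a punctuated permutation of the six words counts as clean and only misspellings move a line into the near-miss or garble bucket.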
Representative samples (raw, uncherry-picked, temperature 0.9, from
projects/kenosha-kid/runs/drift-samples.md):
    You, Never? Did the Kenosha Kid?
    You never did 'tthe,' Kenosha Kid!
    Did never Kenosha kid the yyou?
    iDd you the Kenosha never did
    You never did the Kenosha Kid
    Kneoshaa diid Kid the you. Neeer
    Kenoshha you did Kid 'never', never?
    You never did the Kenosha Kid.
Verbatim anchors ("You, Never? Did the Kenosha Kid?", "You never did the Kenosha Kid") sit right next to near-misses ("tthe", "yyou", "iDd", "Kneoshaa", "diid", "Neeer", "Kenoshha"), all in the same converged model.
A comparison chart (v2 vs v1 baselines: anchor-coverage and near-miss line-rate as
DRIFT_RATE climbs 0.0 → 0.06 → 0.14) would make the decoupling and the dial legible at a glance. It is not authored here: charts go through the tools/dataviz/ pipeline; this card only describes it.
Limitations
Honest about what it is:
- It says nothing but the six words. No semantics, no factual grounding, no instruction following: it is a next-character predictor over one phrase.
- The drift is in the data, so it is bounded by the data. v2 dreams near-misses because the corpus contains them; it cannot invent drift the generator never emitted. DRIFT_RATE is the only handle on how much and how wild.
- Higher drift trades away legibility. Push DRIFT_RATE up (see drift-r2) and garble rises and anchors start to fall: the sweet spot at 0.06 is a choice, not a free lunch.
- Loss is not the objective, and it reads worse than v1. v2's val floor (0.65) is higher than v1's (0.43) by design; comparing the two on loss inverts their quality. The dream-score, not perplexity, is the yardstick.
- No weights in the tree (ADR-0002). The released folder ships code + corpus only; the checkpoint regenerates deterministically from config.py.
How to reproduce
The frozen, self-contained snapshot rebuilds the checkpoint deterministically (the corpus is vendored in-folder, no network needed):
    cd projects/kenosha-kid/models/kenosha-kid-nanogpt-2
    python generate.py          # (optional) rewrites raw.txt identically (DRIFT_RATE=0.06)
    python prepare.py           # raw.txt -> kenosha/{train,val}.bin + meta.pkl
    python train.py config.py   # -> ./ckpt.pt (converged, 1100 iters, val ~0.65)
    python sample.py --out_dir=. --data_root=. --device=cpu --start=$'\n' --temperature=0.9
    python eval_dream.py --device=cpu --num_samples=40   # the dream-score
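For orientation, the prepare step in the commands above amounts to a character-level encoding. The sketch below shows the idea in memory only; the stoi / itos names and the 90/10 split follow the common nanoGPT char-level convention and are assumptions about this repo's prepare.py, and the two-line corpus is a stand-in for raw.txt.

```python
def build_vocab(text: str):
    """Map every distinct character to an integer id and back."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(text: str, stoi: dict) -> list:
    return [stoi[ch] for ch in text]


def decode(ids: list, itos: dict) -> str:
    return "".join(itos[i] for i in ids)


# Stand-in corpus: two lines quoted in this card.
corpus = "You never did the Kenosha Kid.\nKenoshha you did Kid 'never', never?\n"

stoi, itos = build_vocab(corpus)
ids = encode(corpus, stoi)

split = int(0.9 * len(ids))  # assumed 90/10 train/val split
train_ids, val_ids = ids[:split], ids[split:]

print("vocab size:", len(stoi))
print("train / val tokens:", len(train_ids), len(val_ids))
```

Because the lookup is a direct character-to-id table, the vocabulary is exactly the set of characters present in the corpus.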
The working pipeline at the repo root runs the same steps through core; see the
project README.md and the v1 write-up
dream-a-single-phrase.md, whose closing
line, "a corpus that itself drifts", this model implements.
Citation / credits
- The shared core engine (modern nanoGPT lineage: RoPE, RMSNorm, bias-free).
- Darius Kazemi, @YouNeverDidThe (2013): the bot generate.py reimplements deterministically.
- Thomas Pynchon, Gravity's Rainbow (1973), I.10: the nine construals are the anchors; the phrase is reproduced as a behavior, not its text. Provenance in projects/kenosha-kid/docs/sources.md.
- Set up and trained with Claude (Claude Code).