Model Card: kenosha-kid-nanogpt-2 (v2)

A sup computer release (a small language model studio). Model page · monorepo (frozen code: projects/kenosha-kid/models/kenosha-kid-nanogpt-2/, tag kenosha-kid-nanogpt-2) · runs in your browser at www.supcpu.com/model-player.

Key takeaways

  • A 0.79M-param char-level model on the same six words as v1, "You never did the Kenosha Kid.", but trained on a self-drifting corpus: the permutation tail carries a controlled per-letter misspelling channel (DRIFT_RATE=0.06) while Pynchon's nine anchors stay pristine.
  • This decouples the two dream qualities. Fully converged (val ~0.65, 1100 iters), the model still reproduces all 9 anchors verbatim (9/9) and carries a near-miss in ~33% of lines: the crisp-anchors-AND-near-misses combination v1 structurally could not reach.
  • DRIFT_RATE is the new dial: heavier drift buys more near-misses at the cost of a little garble and an anchor. v1 got near-misses only by undertraining, which also blurred the anchors.

A character-level GPT whose entire universe is six words: you never did the kenosha kid, the telegram Tyrone Slothrop reconstrues under sodium amytal in Pynchon's Gravity's Rainbow (I.10), and the seed of Darius Kazemi's @YouNeverDidThe bot. Like v1, it orbits the phrase rather than enumerating it. But v2 answers the open question v1's report left behind: can a converged model dream? v1 could not. Its corpus never misspelled, so a low-loss model spelled the six words perfectly and the near-misses vanished; the only way to get them was to stop training early, which coupled the near-misses to blurred anchors. v2 moves the drift into the data and breaks that coupling.

A self-drifting corpus. generate.py bakes a per-letter misspelling channel (adjacent swap, doubling, drop, substitution) into the permutation tail only, at DRIFT_RATE=0.06. The nine Pynchon anchors are never drifted. Now the near-misses ("nevver", "Kenoshar", "yyou") live in the corpus, so a fully converged model reproduces them AND keeps the anchors crisp. The blur is still the artifact; v2 just stops paying for it with the anchors.
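The four edits named above can be sketched as a small per-character channel. This is a minimal illustration, not the code in generate.py: the function name drift_line, the uniform choice among the four edits, and the handling of a swap at a word boundary are all assumptions. Only DRIFT_RATE=0.06, SEED=1973, the SEED+1000 drift stream, and the four edit types come from this card.

```python
import random
import string

DRIFT_RATE = 0.06   # from the card
SEED = 1973         # from the card; the drift stream is seeded with SEED + 1000


def drift_line(line, rate, rng):
    """Hypothetical drift channel: perturb alphabetic characters with probability `rate`."""
    if rate <= 0.0:
        return line  # draws nothing from the RNG, so a zero rate leaves the stream untouched
    chars = list(line)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < rate:
            # Assumption: the four edits are equally likely.
            edit = rng.choice(("swap", "double", "drop", "substitute"))
            if edit == "swap" and i + 1 < len(chars) and chars[i + 1].isalpha():
                out.extend((chars[i + 1], c))  # adjacent swap consumes two characters
                i += 2
                continue
            if edit == "double":
                out.extend((c, c))
            elif edit == "drop":
                pass  # emit nothing
            elif edit == "substitute":
                out.append(rng.choice(string.ascii_lowercase))
            else:
                out.append(c)  # swap was not possible here; keep the character
        else:
            out.append(c)  # punctuation, spaces, and untouched letters pass through
        i += 1
    return "".join(out)


drift_rng = random.Random(SEED + 1000)
for _ in range(3):
    print(drift_line("You never did the Kenosha Kid.", DRIFT_RATE, drift_rng))
```

In this sketch the anchors would simply never be passed through drift_line, which is how they stay verbatim while the tail drifts.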

Model details

Version / git tag: kenosha-kid-nanogpt-2 (research run drift-r1)
Architecture: modern char-level (RoPE, RMSNorm, bias-free) on the shared core engine; no vendored base engine (ADR-0012)
Size: 4 layers · 4 heads · 128 embedding dim · 128 context · dropout 0.2 · ~0.79M params
Tokenizer: character-level, 39-char vocabulary (vs v1's 27; the drift channel's substitutions introduce the full lowercase alphabet; direct char↔int lookup via meta.pkl, no BPE)
Checkpoint: projects/kenosha-kid/models/kenosha-kid-nanogpt-2/ (weights not committed; regenerates deterministically, below)
Built on: the monorepo's shared core engine
Developed with: Claude (Claude Code)
License: MIT
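As a sanity check on the ~0.79M figure, the parameter count can be reproduced from the sizes listed above. The sketch assumes a conventional layout that this card does not spell out: a 4x MLP expansion, two RMSNorm gains per block plus a final one, and an output head tied to the token embedding. RoPE adds no learned parameters. The function name count_params is illustrative.

```python
def count_params(n_layer=4, n_embd=128, vocab_size=39, mlp_mult=4, tied=True):
    """Parameter count for a bias-free GPT block stack (layout is an assumption)."""
    attn = 4 * n_embd * n_embd             # q, k, v and output projections, no biases
    mlp = 2 * mlp_mult * n_embd * n_embd   # up and down projections
    norms = 2 * n_embd                     # two RMSNorm gain vectors per block
    per_block = attn + mlp + norms
    embed = vocab_size * n_embd            # token embedding table
    head = 0 if tied else vocab_size * n_embd
    final_norm = n_embd
    return n_layer * per_block + embed + head + final_norm


total = count_params()
print(f"{total:,} parameters (~{total / 1e6:.2f}M)")  # 792,576 parameters (~0.79M)
```

Under these assumptions the total is 792,576, which rounds to the stated ~0.79M. An untied head would add another 4,992 parameters.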

Intended use

An exhibit / curio, not a capable language model. Specifically, a demonstration that the aesthetic objective here is inverted: dreaminess is the point, not low loss. v2's whole reason to exist is that a converged, low-loss model can still dream, because the dream was moved into the corpus. Sampled at temperature ~0.9 (the default "dream" setting) and given only a newline, it orbits the phrase: all nine anchors surface verbatim, the tail drifts through punctuated permutations, and near-misses leak in on roughly a third of lines.

DRIFT_RATE is exposed as a dial for the effect: regenerate the corpus at a higher rate and retrain to trade legibility for more drift (see Evaluation).

Out of scope. This is explicitly not a general-purpose language model. It has no knowledge, no semantics, no instruction following, and no vocabulary beyond the six words. The near-misses are the feature; do not read its output as information.

Training data

A synthetic, in-repo corpus generated by generate.py, a deterministic reimplementation of Kazemi's bot. We own the generator rather than scraping the bot's output, so the corpus is frozen and inspectable. The real reason: owning it lets us weight it and, now, drift it. Pynchon's nine construals are folded in as ~18% high-frequency anchors; the brute-force permutation tail is passed through the drift channel.

  • 24,000 lines / ~797K chars, seeded deterministically (SEED=1973; the drift stream uses an independent derived RNG, SEED+1000).
  • The drift channel (DRIFT_RATE=0.06). A per-alphabetic-character probability of one of four edits: adjacent swap, doubling, drop, substitution. At 0.06 it perturbs ~74% of tail lines with at least one edit while keeping most words legible. The anchors are never touched. Pynchon's nine construals stay pristine and verbatim, which is exactly what lets crisp anchors and abundant near-misses coexist in one converged model.
  • Deterministic and reversible. At DRIFT_RATE=0.0 the corpus regenerates byte-for-byte identical to v1's pristine corpus (drift consumes no RNG when the rate is 0), so the two rounds share a provenance and the dial is clean.
  • Gravity's Rainbow is the anchor source, never training text. The novel is copyrighted; we train on permutations of a six-word phrase plus original construals, never Pynchon's prose (same posture as projects/gatsby/).
  • The corpus is committed (vendored into the frozen folder as raw.txt), because a research project records its data. Only derived artifacts (*.bin, *.pkl, *.pt) are gitignored.
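A back-of-envelope check on the share of tail lines the drift channel touches: if each alphabetic character is edited independently with probability DRIFT_RATE, a line with n letters is touched with probability 1 - (1 - rate)^n. The sketch below applies that to the 24 letters of the canonical phrase. The independence assumption and the fixed letter count are simplifications, and the function name p_line_touched is illustrative.

```python
def p_line_touched(rate, n_letters):
    """Probability that at least one of n_letters independent per-letter edits fires."""
    return 1.0 - (1.0 - rate) ** n_letters


CANON = "you never did the kenosha kid"
n = sum(ch.isalpha() for ch in CANON)  # 24 alphabetic characters

for rate in (0.0, 0.06, 0.14):
    print(f"rate={rate:.2f}  P(at least one edit) = {p_line_touched(rate, n):.3f}")
```

At 0.06 this estimate comes out at about 0.77, in the same range as the ~74% reported above. The small gap would be consistent with some edits being no-ops or with tail lines that carry fewer letters, though this card does not say which.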

Training procedure

  • Optimizer: AdamW, LR 1e-3 with cosine decay to 1e-4, 30 warmup iters, β₂ 0.99, batch size 64, dropout 0.2.
  • Run: 1100 iterations (converged β€” past the ~700 plateau), best val loss ~0.65.
  • On the higher val floor. v2's loss floor (~0.65) sits above v1's (0.43) on purpose: the injected drift is genuine entropy the model cannot fully fit, so "converged" here means plateaued on its own corpus, not low absolute loss. That is the point: the near-misses are learned structure, not undertraining.
  • Hardware: Apple Silicon Mac (MPS / Metal backend), torch.compile disabled.
  • Wall-clock: a few minutes (the corpus is small).
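The learning-rate schedule listed above (linear warmup, then cosine decay from 1e-3 to 1e-4) can be sketched in the usual nanoGPT style. The peak, floor, and warmup length come from this card. The decay horizon of 1100 iterations and the exact warmup formula are assumptions, and get_lr is an illustrative name rather than the engine's API.

```python
import math

MAX_LR = 1e-3        # from the card
MIN_LR = 1e-4        # from the card
WARMUP_ITERS = 30    # from the card
DECAY_ITERS = 1100   # assumption: decay spans the whole 1100-iteration run


def get_lr(it):
    """Learning rate at iteration `it`: linear warmup, then cosine decay to MIN_LR."""
    if it < WARMUP_ITERS:
        return MAX_LR * (it + 1) / (WARMUP_ITERS + 1)
    if it >= DECAY_ITERS:
        return MIN_LR
    progress = (it - WARMUP_ITERS) / (DECAY_ITERS - WARMUP_ITERS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return MIN_LR + coeff * (MAX_LR - MIN_LR)


for it in (0, 30, 565, 1100):
    print(it, f"{get_lr(it):.6f}")
```

With a warmup this short relative to the run, almost all of training happens on the cosine portion of the curve.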

Evaluation

The metric is the qualitative dream, now measured rather than eyeballed. eval_dream.py samples the checkpoint warm (temperature 0.9, ~430 lines) and reports two things at once: anchor-recall (fraction of lines verbatim = one of the nine anchors, and how many of the nine are covered) and a near-miss / garble breakdown (per word, edit-distance to the six canon words: 1–2 = near-miss, ≥3 = garble). The comparison against v1's checkpoints is the whole story:

| run | corpus | iters | val | anchor_hit | anchors covered | near-miss lines | garble lines | reading |
|---|---|---|---|---|---|---|---|---|
| r1 (v1, converged) | pristine | 2000 | 0.43 | 0.225 | 9/9 | 0.000 | 0.000 | crisp anchors, no near-misses |
| r3-mid (v1 champion) | pristine | 350 | 0.48 | 0.042 | 3/9 | 0.037 | 0.012 | near-misses only by undertraining, which couples them to blurred anchors |
| drift-r1 (v2) | drift 0.06 | 1100 | 0.65 | 0.138 | 9/9 | 0.331 | 0.002 | crisp anchors AND abundant near-misses (the win) |
| drift-r2 | drift 0.14 | 1100 | 0.85 | 0.131 | 8/9 | 0.592 | 0.035 | heavier drift: more near-misses, some garble, one lost anchor |

The v2 release (drift-r1) covers all nine anchors verbatim while carrying a near-miss on ~33% of lines with near-zero garble: ~9× the champion's near-miss rate and full anchor coverage, which the champion (3/9) never had. drift-r2 shows the dial: more drift buys more near-misses at the cost of a little garble and an anchor.
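The per-word near-miss / garble rule described in this section can be sketched as follows. This is not eval_dream.py: it assumes plain Levenshtein distance (under which an adjacent swap costs two edits), lowercasing with punctuation stripped, and a line-level rule in which one garbled word makes a garble line. The names edit_distance, classify_word, and classify_line are illustrative.

```python
import re

CANON_WORDS = ("you", "never", "did", "the", "kenosha", "kid")


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def classify_word(word):
    """Label a word by its distance to the nearest canon word."""
    d = min(edit_distance(word, canon) for canon in CANON_WORDS)
    if d == 0:
        return "exact"
    return "near-miss" if d <= 2 else "garble"


def classify_line(line):
    """Assumed line rule: garble beats near-miss, near-miss beats clean."""
    labels = {classify_word(w) for w in re.findall(r"[a-z]+", line.lower())}
    if "garble" in labels:
        return "garble"
    if "near-miss" in labels:
        return "near-miss"
    return "clean"


for sample in ("You never did the Kenosha Kid.", "Did never Kenosha kid the yyou?"):
    print(classify_line(sample), "|", sample)
```

The choice of distance matters for swapped letters: a metric that counts a transposition as a single edit would move some words from garble to near-miss.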

Representative samples (raw, uncherry-picked, temperature 0.9, from projects/kenosha-kid/runs/drift-samples.md):

You, Never? Did the Kenosha Kid?
You never did 'tthe,' Kenosha Kid!
Did never Kenosha kid the yyou?
iDd you the Kenosha never did
You never did the Kenosha Kid
Kneoshaa diid Kid the you. Neeer
Kenoshha you did Kid 'never', never?
You never did the Kenosha Kid.

Verbatim anchors ("You, Never? Did the Kenosha Kid?", "You never did the Kenosha Kid") sit right next to near-misses ("tthe", "yyou", "iDd", "Kneoshaa", "diid", "Neeer", "Kenoshha"), all in the same converged model.
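The anchor-recall half of the metric (share of lines that are verbatim anchors, and how many distinct anchors appear) can be sketched like this. The ANCHORS tuple is a placeholder holding only the two anchors quoted above, since the full list of nine is not reproduced in this card. Exact string matching after trimming whitespace is an assumption about what "verbatim" means, and anchor_recall is an illustrative name.

```python
# Placeholder: only the two anchors quoted in this card; the real set has nine.
ANCHORS = (
    "You, Never? Did the Kenosha Kid?",
    "You never did the Kenosha Kid",
)


def anchor_recall(lines, anchors=ANCHORS):
    """Return (hit rate over non-empty lines, distinct anchors seen, total anchors)."""
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    hits = [ln for ln in cleaned if ln in anchors]
    hit_rate = len(hits) / len(cleaned) if cleaned else 0.0
    return hit_rate, len(set(hits)), len(anchors)


samples = [
    "You, Never? Did the Kenosha Kid?",
    "You never did 'tthe,' Kenosha Kid!",
    "Did never Kenosha kid the yyou?",
    "You never did the Kenosha Kid",
]
rate, covered, total = anchor_recall(samples)
print(f"anchor_hit={rate:.3f}  covered={covered}/{total}")
```

Because the match is exact, a single drifted letter or a changed punctuation mark takes a line out of the anchor count, which is what makes anchor coverage a strict test of crispness.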

A comparison chart (v2 vs v1 baselines: anchor-coverage and near-miss line-rate as DRIFT_RATE climbs 0.0 → 0.06 → 0.14) would make the decoupling and the dial legible at a glance. It is not authored here: charts go through the tools/dataviz/ pipeline, and this card only describes it.

Limitations

Honest about what it is:

  • It says nothing but the six words. No semantics, no factual grounding, no instruction following: it is a next-character predictor over one phrase.
  • The drift is in the data, so it is bounded by the data. v2 dreams near-misses because the corpus contains them; it cannot invent drift the generator never emitted. DRIFT_RATE is the only handle on how much and how wild.
  • Higher drift trades away legibility. Push DRIFT_RATE up (see drift-r2) and garble rises and anchors start to fall; the sweet spot at 0.06 is a choice, not a free lunch.
  • Loss is not the objective, and v2's loss reads worse than v1's. v2's val floor (0.65) is higher than v1's (0.43) by design; comparing the two on loss inverts their quality. The dream-score, not perplexity, is the yardstick.
  • No weights in the tree (ADR-0002). The released folder ships code + corpus only; the checkpoint regenerates deterministically from config.py.

How to reproduce

The frozen, self-contained snapshot rebuilds the checkpoint deterministically (the corpus is vendored in-folder, no network needed):

cd projects/kenosha-kid/models/kenosha-kid-nanogpt-2
python generate.py            # (optional) rewrites raw.txt identically (DRIFT_RATE=0.06)
python prepare.py             # raw.txt -> kenosha/{train,val}.bin + meta.pkl
python train.py config.py     # -> ./ckpt.pt  (converged, 1100 iters, val ~0.65)
python sample.py --out_dir=. --data_root=. --device=cpu --start=$'\n' --temperature=0.9
python eval_dream.py --device=cpu --num_samples=40   # the dream-score

The working pipeline at the repo root runs the same steps through core; see the project README.md and the v1 write-up dream-a-single-phrase.md, whose closing line, "a corpus that itself drifts", this model implements.

Citation / credits

  • The shared core engine (modern nanoGPT lineage: RoPE, RMSNorm, bias-free).
  • Darius Kazemi, @YouNeverDidThe (2013): the bot generate.py reimplements deterministically.
  • Thomas Pynchon, Gravity's Rainbow (1973), I.10: the nine construals are the anchors; the phrase is reproduced as a behavior, not its text. Provenance in projects/kenosha-kid/docs/sources.md.
  • Set up and trained with Claude (Claude Code).