EBT × spectral control — load & play
Companion artifacts for "Replacing EBT's stability heuristics with principled spectral control" (toy + small-transformer study of Energy-Based Transformers, Gladstone et al. arXiv:2507.02092).
A real (tiny, 1.26M-param) causal-transformer EBT: per-token energy E(h_t, ŷ) over a continuous
C-dim candidate, inner gradient descent on ŷ, second-order training, TinyStories-BPE (vocab 4096).
λmax for the adaptive inner step is estimated by power-iteration on the HVP — no exact Hessian.
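The estimate described above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the code in ebt_small.py: the names `grad_energy`, `hvp`, and `lambda_max` are hypothetical, the energy is a toy quadratic, and the Hessian-vector product is taken by central differences of the gradient, whereas a PyTorch training script would more plausibly get it from autograd.

```python
import math
import random

def grad_energy(y, H):
    """Gradient of the toy quadratic energy E(y) = 0.5 * y^T H y, i.e. H @ y."""
    n = len(y)
    return [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]

def hvp(grad_fn, y, v, eps=1e-4):
    """Hessian-vector product H(y) @ v from central differences of the gradient.

    Only gradient evaluations are needed; the Hessian is never materialised.
    """
    y_plus = [yi + eps * vi for yi, vi in zip(y, v)]
    y_minus = [yi - eps * vi for yi, vi in zip(y, v)]
    g_plus, g_minus = grad_fn(y_plus), grad_fn(y_minus)
    return [(gp - gm) / (2.0 * eps) for gp, gm in zip(g_plus, g_minus)]

def lambda_max(grad_fn, y, iters=50, seed=0):
    """Power iteration on the HVP; returns a Rayleigh-quotient estimate.

    Power iteration converges to the eigenvalue of largest magnitude, which
    equals lambda_max when the Hessian is positive semi-definite.
    """
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in y]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    lam = 0.0
    for _ in range(iters):
        hv = hvp(grad_fn, y, v)
        lam = sum(a * b for a, b in zip(v, hv))  # Rayleigh quotient v^T H v
        norm = math.sqrt(sum(x * x for x in hv))
        if norm == 0.0:
            break
        v = [x / norm for x in hv]
    return lam

# Toy check against a Hessian whose top eigenvalue is known: (7 + sqrt(5)) / 2.
H_toy = [[4.0, 1.0], [1.0, 3.0]]
est = lambda_max(lambda y: grad_energy(y, H_toy), [0.3, -0.2])
alpha = 1.0 / est  # adaptive inner step alpha = c / lambda_max with c = 1
```

Each iteration costs two gradient evaluations here (one extra backward pass with an autograd HVP), so a handful of iterations per inner step is usually the practical budget.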
Checkpoints
| file | recipe | val CE (random = 8.32) |
|---|---|---|
| ebt_baseline.pt | esharp=1 (well-posed), plain | ~5 |
| ebt_naked_esharp8.pt | sharpened energy init, NO control | ~350 (diverged: the failure) |
| ebt_ours_esharp8.pt | same sharpened init + spectral control (α=c/λmax, power-iteration) | ~5.7 (recovers baseline) |
On the same sharpened landscape, the naked, Langevin, and clamp variants all diverge (227–381); only the spectrally controlled model trains.
The point: EBT's stability heuristics are stand-ins for landscape conditioning. Keeping the inner step within α·λmax < 2
replaces them with a provable, tuning-free guarantee, and in particular replaces the randomized-step-size heuristic.
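For reference, the step-size condition follows from the standard gradient-descent argument on a local quadratic model. The sketch below assumes the energy is well approximated near a minimiser ŷ* by a quadratic with symmetric positive-definite Hessian H (eigenvalues 0 < λ_i ≤ λmax); it is the textbook derivation and not necessarily the exact statement proved in the companion write-up.

$$
\begin{aligned}
E(\hat{y}) &\approx E(\hat{y}^\ast) + \tfrac{1}{2}\,(\hat{y}-\hat{y}^\ast)^\top H\,(\hat{y}-\hat{y}^\ast),\\
\hat{y}_{k+1} &= \hat{y}_k - \alpha\,\nabla E(\hat{y}_k)
 \;\Longrightarrow\; e_{k+1} = (I - \alpha H)\,e_k, \qquad e_k = \hat{y}_k - \hat{y}^\ast,\\
|1 - \alpha\lambda_i| < 1 \;\;\forall i
 &\iff 0 < \alpha\,\lambda_i < 2 \;\;\forall i
 \iff 0 < \alpha\,\lambda_{\max} < 2,\\
\alpha = \frac{c}{\lambda_{\max}}
 &\;\Longrightarrow\; \alpha\,\lambda_i = c\,\frac{\lambda_i}{\lambda_{\max}} \in (0,\,c],
 \quad\text{so any } c \in (0,2) \text{ contracts every eigendirection.}
\end{aligned}
$$

With c = 1 the error along the sharpest direction is removed in a single step, and no choice of landscape scale can push α·λmax past 2, since α rescales with the measured curvature.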
Load & play
See the companion notebook (EBT_spectral_control.ipynb). It downloads these checkpoints, evaluates
val CE, races adaptive-α against fixed-α inner optimization, and generates text token by token so you can
watch the per-token "thinking" (inner optimization) and compare the controlled and uncontrolled models.
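To get a feel for the adaptive-α versus fixed-α race without downloading anything, here is a small stand-in on a 2D quadratic energy. It is an illustrative sketch, not the notebook's code: the functions `top_eig_2x2`, `inner_descent`, and `race` are hypothetical, and modelling "sharpening" as a scalar multiple of the Hessian is an assumption about what a sharper landscape looks like, not a description of the --esharp flag.

```python
import math

def top_eig_2x2(a, b, d):
    """Largest eigenvalue of the symmetric matrix [[a, b], [b, d]] (closed form)."""
    mean = 0.5 * (a + d)
    return mean + math.sqrt((0.5 * (a - d)) ** 2 + b * b)

def inner_descent(H, y0, alpha, steps):
    """Gradient descent on E(y) = 0.5 * y^T H y; returns the energy after each step."""
    (a, b), (_, d) = H
    y = list(y0)
    energies = []
    for _ in range(steps):
        g = (a * y[0] + b * y[1], b * y[0] + d * y[1])  # gradient H @ y
        y = [y[0] - alpha * g[0], y[1] - alpha * g[1]]
        energies.append(0.5 * (a * y[0] ** 2 + 2 * b * y[0] * y[1] + d * y[1] ** 2))
    return energies

def race(sharpness, fixed_alpha=0.5, c=1.0, steps=30):
    """Run fixed-step and adaptive-step descent on the same scaled landscape.

    Returns (lambda_max, final energy with fixed alpha, final energy with c / lambda_max).
    """
    base = ((2.0, 0.5), (0.5, 1.0))  # positive-definite toy Hessian
    H = tuple(tuple(sharpness * x for x in row) for row in base)
    lam = top_eig_2x2(H[0][0], H[0][1], H[1][1])
    fixed = inner_descent(H, (1.0, 1.0), fixed_alpha, steps)
    adaptive = inner_descent(H, (1.0, 1.0), c / lam, steps)
    return lam, fixed[-1], adaptive[-1]

lam_1, fixed_1, adaptive_1 = race(sharpness=1.0)  # fixed_alpha * lam_1 < 2: both converge
lam_8, fixed_8, adaptive_8 = race(sharpness=8.0)  # fixed_alpha * lam_8 > 2: fixed step blows up
```

At sharpness 1 the fixed step 0.5 happens to satisfy α·λmax < 2, so both runs converge; at sharpness 8 the same fixed step violates the bound and the energy grows every iteration, while the adaptive step shrinks with the curvature and still converges.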
Data: val.bin (8 MB TinyStories-BPE val shard, uint16) + tokenizer.json (BPE, vocab 4096).
Training script: ebt_small.py (e.g. python ebt_small.py --esharp 8 --adapt_c 1.0 --diag_speed).
2D toy with exact 2×2 Hessian: toy_ebt.py.
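The val.bin shard listed above is described as a flat uint16 token stream, so it can be read with the standard library alone. The sketch below assumes there is no file header and that the values are little-endian (both are assumptions, as is the helper name `load_tokens`); the second helper reproduces the "random = 8.32" reference point from the checkpoint table as the cross-entropy of a uniform guess over 4096 tokens.

```python
import math
import os
import sys
from array import array

def load_tokens(path, vocab_size=4096):
    """Read a headerless uint16 token file into an array of ints.

    Assumes little-endian storage; values are byte-swapped on big-endian hosts.
    """
    n_bytes = os.path.getsize(path)
    if n_bytes % 2:
        raise ValueError("file size is not a multiple of 2 bytes")
    tokens = array("H")  # unsigned 16-bit on common platforms
    if tokens.itemsize != 2:
        raise RuntimeError("array('H') is not 2 bytes wide on this platform")
    with open(path, "rb") as f:
        tokens.fromfile(f, n_bytes // 2)
    if sys.byteorder == "big":
        tokens.byteswap()
    if len(tokens) and max(tokens) >= vocab_size:
        raise ValueError("token id outside the vocabulary")
    return tokens

def uniform_cross_entropy(vocab_size):
    """Cross-entropy in nats of a uniform guess over the vocabulary: ln(V)."""
    return math.log(vocab_size)

if __name__ == "__main__" and os.path.exists("val.bin"):
    toks = load_tokens("val.bin")
    print(len(toks), "tokens; uniform-guess CE =", round(uniform_cross_entropy(4096), 2))
```

A validation cross-entropy well above ln(4096) ≈ 8.32, as for the diverged checkpoint, means the model is doing worse than a uniform guess.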