BLT-LLM - TinyStories (≈55M)
A Byte Latent Transformer (BLT) built from scratch in PyTorch and trained on TinyStories. BLT (Meta FAIR, Pagnoni et al. 2024, "Byte Latent Transformer: Patches Scale Better Than Tokens", arXiv:2412.09871) is tokenizer-free: it reads raw bytes (0–255) and dynamically groups them into patches, spending compute where the next byte is hard to predict and skimming where it is easy.
A tiny, fully working reference implementation, trained on one 6 GB consumer GPU (RTX 3050) to
prove the architecture end-to-end. The code is included in this repo (the blt/ package).
Source / issues: https://github.com/shaikh-saud705/blt-llm
Result
| Metric | Value |
|---|---|
| Held-out bits-per-byte (BPB) | 0.71 (best 0.7078 @ step 19500) |
| Untrained baseline | 8.0 (= log₂ 256) |
| Params (main model) | 55.4M |
| Entropy patcher (frozen) | 1.7M |
Sample (prompt "Once upon a time", temp 0.7):
Once upon a time, in a small house, there was a boy named Tim. One day, Tim went to the store with his mom. They needed to buy a toy… Tim said to his mom, "Mom, can I give t…
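
The bits-per-byte figures in the table above relate to an ordinary cross-entropy loss by a change of logarithm base. The sketch below assumes the loss is the mean next-byte cross-entropy in nats (the unit PyTorch's cross-entropy uses); `bits_per_byte` is an illustrative name, not a function from the blt/ package.

```python
import math

def bits_per_byte(mean_loss_nats: float) -> float:
    """Convert a mean next-byte cross-entropy in nats to bits per byte."""
    return mean_loss_nats / math.log(2)

# A uniform guess over the 256 byte values costs ln(256) nats, i.e. 8 bits.
untrained = bits_per_byte(math.log(256))

# The other direction: the reported 0.71 BPB expressed in nats per byte.
reported_nats = 0.71 * math.log(2)

print(f"untrained baseline: {untrained:.2f} BPB")
print(f"0.71 BPB = {reported_nats:.3f} nats/byte")
```

Because the unit is per byte rather than per token, the number does not depend on any tokenizer.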
Run it
This repo contains both the code and the weights, so you can run it directly:
git clone https://huggingface.co/sssssaud/blt-llm-tinystories-55m
cd blt-llm-tinystories-55m
pip install -r requirements.txt
python -m blt.generate --prompt "Once upon a time" --max-new 300 --temperature 0.7
Runs on CPU if you have no GPU (just slower). All three checkpoint files ship in
checkpoints/: blt_model_weights.pt (model), entropy_model.pt (frozen patcher), and
patcher_threshold.json (θ).
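
Temperature sampling conventionally divides the next-byte logits by the temperature before the softmax, so values below 1 sharpen the distribution. The following is a dependency-free sketch of that idea over a 256-way byte distribution; `sample_byte` and the toy logits are hypothetical, and the actual blt.generate code may differ (for example, it may apply additional filtering).

```python
import math
import random

def sample_byte(logits, temperature=0.7, rng=random):
    """Sample one byte value (0-255) from logits after temperature scaling."""
    if temperature <= 0:
        # Assumption: treat a non-positive temperature as greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# Toy logits: the byte for "e" is preferred over the other 255 values.
logits = [0.0] * 256
logits[ord("e")] = 3.0

rng = random.Random(0)
samples = [sample_byte(logits, temperature=0.7, rng=rng) for _ in range(1000)]
print("share of 'e' at T=0.7:", samples.count(ord("e")) / len(samples))
```

Lowering the temperature further pushes the share of the preferred byte toward 1; raising it toward 1.0 samples from the unmodified distribution.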
Architecture
Three modules joined by two cross-attentions. All masks are built from patch_ids, so patches
are variable-length and there is no fixed reshape:
- Local Encoder: 1 layer, dim 256, 4 heads, windowed-causal (window 128). Byte embed + hash n-gram embeds (n=3..8). Encoder cross-attn pools bytes → patches (max-pool seed).
- Latent Global Transformer: 6 layers, dim 768, 12 heads, block-causal over patches (holds the bulk of the params).
- Local Decoder: 4 layers, dim 256, 4 heads, windowed-causal. Decoder cross-attn expands patches → bytes (byte i attends the previous patch, so strictly causal).
- Shared: RMSNorm, SwiGLU, RoPE (θ=500000) in self-attn only, tied byte embed/output. k=3.
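
To make the "masks from patch_ids" idea concrete, here is a small sketch that builds three boolean masks from a per-byte patch id list: the windowed-causal self-attention mask, the encoder cross-attention mask (patch pools its own bytes), and the decoder cross-attention mask (byte reads the previous patch). The function names and the plain-list representation are assumptions for illustration; the blt/ package presumably builds tensors, and how it handles bytes in the very first patch (which have no previous patch) is not stated here.

```python
def windowed_causal_mask(n_bytes, window):
    """mask[i][j] is True when byte i may attend byte j (j <= i, inside the window)."""
    return [[0 <= i - j < window for j in range(n_bytes)] for i in range(n_bytes)]

def encoder_cross_mask(patch_ids):
    """mask[p][j] is True when patch p pools byte j, i.e. byte j belongs to patch p."""
    n_patches = patch_ids[-1] + 1
    return [[pid == p for pid in patch_ids] for p in range(n_patches)]

def decoder_cross_mask(patch_ids):
    """mask[i][p] is True when byte i reads patch p, the patch before its own."""
    n_patches = patch_ids[-1] + 1
    return [[p == pid - 1 for p in range(n_patches)] for pid in patch_ids]

def show(name, mask):
    """Print a mask with 'x' for allowed and '.' for blocked positions."""
    print(name)
    for row in mask:
        print("  " + "".join("x" if v else "." for v in row))

# Eight bytes split into three variable-length patches of sizes 3, 1 and 4.
patch_ids = [0, 0, 0, 1, 2, 2, 2, 2]

win = windowed_causal_mask(len(patch_ids), window=4)
enc = encoder_cross_mask(patch_ids)
dec = decoder_cross_mask(patch_ids)

show("windowed-causal (bytes x bytes)", win)
show("encoder cross-attn (patches x bytes)", enc)
show("decoder cross-attn (bytes x patches)", dec)

# Causality check: every byte that feeds a patch read by byte i comes before i.
leaks = [
    (i, j)
    for i, row in enumerate(dec)
    for p, allowed in enumerate(row) if allowed
    for j, in_patch in enumerate(enc[p]) if in_patch and j >= i
]
print("future leaks:", leaks)
```

Because the masks are derived from patch_ids rather than from a fixed stride, the same code works for any patch length the patcher produces.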
Dynamic patching is done by a separate, frozen entropy byte-LM (entropy_model.pt, 2×256, 1.7M
params): a byte starts a new patch when the next-byte entropy H > θ (θ=1.0917, which gives
≈4.4 bytes/patch on average). The patcher is causal, so it works during generation. The
implementation passes a strict no-future-leakage gate (gradient + perturbation tests).
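
The thresholding rule above can be sketched in a few lines. This assumes entropies[t] is the entropy of the patcher's prediction for byte t given only the bytes before it, and that entropy is measured in nats; the unit used for the tuned threshold is not stated in this README, and the function names are illustrative rather than taken from the blt/ package.

```python
import math

def entropy(probs):
    """Shannon entropy of one next-byte distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def patch_ids_from_entropy(entropies, theta):
    """Give every byte a patch id; a new patch starts wherever H > theta."""
    ids, current = [], 0
    for t, h in enumerate(entropies):
        if t > 0 and h > theta:  # the first byte always opens patch 0
            current += 1
        ids.append(current)
    return ids

# A uniform prediction over 256 bytes is maximally uncertain: ln(256) nats.
print("uniform entropy:", round(entropy([1 / 256] * 256), 3))

# Toy per-byte entropies: hard-to-predict bytes at positions 0, 3 and 6.
entropies = [2.1, 0.3, 0.2, 1.5, 0.1, 0.4, 1.2, 0.05]
ids = patch_ids_from_entropy(entropies, theta=1.0917)
n_patches = ids[-1] + 1

print("patch ids:", ids)
print("avg bytes/patch:", len(ids) / n_patches)
```

Each decision uses only the entropy at the current position, which itself depends only on earlier bytes, so the same rule can run one byte at a time during generation.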
Files
| File | What |
|---|---|
| blt/ | the model code (encoder, global, decoder, patcher, generate, …) |
| checkpoints/blt_model_weights.pt | main BLT weights (model-only) + embedded config + θ |
| checkpoints/entropy_model.pt | frozen entropy patcher |
| checkpoints/patcher_threshold.json | tuned threshold θ |
| requirements.txt | deps (torch, numpy, huggingface_hub) |
| BLT_LLM.md | full build spec |
Training
- Corpus: roneneldan/TinyStories (TinyStoriesV2-GPT4 train split, ~2.2B bytes).
- 20,000 steps, effective batch 16 (8 × grad-accum 2), seq-len 512.
- AdamW β=(0.9, 0.95), wd 0.1, grad clip 1.0; cosine LR 4e-4, 300-step warmup.
- ≈2.4 h on an RTX 3050 6 GB. Metric: bits-per-byte (no fixed tokenizer, so no perplexity).
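
The learning-rate schedule listed above can be sketched as a linear warmup followed by cosine decay. The peak LR, warmup length and step count come from the list; the linear warmup shape and the decay floor of 0 are assumptions, since the README does not state them, and `lr_at` is an illustrative name.

```python
import math

PEAK_LR = 4e-4
WARMUP_STEPS = 300
TOTAL_STEPS = 20_000
MIN_LR = 0.0  # assumption: the floor of the cosine decay is not stated

def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay toward MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

for step in (0, 150, 300, 10_150, 19_999):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")

# Bytes consumed over the run, if a batch holds 16 sequences of 512 bytes each.
bytes_seen = 16 * 512 * TOTAL_STEPS
print(f"bytes seen: {bytes_seen:,}")
```

Under that reading of the batch size, the run sees about 164M bytes, a small fraction of the ~2.2B-byte corpus.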
License
MIT. The BLT architecture is from Meta FAIR's paper (cited above); this is an independent from-scratch reimplementation.