BLT-LLM — TinyStories (≈55M)

A Byte Latent Transformer (BLT) built from scratch in PyTorch and trained on TinyStories. BLT (Meta FAIR; Pagnoni et al., 2024, "Byte Latent Transformer: Patches Scale Better Than Tokens", arXiv:2412.09871) is tokenizer-free: it reads raw bytes (0–255) and dynamically groups them into patches, spending more compute where the next byte is hard to predict and less where it is easy.
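To make the "raw bytes" input concrete, here is a minimal sketch of how text maps onto the 256-symbol alphabet. It does not use the repo's code; `to_byte_ids` and `from_byte_ids` are hypothetical helper names chosen for illustration.

```python
def to_byte_ids(text: str) -> list[int]:
    """Encode text as UTF-8 and return the byte values (each in 0..255)."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids: list[int]) -> str:
    """Inverse mapping; invalid UTF-8 sequences are replaced rather than raising."""
    return bytes(ids).decode("utf-8", errors="replace")


prompt_ids = to_byte_ids("Once upon a time")
print(len(prompt_ids), prompt_ids[:4])   # 16 [79, 110, 99, 101]
print(to_byte_ids("é"))                  # [195, 169]
```

Non-ASCII characters simply take more than one byte, so no vocabulary file or out-of-vocabulary handling is needed.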

This is a tiny, fully working reference implementation, trained on a single 6 GB consumer GPU (RTX 3050) to prove the architecture end to end. The code is included in this repo (the blt/ package). Source / issues: https://github.com/shaikh-saud705/blt-llm

Result

| Metric | Value |
| --- | --- |
| Held-out bits-per-byte (BPB) | 0.71 (best 0.7078 @ step 19500) |
| Untrained baseline | 8.0 (= log₂ 256) |
| Params (main model) | 55.4M |
| Entropy patcher (frozen) | 1.7M |
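The 8.0 baseline and the BPB figure can be related to a training loss with a short conversion. The sketch below assumes the loss is a mean per-byte cross-entropy in nats (which is what PyTorch's cross-entropy returns); `bits_per_byte` is a hypothetical helper, not a function from the blt/ package.

```python
import math

VOCAB_SIZE = 256  # one class per possible byte value


def bits_per_byte(mean_nll_nats: float) -> float:
    """Convert a mean per-byte cross-entropy in nats to bits per byte."""
    return mean_nll_nats / math.log(2)


# A uniform guess over 256 bytes costs ln(256) nats = 8 bits per byte.
uniform_bpb = bits_per_byte(math.log(VOCAB_SIZE))
print(round(uniform_bpb, 6))  # 8.0

# Going the other way: the reported 0.7078 BPB corresponds to this loss in nats.
best_nll_nats = 0.7078 * math.log(2)
print(round(best_nll_nats, 4))  # 0.4906
```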

Sample (prompt "Once upon a time", temp 0.7):

Once upon a time, in a small house, there was a boy named Tim. One day, Tim went to the store with his mom. They needed to buy a toy… Tim said to his mom, "Mom, can I give t…

Run it

This repo contains both the code and the weights, so you can run it directly:

git clone https://huggingface.co/sssssaud/blt-llm-tinystories-55m
cd blt-llm-tinystories-55m
pip install -r requirements.txt

python -m blt.generate --prompt "Once upon a time" --max-new 300 --temperature 0.7

It runs on CPU if you have no GPU, just more slowly. All three checkpoint files are included in checkpoints/: blt_model_weights.pt (model), entropy_model.pt (frozen patcher), patcher_threshold.json (θ).
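For readers unfamiliar with the `--temperature` flag, the following is a sketch of conventional temperature sampling over 256 byte logits. It is an assumption that `blt.generate` behaves this way; `sample_byte` is a hypothetical function written for illustration, not the repo's sampler.

```python
import math
import random


def sample_byte(logits: list[float], temperature: float, rng: random.Random) -> int:
    """Sample one byte id from unnormalised logits after temperature scaling."""
    if temperature <= 0:
        # Treat T <= 0 as greedy decoding (an assumption of this sketch).
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)
logits = [0.0] * 256
logits[ord("a")] = 5.0                      # make one byte much more likely
draws = [sample_byte(logits, 0.7, rng) for _ in range(1000)]
print(draws.count(ord("a")) / 1000)         # fraction of draws that picked "a"
```

Temperatures below 1 sharpen the distribution toward the most likely byte; temperatures above 1 flatten it.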

Architecture

Three modules joined by two cross-attentions. All attention masks are built from patch_ids, so patches can have variable length and no fixed-size reshape is needed:

Local Encoder — 1 layer, dim 256, 4 heads, windowed-causal (window 128). Byte embed + hash n-gram embeds (n=3..8). Encoder cross-attn pools bytes → patches (max-pool seed).
Latent Global Transformer — 6 layers, dim 768, 12 heads, block-causal over patches (holds the bulk of the params).
Local Decoder — 4 layers, dim 256, 4 heads, windowed-causal. Decoder cross-attn expands patches → bytes (byte i attends the previous patch → strictly causal).
Shared: RMSNorm, SwiGLU, RoPE (θ=500000) in self-attn only, tied byte embed/output. k=3.
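The masks described above can be sketched from a `patch_ids` list alone. This is an illustrative reading of the text, not the repo's implementation: the function names are hypothetical, the window is assumed to include the current byte, and the decoder mask assumes "previous patch" means index `patch_id - 1` (which leaves bytes of the first patch with nothing to attend to; how the repo handles that case is not stated here).

```python
def windowed_causal_mask(n: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when byte i may attend to byte j (j <= i, within the window)."""
    return [[0 <= i - j < window for j in range(n)] for i in range(n)]


def pooling_mask(patch_ids: list[int]) -> list[list[bool]]:
    """Encoder cross-attn: mask[p][i] is True when byte i belongs to patch p."""
    n_patches = max(patch_ids) + 1
    return [[pid == p for pid in patch_ids] for p in range(n_patches)]


def decoder_mask(patch_ids: list[int]) -> list[list[bool]]:
    """Decoder cross-attn: mask[i][p] is True when p is the patch just before byte i's patch."""
    n_patches = max(patch_ids) + 1
    return [[p == pid - 1 for p in range(n_patches)] for pid in patch_ids]


patch_ids = [0, 0, 1, 1, 1, 2]          # six bytes grouped into three patches
print(windowed_causal_mask(4, 2)[3])    # [False, False, True, True]
print(pooling_mask(patch_ids)[1])       # [False, False, True, True, True, False]
print(decoder_mask(patch_ids)[5])       # [False, True, False]
```

Because every mask is a function of `patch_ids`, patches of different lengths need no padding to a fixed patch size.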

Dynamic patching is done by a separate, frozen entropy byte-LM (entropy_model.pt, 2×256, 1.7M params): a byte starts a new patch when the next-byte entropy H exceeds θ (θ=1.0917, which gives ≈ 4.4 bytes per patch on average). The patcher is causal, so it also works during generation. The implementation passes a strict no-future-leakage gate (gradient and perturbation tests).
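The threshold rule can be sketched as follows. The sketch takes per-byte entropies as given and assumes `entropies[i]` is the patcher's predictive entropy for byte i given the bytes before it; the exact alignment and the entropy units used in the repo are not stated here, and `patch_ids_from_entropy` is a hypothetical name.

```python
def patch_ids_from_entropy(entropies: list[float], theta: float) -> list[int]:
    """Assign each byte a patch index; a byte opens a new patch when its entropy exceeds theta."""
    patch_ids = []
    current = -1
    for i, h in enumerate(entropies):
        if i == 0 or h > theta:   # the first byte always opens patch 0
            current += 1
        patch_ids.append(current)
    return patch_ids


THETA = 1.0917                                               # threshold quoted above
entropies = [2.3, 0.2, 0.4, 1.5, 0.1, 0.3, 0.2, 1.2, 0.6]    # made-up values
ids = patch_ids_from_entropy(entropies, THETA)
print(ids)                                                   # [0, 0, 0, 1, 1, 1, 1, 2, 2]
print(len(ids) / (ids[-1] + 1))                              # 3.0 bytes per patch on this toy input
```

Each decision uses only the entropy at the current position, so the same rule can run one byte at a time during generation.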

Files

| File | What |
| --- | --- |
| `blt/` | the model code (encoder, global, decoder, patcher, generate, …) |
| `checkpoints/blt_model_weights.pt` | main BLT weights (model-only) + embedded config + θ |
| `checkpoints/entropy_model.pt` | frozen entropy patcher |
| `checkpoints/patcher_threshold.json` | tuned threshold θ |
| `requirements.txt` | deps (torch, numpy, huggingface_hub) |
| `BLT_LLM.md` | full build spec |

Training

Corpus: roneneldan/TinyStories (TinyStoriesV2-GPT4 train split, ~2.2B bytes).
20,000 steps, effective batch 16 (8 × grad-accum 2), seq-len 512.
AdamW with β = (0.9, 0.95), weight decay 0.1, grad clip 1.0; cosine LR 4e-4, 300-step warmup.
≈2.4 h on an RTX 3050 6 GB. Metric: bits-per-byte (with no fixed tokenizer, there is no token-level perplexity to report).
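The numbers above imply a total training budget and a learning-rate curve that can be sketched as below. Assumptions: seq-len counts bytes, 4e-4 is the peak rate reached at the end of warmup, warmup is linear, and the cosine decays to zero (`MIN_LR` is a placeholder, since the final rate is not stated). `lr_at` is a hypothetical helper, not the repo's scheduler.

```python
import math

STEPS, WARMUP, PEAK_LR = 20_000, 300, 4e-4
BATCH, SEQ_LEN = 16, 512
MIN_LR = 0.0  # assumption: the final learning rate is not stated above


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR at the last step."""
    if step < WARMUP:
        return PEAK_LR * (step + 1) / WARMUP
    progress = (step - WARMUP) / max(1, STEPS - WARMUP)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


bytes_seen = STEPS * BATCH * SEQ_LEN
print(bytes_seen)                       # 163840000
print(round(bytes_seen / 2.2e9, 3))     # 0.074 (fraction of the ~2.2B-byte corpus)
print(lr_at(WARMUP))                    # 0.0004
```

Under these assumptions the run sees roughly 164M bytes, well under one pass over the corpus.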

License

MIT. BLT architecture is from Meta FAIR's paper (linked above); this is an independent from-scratch reimplementation.
