BLT-LLM: TinyStories (≈55M)

A Byte Latent Transformer (BLT) built from scratch in PyTorch and trained on TinyStories. BLT (Meta FAIR, Pagnoni et al. 2024, "Patches Scale Better Than Tokens", arXiv:2412.09871) is tokenizer-free: it reads raw bytes (0–255) and dynamically groups them into patches, spending compute where the next byte is hard to predict and skimming where it's easy.

A tiny, fully working reference implementation, trained on a single 6 GB consumer GPU (RTX 3050) to prove the architecture end to end. Code is included in this repo (the blt/ package). Source / issues: https://github.com/shaikh-saud705/blt-llm

Result

| Metric | Value |
| --- | --- |
| Held-out bits-per-byte (BPB) | 0.71 (best 0.7078 @ step 19500) |
| Untrained baseline | 8.0 (= log₂256) |
| Params (main model) | 55.4M |
| Entropy patcher (frozen) | 1.7M |
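
The 8.0 baseline follows from the definition of bits-per-byte: a model that spreads probability uniformly over 256 byte values pays ln 256 nats per byte, and dividing by ln 2 converts nats to bits. The sketch below shows that conversion. The function name `bits_per_byte` is hypothetical and is not taken from the blt/ package; it assumes the evaluation loss is a summed cross-entropy in nats.

```python
import math

def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Convert a summed next-byte cross-entropy (in nats) to bits per byte."""
    return total_nll_nats / (num_bytes * math.log(2))

# Uniform guessing over 256 byte values costs ln(256) nats per byte,
# which is the 8.0 BPB baseline quoted for an untrained model.
n = 1000
uniform_bpb = bits_per_byte(n * math.log(256), n)

# Going the other way: a reported BPB corresponds to this mean loss in nats.
mean_nats_at_best = 0.7078 * math.log(2)
```

If a training loop reports mean cross-entropy per byte rather than a sum, the same conversion applies with `num_bytes=1`.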

Sample (prompt "Once upon a time", temp 0.7):

Once upon a time, in a small house, there was a boy named Tim. One day, Tim went to the store with his mom. They needed to buy a toy… Tim said to his mom, "Mom, can I give t…

Run it

This repo contains both the code and the weights, so you can run it directly:

git clone https://huggingface.co/sssssaud/blt-llm-tinystories-55m
cd blt-llm-tinystories-55m
pip install -r requirements.txt

python -m blt.generate --prompt "Once upon a time" --max-new 300 --temperature 0.7

Runs on CPU if you have no GPU (just slower). All three files in checkpoints/ are included: blt_model_weights.pt (model), entropy_model.pt (frozen patcher), patcher_threshold.json (θ).

Architecture

Three modules joined by two cross-attentions (all masks from patch_ids: variable-length patches, no fixed reshape):

  • Local Encoder: 1 layer, dim 256, 4 heads, windowed-causal (window 128). Byte embed + hash n-gram embeds (n=3..8). Encoder cross-attn pools bytes → patches (max-pool seed).
  • Latent Global Transformer: 6 layers, dim 768, 12 heads, block-causal over patches (holds the bulk of the params).
  • Local Decoder: 4 layers, dim 256, 4 heads, windowed-causal. Decoder cross-attn expands patches → bytes (byte i attends the previous patch → strictly causal).
  • Shared: RMSNorm, SwiGLU, RoPE (θ=500000) in self-attn only, tied byte embed/output. k=3.
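
The two cross-attention masks described in the list above can be sketched in plain Python. This is a minimal illustration, not the code in blt/: the function names are hypothetical, masks are nested lists of booleans rather than tensors, and it assumes patch_ids holds one non-decreasing patch index per byte, starting at 0. How the real implementation treats bytes in the first patch (which have no previous patch) is not stated in this card, so the sketch simply leaves those rows empty.

```python
from typing import List

Mask = List[List[bool]]  # mask[q][k] is True when query q may attend key k

def encoder_pool_mask(patch_ids: List[int]) -> Mask:
    """Bytes -> patches: patch p (query) sees exactly the bytes in patch p."""
    num_patches = max(patch_ids) + 1
    return [[pid == p for pid in patch_ids] for p in range(num_patches)]

def decoder_expand_mask(patch_ids: List[int]) -> Mask:
    """Patches -> bytes: byte i (query) sees only the patch before its own,
    so no byte can read a patch summary that contains itself."""
    num_patches = max(patch_ids) + 1
    return [[k == pid - 1 for k in range(num_patches)] for pid in patch_ids]

# Three variable-length patches of 3, 2 and 4 bytes; no reshape is needed
# because both masks are built directly from the per-byte patch index.
patch_ids = [0, 0, 0, 1, 1, 2, 2, 2, 2]
enc = encoder_pool_mask(patch_ids)    # shape: 3 patches x 9 bytes
dec = decoder_expand_mask(patch_ids)  # shape: 9 bytes x 3 patches
```

Because both masks depend only on patch_ids, changing the patch boundaries changes the masks without touching any tensor shapes.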

Dynamic patching is done by a separate, frozen entropy byte-LM (entropy_model.pt, 2×256, 1.7M params): a byte starts a new patch when next-byte entropy H > θ (θ=1.0917 → avg ≈ 4.4 bytes/patch). The patcher is causal, so it works during generation. The implementation passes a strict no-future-leakage gate (gradient + perturbation tests).
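
The thresholding rule above can be sketched as follows. This is an illustration under stated assumptions, not the patcher in blt/: the function names are hypothetical, `entropies[i]` is taken to be the entropy of the patcher's prediction for byte i given only the bytes before it, and the entropy is computed in nats. The card does not state which unit θ was tuned in, so the unit here is an assumption and must match the real patcher's.

```python
import math
from typing import List, Sequence

def entropy_nats(probs: Sequence[float]) -> float:
    """Shannon entropy (nats) of one next-byte distribution over 256 values."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def patch_ids_from_entropy(entropies: Sequence[float], theta: float) -> List[int]:
    """Give every byte a patch index. Byte i opens a new patch when
    entropies[i] > theta; the first byte always opens patch 0."""
    ids: List[int] = []
    current = 0
    for i, h in enumerate(entropies):
        if i > 0 and h > theta:
            current += 1
        ids.append(current)
    return ids

# Made-up entropies for nine bytes: hard-to-predict bytes (above theta)
# open a new patch, easy bytes extend the current one.
entropies = [2.0, 0.3, 0.2, 1.5, 0.4, 1.2, 0.1, 0.1, 0.9]
ids = patch_ids_from_entropy(entropies, theta=1.0917)
avg_bytes_per_patch = len(ids) / (ids[-1] + 1)
```

Each decision uses only the entropy at the current position, which is why the same rule can run one byte at a time during generation.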

Files

| File | What |
| --- | --- |
| blt/ | the model code (encoder, global, decoder, patcher, generate, …) |
| checkpoints/blt_model_weights.pt | main BLT weights (model-only) + embedded config + θ |
| checkpoints/entropy_model.pt | frozen entropy patcher |
| checkpoints/patcher_threshold.json | tuned threshold θ |
| requirements.txt | deps (torch, numpy, huggingface_hub) |
| BLT_LLM.md | full build spec |

Training

  • Corpus: roneneldan/TinyStories (TinyStoriesV2-GPT4 train split, ~2.2B bytes).
  • 20,000 steps, effective batch 16 (8 × grad-accum 2), seq-len 512.
  • AdamW β=(0.9, 0.95), wd 0.1, grad clip 1.0; cosine LR 4e-4, 300-step warmup.
  • ≈2.4 h on an RTX 3050 6 GB. Metric: bits-per-byte (no fixed tokenizer → no perplexity).
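
The schedule and data budget listed above can be worked through in a few lines. This is a sketch, not the training script: the function name `lr_at` is hypothetical, the warmup is assumed to be linear, the cosine is assumed to decay to `min_lr=0.0` (the card gives no floor), and seq-len 512 is assumed to be counted in bytes.

```python
import math

def lr_at(step: int, peak_lr: float = 4e-4, warmup: int = 300,
          total: int = 20_000, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr (assumed)."""
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Data budget: steps x effective batch x sequence length.
steps, effective_batch, seq_len = 20_000, 8 * 2, 512
bytes_seen = steps * effective_batch * seq_len  # 163_840_000

# Share of the ~2.2B-byte corpus touched in one run (roughly 7%).
corpus_fraction = bytes_seen / 2.2e9
```

Under these assumptions the run sees well under one pass over the corpus, so the reported BPB comes from a fraction of an epoch.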

License

MIT. BLT architecture is from Meta FAIR's paper (linked above); this is an independent from-scratch reimplementation.
