---
license: apache-2.0
language:
  - en
library_name: pytorch
tags:
  - causal-lm
  - pretrained-from-scratch
  - small-lm
  - gpt
datasets:
  - roneneldan/TinyStories
  - roneneldan/TinyStoriesInstruct
  - wikimedia/wikipedia
  - nampdn-ai/tiny-textbooks
pipeline_tag: text-generation
---

# tiny-38m

A 37.8M-parameter decoder-only transformer pretrained from scratch on a mix of small, simple-vocabulary corpora. Pure PyTorch, single GPU, no HF Trainer, no PEFT, no distillation.

An educational artifact: it demonstrates that the modern transformer recipe (RMSNorm + RoPE + SwiGLU + SDPA) reaches coherent output at small scale on a single GPU.
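For orientation, here is a hedged sketch (not this repo's `model.py`) of what one pre-norm block in that recipe looks like in plain PyTorch. The RoPE rotation of q/k is elided for brevity, and the SwiGLU hidden width of 2048 is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Scale by the reciprocal RMS of the features (no mean subtraction)
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class CausalSelfAttention(nn.Module):
    def __init__(self, dim, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, C) -> (B, heads, T, head_dim); RoPE would rotate q and k here
        q, k, v = (t.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
                   for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal SDPA
        return self.proj(y.transpose(1, 2).reshape(B, T, C))

class SwiGLU(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class Block(nn.Module):
    def __init__(self, dim=512, n_heads=8, mlp_hidden=2048):
        super().__init__()
        self.norm1, self.norm2 = RMSNorm(dim), RMSNorm(dim)
        self.attn = CausalSelfAttention(dim, n_heads)
        self.mlp = SwiGLU(dim, mlp_hidden)

    def forward(self, x):
        # Pre-norm residual wiring: normalize, transform, add back
        x = x + self.attn(self.norm1(x))
        return x + self.mlp(self.norm2(x))
```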

## Quick start

```python
import json, sys, torch
from pathlib import Path
from huggingface_hub import snapshot_download
from tokenizers import Tokenizer
from safetensors.torch import load_file

# Download the repo and make its inference modules importable
local = snapshot_download("darthcrawl/tiny-38m")
sys.path.insert(0, local)
from config import ModelConfig
from model import GPT

# Build the config from config.json, keeping only known dataclass fields
cfg_dict = json.loads((Path(local) / "config.json").read_text())
valid = set(ModelConfig.__dataclass_fields__)
cfg = ModelConfig(**{k: v for k, v in cfg_dict.items() if k in valid})

# strict=False: the tied lm_head weight is stored once (see note below)
model = GPT(cfg).eval()
model.load_state_dict(load_file(f"{local}/model.safetensors"), strict=False)

tok = Tokenizer.from_file(f"{local}/tokenizer.json")
eot = tok.token_to_id("<|endoftext|>")

ids = torch.tensor([tok.encode("Once upon a time, there was a small dragon").ids], dtype=torch.long)
out = model.generate(ids, max_new_tokens=200, temperature=0.8, top_k=200, eos_id=eot)
print(tok.decode(out[0].tolist()))
```

`strict=False` is required because the embeddings are tied (`lm_head.weight = tok_emb.weight`), so the shared tensor is stored only once in `model.safetensors`.
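A minimal illustration of why one key goes missing (hypothetical module names, not this repo's):

```python
import torch.nn as nn

emb = nn.Embedding(8192, 512)
head = nn.Linear(512, 8192, bias=False)
head.weight = emb.weight  # tied: two parameter names, one tensor

# safetensors refuses to serialize aliased storage, so export scripts keep a
# single copy (e.g. the embedding's); loading then reports the head's weight
# as a missing key unless strict=False is passed.
assert head.weight.data_ptr() == emb.weight.data_ptr()
```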

## Architecture

| Property | Value |
|---|---|
| Type | Decoder-only transformer |
| Parameters | 37.8M |
| Layers | 8 |
| Hidden dim | 512 |
| Attention heads | 8 |
| Context length | 1024 |
| Vocab size | 8192 |
| Position encoding | RoPE |
| Norm | RMSNorm (pre-norm) |
| MLP | SwiGLU |
| Attention | PyTorch SDPA, causal |
| Embedding tying | Yes |
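As a sanity check, the table's numbers roughly reproduce the stated parameter count. The SwiGLU hidden width (2048, i.e. 4× the hidden dim) is not listed in the card and is an assumption chosen so the arithmetic lands:

```python
vocab, dim, layers, mlp_hidden = 8192, 512, 8, 2048  # mlp_hidden is assumed

emb   = vocab * dim             # token embedding, tied with lm_head (counted once)
attn  = 4 * dim * dim           # q, k, v and output projections per layer
mlp   = 3 * dim * mlp_hidden    # SwiGLU: gate, up, down projections per layer
norms = layers * 2 * dim + dim  # two RMSNorm gains per block + final norm

total = emb + layers * (attn + mlp) + norms
print(f"{total / 1e6:.1f}M parameters")  # -> 37.8M
```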

## Training

| Setting | Value |
|---|---|
| Source mix | `tinystories:60,tinystories_instruct:15,simple_wiki:15,tiny_textbooks:10` |
| Total train tokens | 477,521,740 (~0.48B) |
| Best ckpt step | 19,500 |
| Best val loss | 1.8847 |
| Optimizer | AdamW (β = (0.9, 0.95), weight decay 0.1) |
| Peak LR | 6e-4 |
| LR schedule | Cosine, 200-step warmup |
| Batch size | 32 (× 4 gradient accumulation, effective 128) |
| Precision | bfloat16 (AMP) |
| Hardware | Single GPU |

The mix format is `name:weight,...`; `meta.txt` in this repo is the canonical record of the run.
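The LR schedule is straightforward to reconstruct. In the sketch below, only the peak LR and the 200-step warmup come from the card; the total step count (~20k, suggested by the best checkpoint at step 19,500) and the zero floor are assumptions:

```python
import math

def lr_at(step, peak=6e-4, warmup=200, total_steps=20_000, floor=0.0):
    # Linear warmup to the peak, then cosine decay to the floor.
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))
```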

## Tokenizer

Byte-level BPE trained on the same source mix. Shipped as a single `tokenizer.json` (Hugging Face `tokenizers` format) with a vocabulary of 8192 tokens. Special tokens: `<|endoftext|>` (eot/eos) and `<|pad|>`.
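One practical consequence of the small vocabulary, as a hedged illustration (exact splits depend on the trained merges):

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file(f"{local}/tokenizer.json")  # `local` as in Quick start

# In-register text should tokenize into few, mostly whole-word pieces...
print(tok.encode("Once upon a time, there was a small dragon").tokens)
# ...while out-of-register words likely fragment into many byte-level pieces.
print(tok.encode("photosynthesis and quasars").tokens)
```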

## What it can do

- Continue toddler-level English narratives in the TinyStories register.
- Produce short factual-sounding text in the Simple English Wikipedia register.
- Follow basic prompt → story patterns from TinyStoriesInstruct.

## What it can't do

- General-knowledge QA, code, math, multi-turn chat, reasoning, or instructions beyond what appeared in the training mix.
- Out-of-distribution vocabulary: the vocab is small and the corpus is intentionally narrow.
- Reliable factuality: even on simple-wiki-style prompts it will confabulate.

## Intended use

Education, replication, ablations, and a baseline for from-scratch pretraining experiments. Not intended for production use.

## Limitations and bias

The model inherits whatever biases are present in the synthetic TinyStories corpora and Simple English Wikipedia. Outputs are not safe for any user-facing application: there is no safety alignment, no instruction tuning, and no RLHF.

## Reproducibility

Inference code (`model.py`, `config.py`, `sample.py`) ships in this repo. The full training pipeline (tokenizer training, data prep, training loop, source mixing) lives in the upstream project.

## License

Apache 2.0 for code and weights. Training data licenses follow their respective sources (see the datasets listed in the metadata).