# nanoGPT Tutorial

A step-by-step implementation of a tiny GPT model from scratch in pure PyTorch.

## What is this?

This repository contains a complete, tutorial-style implementation of a small GPT (Generative Pre-trained Transformer) trained on the tiny Shakespeare dataset. Every line of code is commented to explain **what** it does and **why**.

## Files

| File | Purpose |
|------|---------|
| `model.py` | The full GPT architecture: CausalSelfAttention, MLP, Block, GPT |
| `prepare.py` | Data preparation: character-level tokenization, train/val split |
| `train.py` | Training loop with AdamW, cosine LR schedule, and generation |
| `input.txt` | The tiny Shakespeare dataset (~1.1M characters, 65 unique chars) |
| `data.pt` | Preprocessed tensors (generated by `prepare.py`) |
| `best.pt` | Best model checkpoint (generated by `train.py`) |
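
Character-level data preparation is simple enough to fit in a few lines. The sketch below shows the idea behind `prepare.py`; the variable names, the 90/10 split, and the exact contents saved to `data.pt` are illustrative assumptions, not necessarily the repository's actual code.

```python
# Illustrative sketch of character-level data prep (names and split ratio are assumptions).
import torch

with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Character-level vocabulary: every unique character becomes one token id.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> id
itos = {i: ch for ch, i in stoi.items()}       # id -> char

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

# Encode the whole corpus and split into train/val (ratio shown here is illustrative).
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

torch.save({"train": train_data, "val": val_data, "stoi": stoi, "itos": itos}, "data.pt")
```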

## Model Architecture

```
GPT(
  wte (Embedding): vocab_size -> n_embd   (token embeddings)
  wpe (Embedding): block_size -> n_embd   (position embeddings)
  h (6x Block):
    ln_1 (LayerNorm)
    attn (CausalSelfAttention: multi-head self-attention with causal mask)
    ln_2 (LayerNorm)
    mlp  (MLP: expand 4x -> GELU -> project back)
  ln_f (LayerNorm)
  lm_head (Linear): n_embd -> vocab_size  (next-token prediction)
)
```

**Key design choices:**

- **Character-level vocabulary** — no tokenizer library needed
- **Pre-LayerNorm** residuals — standard in modern transformers (see the sketch below)
- **Weight tying** — shared weights between the input embedding and the output projection
- **Causal (autoregressive) attention** — each position can only attend to past tokens
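
To make the pre-LN wiring concrete, here is a minimal PyTorch sketch of one transformer block with the dimensions used in this repo (384-dim embeddings, 6 heads). It mirrors the structure above but is illustrative rather than a copy of `model.py`; in particular, using `F.scaled_dot_product_attention` with `is_causal=True` for the causal mask is an assumption about implementation details.

```python
# Illustrative pre-LN transformer block (not the repository's exact model.py).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd=384, n_head=6):
        super().__init__()
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)    # project to queries, keys, values
        self.proj = nn.Linear(n_embd, n_embd)       # output projection

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # reshape each to (B, n_head, T, head_dim)
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) for t in (q, k, v))
        # is_causal=True applies the autoregressive mask: position t sees only positions <= t
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).contiguous().view(B, T, C))

class MLP(nn.Module):
    def __init__(self, n_embd=384):
        super().__init__()
        self.fc = nn.Linear(n_embd, 4 * n_embd)     # expand 4x
        self.proj = nn.Linear(4 * n_embd, n_embd)   # project back

    def forward(self, x):
        return self.proj(F.gelu(self.fc(x)))

class Block(nn.Module):
    def __init__(self, n_embd=384, n_head=6):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = MLP(n_embd)

    def forward(self, x):
        # Pre-LN: normalize *before* each sub-layer, then add the residual.
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
```

With this structure, weight tying in the full model typically comes down to a single assignment such as `lm_head.weight = wte.weight`, so the same matrix maps characters into and out of the embedding space.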

## How to Run

```bash
# 1. Prepare data
python prepare.py

# 2. Train (a GPU is recommended for speed; CPU works too)
python train.py

# 3. The model will print generated Shakespeare-style text at the end!
```
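
Step 3 works because `train.py` ends with autoregressive sampling from the trained model. Conceptually the sampling loop looks like the sketch below; the helper is illustrative, and it assumes the model returns logits of shape `(B, T, vocab_size)` (if it returns a `(logits, loss)` pair, unpack accordingly).

```python
# Sketch of autoregressive character sampling (illustrative, not the repo's exact helper).
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size=256, temperature=1.0):
    """idx is a (B, T) tensor of token ids; returns it extended by max_new_tokens."""
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]             # crop to the context length
        logits = model(idx_cond)                    # (B, T, vocab_size), by assumption
        logits = logits[:, -1, :] / temperature     # keep only the last position
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample one token id
        idx = torch.cat([idx, next_id], dim=1)      # append and continue
    return idx
```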

## Training Details

| Hyperparameter | Value |
|---------------|-------|
| Layers | 6 |
| Heads | 6 |
| Embedding dim | 384 |
| Context length | 256 |
| Batch size | 64 |
| Training steps | 5,000 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Learning rate | 1e-3 (cosine decay to 1e-4) |
| Warmup | 200 steps |
| Gradient clip | 1.0 |
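
The optimizer and schedule rows above translate almost directly into code. The sketch below shows how a warmup-plus-cosine schedule with these numbers is typically wired up; the loop structure and the stand-in model/loss are illustrative, not how `train.py` is necessarily organized.

```python
# Warmup + cosine LR schedule with the table's hyperparameters (loop structure is a sketch).
import math
import torch
import torch.nn as nn

max_lr, min_lr = 1e-3, 1e-4
warmup_steps, max_steps = 200, 5000

def lr_at(step: int) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

model = nn.Linear(8, 8)  # stand-in for the GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr, betas=(0.9, 0.95))

for step in range(max_steps):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)                        # set this step's learning rate
    loss = model(torch.randn(4, 8)).pow(2).mean()        # stand-in for the LM loss
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clip = 1.0
    optimizer.step()
```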

## Acknowledgments

Based on Andrej Karpathy's legendary [build-nanogpt](https://github.com/karpathy/build-nanogpt) and [nanoGPT](https://github.com/karpathy/nanoGPT) repositories.