yat343 committed
Commit 82cb4ef · verified · 1 parent: a77e2b2

Upload README.md

Files changed (1): README.md added (+71 lines)

# nanoGPT Tutorial

A step-by-step implementation of a tiny GPT model from scratch in pure PyTorch.

## What is this?

This repository contains a complete, tutorial-style implementation of a small GPT (Generative Pre-trained Transformer) trained on tiny Shakespeare. Every line of code is commented to explain **what** it does and **why**.

## Files

| File | Purpose |
|------|---------|
| `model.py` | The full GPT architecture: CausalSelfAttention, MLP, Block, GPT |
| `prepare.py` | Data preparation: character-level tokenization, train/val split |
| `train.py` | Training loop with AdamW, cosine LR schedule, and generation |
| `input.txt` | The tiny Shakespeare dataset (~1.1M characters, 65 unique chars) |
| `data.pt` | Preprocessed tensors (generated by `prepare.py`) |
| `best.pt` | Best model checkpoint (generated by `train.py`) |
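
As a rough sketch, the character-level pipeline in `prepare.py` looks something like the following (the file names match the table above, but the 90/10 split ratio is an assumption, not something this README states):

```python
import torch

# Build the character-level vocabulary directly from the corpus.
text = open("input.txt", encoding="utf-8").read()
chars = sorted(set(text))                      # 65 unique characters for tiny Shakespeare
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> char

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

# Encode the whole corpus and split into train/val (90/10 is an assumption).
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
torch.save({"train": data[:n], "val": data[n:], "stoi": stoi, "itos": itos}, "data.pt")
```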

## Model Architecture

```
GPT(
  wte (Embedding): vocab_size -> n_embd (token embeddings)
  wpe (Embedding): block_size -> n_embd (position embeddings)
  h (6x Block):
    ln_1 (LayerNorm)
    attn (CausalSelfAttention: multi-head self-attention with causal mask)
    ln_2 (LayerNorm)
    mlp (MLP: expand 4x -> GELU -> project back)
  ln_f (LayerNorm)
  lm_head (Linear): n_embd -> vocab_size (next-token prediction)
)
```
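
In PyTorch, that diagram corresponds to a forward pass along these lines (a sketch reusing the module names from the listing; the actual `model.py` may differ in details):

```python
import torch

def forward(self, idx):                        # idx: (B, T) integer token ids
    B, T = idx.shape
    pos = torch.arange(T, device=idx.device)   # positions 0..T-1
    x = self.wte(idx) + self.wpe(pos)          # token + position embeddings, (B, T, n_embd)
    for block in self.h:                       # 6 pre-LN transformer blocks
        x = block(x)
    x = self.ln_f(x)                           # final LayerNorm
    return self.lm_head(x)                     # logits, (B, T, vocab_size)
```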

**Key design choices:**

- **Character-level vocabulary**: no tokenizer library needed
- **Pre-LayerNorm residuals**: standard in modern transformers
- **Weight tying**: shared weights between the input embedding and the output projection
- **Causal (autoregressive) attention**: each position attends only to past tokens (see the sketch below)
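
A minimal illustration of the pre-LN residual pattern, weight tying, and the causal mask (simplified for clarity, not the exact `model.py` code):

```python
import torch
import torch.nn as nn

n_embd, vocab_size, block_size = 384, 65, 256

class Block(nn.Module):
    """Pre-LN residual block: LayerNorm runs *before* each sublayer."""
    def __init__(self, attn, mlp):
        super().__init__()
        self.ln_1, self.attn = nn.LayerNorm(n_embd), attn
        self.ln_2, self.mlp = nn.LayerNorm(n_embd), mlp

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))   # residual connection around attention
        x = x + self.mlp(self.ln_2(x))    # residual connection around the MLP
        return x

# Weight tying: the output head reuses the token-embedding matrix.
wte = nn.Embedding(vocab_size, n_embd)
lm_head = nn.Linear(n_embd, vocab_size, bias=False)
lm_head.weight = wte.weight               # one tensor serves both roles

# Causal mask: position t may attend only to positions <= t.
mask = torch.tril(torch.ones(block_size, block_size))
```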

## How to Run

```bash
# 1. Prepare the data
python prepare.py

# 2. Train (a GPU is much faster, but a CPU works too)
python train.py

# 3. The script prints generated Shakespeare-style text at the end!
```
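
Generation in nanoGPT-style code is an autoregressive sampling loop; a hedged sketch of what the final step might look like (the actual function in `train.py` may sample differently):

```python
import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size=256):
    """idx: (B, T) prompt tokens; returns the prompt plus a sampled continuation."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                  # crop to the context length
        logits = model(idx_cond)                         # (B, T, vocab_size)
        probs = torch.softmax(logits[:, -1, :], dim=-1)  # distribution at the last position
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)           # append and repeat
    return idx
```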

## Training Details

| Hyperparameter | Value |
|----------------|-------|
| Layers | 6 |
| Heads | 6 |
| Embedding dim | 384 |
| Context length | 256 |
| Batch size | 64 |
| Training steps | 5,000 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Learning rate | 1e-3 (cosine decay to 1e-4) |
| Warmup | 200 steps |
| Gradient clip | 1.0 |
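
The warmup-plus-cosine schedule from the table can be sketched like this (values are taken from the table; the exact `train.py` implementation may differ):

```python
import math

max_lr, min_lr = 1e-3, 1e-4        # peak LR and cosine floor, per the table
warmup_steps, max_steps = 200, 5_000

def get_lr(step):
    if step < warmup_steps:                                   # linear warmup
        return max_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / (max_steps - warmup_steps)    # progress in [0, 1]
    coeff = 0.5 * (1.0 + math.cos(math.pi * t))               # cosine: 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)

# The optimizer and clipping with the table's settings would look like:
# optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr, betas=(0.9, 0.95))
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
```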

## Acknowledgments

Based on Andrej Karpathy's legendary [build-nanogpt](https://github.com/karpathy/build-nanogpt) and [nanoGPT](https://github.com/karpathy/nanoGPT) repositories.