# nanoGPT Tutorial

A step-by-step implementation of a tiny GPT model from scratch in pure PyTorch.

## What is this?

This repository contains a complete, tutorial-style implementation of a small GPT (Generative Pre-trained Transformer) trained on the tiny Shakespeare dataset. Every line of code is commented to explain **what** it does and **why**.

## Files

| File | Purpose |
|------|---------|
| `model.py` | The full GPT architecture: CausalSelfAttention, MLP, Block, GPT |
| `prepare.py` | Data preparation: character-level tokenization, train/val split |
| `train.py` | Training loop with AdamW, cosine LR schedule, and generation |
| `input.txt` | The tiny Shakespeare dataset (~1.1M characters, 65 unique chars) |
| `data.pt` | Preprocessed tensors (generated by `prepare.py`) |
| `best.pt` | Best model checkpoint (generated by `train.py`) |
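
Character-level data preparation is simple enough to fit in a few lines. The sketch below shows the idea behind `prepare.py`; the variable names, the 90/10 split, and the exact contents saved to `data.pt` are illustrative assumptions, not necessarily the repository's actual code.

```python
# Illustrative sketch of character-level data prep (names and split ratio are assumptions).
import torch

with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Character-level vocabulary: every unique character becomes one token id.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> id
itos = {i: ch for ch, i in stoi.items()}       # id -> char

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

# Encode the whole corpus and split into train/val (ratio shown here is illustrative).
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

torch.save({"train": train_data, "val": val_data, "stoi": stoi, "itos": itos}, "data.pt")
```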

## Model Architecture

```
GPT(
  wte (Embedding): vocab_size -> n_embd   (token embeddings)
  wpe (Embedding): block_size -> n_embd   (position embeddings)
  h (6x Block):
    ln_1 (LayerNorm)
    attn (CausalSelfAttention: multi-head self-attention with causal mask)
    ln_2 (LayerNorm)
    mlp  (MLP: expand 4x -> GELU -> project back)
  ln_f (LayerNorm)
  lm_head (Linear): n_embd -> vocab_size  (next-token prediction)
)
```

**Key design choices:**

- **Character-level vocabulary** — no tokenizer library needed
- **Pre-LayerNorm** residuals — standard in modern transformers (see the sketch below)
- **Weight tying** — shared weights between the input embedding and the output projection
- **Causal (autoregressive) attention** — each position can only attend to past tokens
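
To make the pre-LN wiring concrete, here is a minimal PyTorch sketch of one transformer block with the dimensions used in this repo (384-dim embeddings, 6 heads). It mirrors the structure above but is illustrative rather than a copy of `model.py`; in particular, using `F.scaled_dot_product_attention` with `is_causal=True` for the causal mask is an assumption about implementation details.

```python
# Illustrative pre-LN transformer block (not the repository's exact model.py).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd=384, n_head=6):
        super().__init__()
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)    # project to queries, keys, values
        self.proj = nn.Linear(n_embd, n_embd)       # output projection

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # reshape each to (B, n_head, T, head_dim)
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) for t in (q, k, v))
        # is_causal=True applies the autoregressive mask: position t sees only positions <= t
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).contiguous().view(B, T, C))

class MLP(nn.Module):
    def __init__(self, n_embd=384):
        super().__init__()
        self.fc = nn.Linear(n_embd, 4 * n_embd)     # expand 4x
        self.proj = nn.Linear(4 * n_embd, n_embd)   # project back

    def forward(self, x):
        return self.proj(F.gelu(self.fc(x)))

class Block(nn.Module):
    def __init__(self, n_embd=384, n_head=6):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = MLP(n_embd)

    def forward(self, x):
        # Pre-LN: normalize *before* each sub-layer, then add the residual.
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
```

With this structure, weight tying in the full model typically comes down to a single assignment such as `lm_head.weight = wte.weight`, so the same matrix maps characters into and out of the embedding space.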

## How to Run

```bash
# 1. Prepare data
python prepare.py

# 2. Train (a GPU is recommended for speed; CPU works too)
python train.py

# 3. The model will print generated Shakespeare-style text at the end!
```
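
Step 3 works because `train.py` ends with autoregressive sampling from the trained model. Conceptually the sampling loop looks like the sketch below; the helper is illustrative, and it assumes the model returns logits of shape `(B, T, vocab_size)` (if it returns a `(logits, loss)` pair, unpack accordingly).

```python
# Sketch of autoregressive character sampling (illustrative, not the repo's exact helper).
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size=256, temperature=1.0):
    """idx is a (B, T) tensor of token ids; returns it extended by max_new_tokens."""
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]             # crop to the context length
        logits = model(idx_cond)                    # (B, T, vocab_size), by assumption
        logits = logits[:, -1, :] / temperature     # keep only the last position
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample one token id
        idx = torch.cat([idx, next_id], dim=1)      # append and continue
    return idx
```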

## Training Details

| Hyperparameter | Value |
|---------------|-------|
| Layers | 6 |
| Heads | 6 |
| Embedding dim | 384 |
| Context length | 256 |
| Batch size | 64 |
| Training steps | 5,000 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Learning rate | 1e-3 (cosine decay to 1e-4) |
| Warmup | 200 steps |
| Gradient clip | 1.0 |
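
The optimizer and schedule rows above translate almost directly into code. The sketch below shows how a warmup-plus-cosine schedule with these numbers is typically wired up; the loop structure and the stand-in model/loss are illustrative, not how `train.py` is necessarily organized.

```python
# Warmup + cosine LR schedule with the table's hyperparameters (loop structure is a sketch).
import math
import torch
import torch.nn as nn

max_lr, min_lr = 1e-3, 1e-4
warmup_steps, max_steps = 200, 5000

def lr_at(step: int) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

model = nn.Linear(8, 8)  # stand-in for the GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr, betas=(0.9, 0.95))

for step in range(max_steps):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)                        # set this step's learning rate
    loss = model(torch.randn(4, 8)).pow(2).mean()        # stand-in for the LM loss
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clip = 1.0
    optimizer.step()
```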

## Acknowledgments

Based on Andrej Karpathy's legendary [build-nanogpt](https://github.com/karpathy/build-nanogpt) and [nanoGPT](https://github.com/karpathy/nanoGPT) repositories.