GPT-2 From Scratch (124M Parameters)
A GPT-2 Small model (124M parameters) trained entirely from scratch on a single NVIDIA Tesla T4 GPU.
Model Details
| Property | Value |
|---|---|
| Architecture | GPT-2 Small (Pre-LayerNorm) |
| Parameters | ~124M |
| Layers | 12 |
| Attention Heads | 12 |
| Hidden Dimension | 768 |
| Max Sequence Length | 1024 |
| Vocabulary | GPT-2 BPE (50,257 tokens) |
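The ~124M figure can be checked from the values in the table. The sketch below is a back-of-the-envelope count, not the project's own code. It assumes the standard GPT-2 layout: biases on every linear layer, a 4x MLP expansion, learned positional embeddings, and a final LayerNorm. Because the output head shares its weights with the token embedding, it adds no parameters.

```python
# Sketch: count GPT-2 Small parameters from the table values.
# Assumptions (not stated in the card): biases on every linear layer,
# a 4x MLP expansion, learned positional embeddings, a final LayerNorm.
n_layer, n_embd, block_size, vocab_size = 12, 768, 1024, 50257

def linear(n_in, n_out):
    """Weights plus bias of a fully connected layer."""
    return n_in * n_out + n_out

def layer_norm(n):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n

# Token and positional embedding tables (the tied output head is free).
embeddings = vocab_size * n_embd + block_size * n_embd
# Fused QKV projection plus output projection.
attention = linear(n_embd, 3 * n_embd) + linear(n_embd, n_embd)
# Two-layer feed-forward network.
mlp = linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd)
block = layer_norm(n_embd) + attention + layer_norm(n_embd) + mlp

total = embeddings + n_layer * block + layer_norm(n_embd)
print(f"{total:,}")  # 124,439,808
```

Roughly 39M of these parameters sit in the embedding tables and about 85M in the 12 transformer blocks.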
Training
Pre-training
- Dataset: OpenWebText (~2B tokens)
- Hardware: Single NVIDIA Tesla T4 (16GB VRAM)
- Precision: Mixed precision (FP16)
- Optimizer: AdamW (lr=6e-4, cosine decay)
- Batch Size: 64 (micro-batch of 8 x 8 gradient accumulation steps)
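A cosine-decay schedule with the peak learning rate listed above can be sketched as follows. Only the peak value (6e-4) comes from this card. The warmup length, the minimum learning rate, and the total step count are illustrative assumptions, and the name `lr_at` is hypothetical.

```python
import math

MAX_LR = 6e-4        # peak learning rate, from the card
MIN_LR = 6e-5        # assumption: decay floor at 10% of the peak
WARMUP_STEPS = 500   # assumption: linear warmup length
MAX_STEPS = 30_000   # assumption: step at which decay ends

def lr_at(step):
    """Learning rate for a given optimizer step (hypothetical helper)."""
    # Linear warmup from near zero up to MAX_LR.
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    # After the schedule ends, hold the floor value.
    if step >= MAX_STEPS:
        return MIN_LR
    # Cosine decay from MAX_LR down to MIN_LR.
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + coeff * (MAX_LR - MIN_LR)

for step in (0, WARMUP_STEPS, MAX_STEPS // 2, MAX_STEPS):
    print(step, f"{lr_at(step):.2e}")
```

In a training loop with gradient accumulation, a schedule like this would typically be evaluated once per optimizer step (that is, once per 8 micro-batches here) and written into each of the optimizer's parameter groups.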
Fine-tuning
- Dataset: OpenAssistant/oasst1 (multi-turn conversations)
- Objective: Causal LM with masked loss (only on assistant responses)
- Final Val Loss: 2.730
- Final Step: 21576
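One common way to restrict the loss to assistant responses is to replace the labels of all other tokens with an ignore index, which the cross-entropy loss then skips. The card does not say how the masking was implemented, so the following is a minimal sketch under that assumption. The helper `build_labels` and the toy token ids are hypothetical.

```python
# -100 is the default ignore_index of torch.nn.functional.cross_entropy.
IGNORE_INDEX = -100

def build_labels(token_ids, is_assistant):
    """Return next-token labels, keeping only targets that are assistant tokens.

    Position i of the result is the label for predicting token_ids[i + 1]
    from token_ids[: i + 1], so the result is one element shorter than the input.
    """
    labels = []
    for pos in range(len(token_ids) - 1):
        target = token_ids[pos + 1]
        labels.append(target if is_assistant[pos + 1] else IGNORE_INDEX)
    return labels

# Toy example with made-up ids: three prompt tokens, then two assistant tokens.
token_ids = [11, 12, 13, 21, 22]
is_assistant = [False, False, False, True, True]
print(build_labels(token_ids, is_assistant))  # [-100, -100, 21, 22]
```

The model inputs would be `token_ids[:-1]`. With labels built this way, prompt tokens still provide context through attention but contribute nothing to the gradient.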
Usage
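# Note: this model uses a custom architecture.
# Load it with the original training code (model.py, config.py) for best results.
import torch
from model import GPT2
from config import ModelConfig

# Build the model with the default configuration, then load the trained weights.
config = ModelConfig()
model = GPT2(config)
checkpoint = torch.load("pytorch_model.bin", map_location="cpu")
model.load_state_dict(checkpoint)
model.eval()  # switch to inference mode (disables dropout)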
Built From Scratch
Every component was implemented from scratch in PyTorch:
- Multi-head causal self-attention
- Feed-forward networks with GELU
- Pre-LayerNorm transformer blocks
- Positional and token embeddings
- Weight tying between embedding and output head
- Full training loop with mixed precision, gradient accumulation, and checkpointing
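To make the first three items concrete, here is a minimal sketch of a Pre-LayerNorm block with causal multi-head attention and a GELU feed-forward network. It is not the project's own `model.py`. The class names, the fused QKV projection, the 4x MLP width, and the absence of dropout are assumptions made for illustration.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd=768, n_head=12, block_size=1024):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # fused query/key/value projection
        self.proj = nn.Linear(n_embd, n_embd)      # output projection
        # Lower-triangular mask: position t may attend to positions <= t only.
        mask = torch.tril(torch.ones(block_size, block_size, dtype=torch.bool))
        self.register_buffer("mask", mask, persistent=False)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # (B, T, C) -> (B, n_head, T, head_dim)
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
                   for t in (q, k, v))
        att = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
        att = att.masked_fill(~self.mask[:T, :T], float("-inf"))
        att = F.softmax(att, dim=-1)
        # Merge heads back into (B, T, C).
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)

class Block(nn.Module):
    def __init__(self, n_embd=768, n_head=12, block_size=1024):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head, block_size)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        # Pre-LayerNorm: normalize before each sub-layer, then add the residual.
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x

x = torch.randn(2, 8, 768)   # (batch, sequence, hidden)
print(Block()(x).shape)      # torch.Size([2, 8, 768])
```

Normalizing before each sub-layer leaves the residual path as a plain sum, which tends to make deep stacks easier to train than the post-LayerNorm arrangement.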
Limitations
This is a learning project. At 124M parameters and ~2B pre-training tokens, the model is small and trained on far less data than production models. It can hold basic conversations but will not match the quality of larger models.
License
MIT