GPT-2 From Scratch (124M Parameters)

A GPT-2 Small model (124M parameters) trained entirely from scratch on a single NVIDIA Tesla T4 GPU.

Model Details

Property	Value
Architecture	GPT-2 Small (Pre-LayerNorm)
Parameters	~124M
Layers	12
Attention Heads	12
Hidden Dimension	768
Max Sequence Length	1024
Vocabulary	GPT-2 BPE (50,257 tokens)
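
The ~124M figure can be checked from the table alone. The sketch below is a back-of-the-envelope count, not the project's code: it assumes the standard GPT-2 layout (4x MLP expansion, bias terms on every linear layer, a final LayerNorm after the last block) and a tied output head, none of which the table states explicitly. All function names are hypothetical.

```python
# Parameter count for the configuration in the table above.
# Assumptions (not stated in the table): 4x MLP expansion, bias terms on
# every linear layer and LayerNorm, and a final LayerNorm after the last block.
VOCAB_SIZE = 50_257
MAX_SEQ_LEN = 1_024
N_LAYERS = 12
D_MODEL = 768
D_MLP = 4 * D_MODEL  # assumed expansion factor


def linear_params(d_in: int, d_out: int) -> int:
    """Weights plus bias of one fully connected layer."""
    return d_in * d_out + d_out


def layernorm_params(d: int) -> int:
    """Scale and shift vectors of one LayerNorm."""
    return 2 * d


def block_params() -> int:
    """One transformer block: attention, MLP, and two LayerNorms."""
    attention = linear_params(D_MODEL, 3 * D_MODEL) + linear_params(D_MODEL, D_MODEL)
    mlp = linear_params(D_MODEL, D_MLP) + linear_params(D_MLP, D_MODEL)
    return attention + mlp + 2 * layernorm_params(D_MODEL)


def total_params() -> int:
    embeddings = VOCAB_SIZE * D_MODEL + MAX_SEQ_LEN * D_MODEL
    # The output head shares the token embedding matrix, so it adds nothing.
    return embeddings + N_LAYERS * block_params() + layernorm_params(D_MODEL)


print(f"{total_params():,}")  # 124,439,808
```

The number of attention heads does not appear in the count: splitting the 768-dimensional projection into 12 heads of 64 dimensions changes how the weights are used, not how many there are.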

Training

Pre-training

Dataset: OpenWebText (~2B tokens)
Hardware: Single NVIDIA Tesla T4 (16GB VRAM)
Precision: Mixed precision (FP16)
Optimizer: AdamW (lr=6e-4, cosine decay)
Batch Size: 64 sequences per optimizer step (micro-batch of 8 x 8 gradient accumulation steps)
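
To make the batch and learning-rate settings concrete, here is a minimal sketch of the arithmetic and of a cosine schedule. Only the peak learning rate, the micro-batch size, and the accumulation count come from the list above. The warmup length, total step count, and learning-rate floor are hypothetical placeholders, and the token count assumes every sequence fills the 1024-token context.

```python
import math

MICRO_BATCH = 8        # sequences per forward/backward pass (from the list above)
GRAD_ACCUM_STEPS = 8   # micro-batches per optimizer step (from the list above)
SEQ_LEN = 1024         # assumes every sequence is packed to the full context
MAX_LR = 6e-4          # peak learning rate (from the list above)

# Hypothetical values: warmup length, total steps, and LR floor are not
# given in this document.
WARMUP_STEPS = 1_000
TOTAL_STEPS = 30_000
MIN_LR = 6e-5


def tokens_per_optimizer_step() -> int:
    """Tokens consumed before each weight update."""
    return MICRO_BATCH * GRAD_ACCUM_STEPS * SEQ_LEN


def cosine_lr(step: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= TOTAL_STEPS:
        return MIN_LR
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


print(tokens_per_optimizer_step())  # 65536
```

Gradient accumulation is what makes this batch size fit on a 16GB card: only 8 sequences are held in memory at a time, and the gradients of 8 such passes are summed before the optimizer steps.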

Fine-tuning

Dataset: OpenAssistant/oasst1 (multi-turn conversations)
Objective: Causal LM with masked loss (only on assistant responses)
Final Validation Loss: 2.7300
Final Step: 21576
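
The masked objective can be illustrated with a small sketch. It is written in plain Python so it runs without dependencies and is not the project's training code. The function names and the per-token `is_assistant` flags are hypothetical. The sentinel -100 is the default `ignore_index` of `torch.nn.functional.cross_entropy`, so targets built this way could be passed to that function directly.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of torch.nn.functional.cross_entropy


def build_targets(token_ids, is_assistant):
    """Shift tokens by one and hide every target that is not an assistant token.

    token_ids:    list[int], one tokenized conversation
    is_assistant: list[bool], True where the token belongs to an assistant reply
    """
    inputs = token_ids[:-1]
    targets = [
        tok if from_assistant else IGNORE_INDEX
        for tok, from_assistant in zip(token_ids[1:], is_assistant[1:])
    ]
    return inputs, targets


def masked_cross_entropy(log_probs, targets):
    """Mean negative log-likelihood over the positions that are not masked.

    log_probs: one list of vocabulary log-probabilities per position
    """
    losses = [-row[t] for row, t in zip(log_probs, targets) if t != IGNORE_INDEX]
    return sum(losses) / len(losses) if losses else 0.0


# Toy conversation over a 3-token vocabulary: two user tokens, then two
# assistant tokens. Only the assistant tokens remain as targets.
inputs, targets = build_targets([0, 1, 2, 1], [False, False, True, True])
uniform = [[math.log(1 / 3)] * 3 for _ in targets]  # a model that knows nothing
loss = masked_cross_entropy(uniform, targets)  # ln(3), about 1.0986
```

The model still reads the user turns as context. They simply contribute no gradient, so training capacity is spent on producing replies rather than on imitating prompts.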

Usage

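# Note: This model uses a custom architecture.
# Load it with the original training code (the model and config modules) for best results.
import torch
from model import GPT2
from config import ModelConfig

config = ModelConfig()  # defaults are assumed to match the Model Details table
model = GPT2(config)
# Load the weights on CPU first; move the model to a GPU afterwards if one is available.
checkpoint = torch.load("pytorch_model.bin", map_location="cpu")
model.load_state_dict(checkpoint)
model.eval()  # inference mode: disables dropout, if the model uses any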

Built From Scratch

Every component was implemented from scratch in PyTorch:

Multi-head causal self-attention
Feed-forward networks with GELU
Pre-LayerNorm transformer blocks
Positional and token embeddings
Weight tying between embedding and output head
Full training loop with mixed precision, gradient accumulation, checkpointing
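
As an illustration of the first item in the list, the sketch below implements single-head causal self-attention. It is written in plain Python on nested lists so that it runs without dependencies. It is not the project's PyTorch implementation, and the function name is hypothetical. In the 12-head model described above, each head would run this computation on its own 64-dimensional slice (768 / 12) of the projected queries, keys, and values.

```python
import math


def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: lists of T vectors (lists of floats); q and k share dimension d.
    Position t attends only to positions 0..t.
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    outputs = []
    for t in range(len(q)):
        # Causal mask: scores are computed only for keys at positions <= t.
        scores = [scale * sum(a * b for a, b in zip(q[t], k[s])) for s in range(t + 1)]
        # Numerically stable softmax over the visible positions.
        peak = max(scores)
        exps = [math.exp(x - peak) for x in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the visible value vectors.
        outputs.append([
            sum(w * v[s][j] for s, w in enumerate(weights))
            for j in range(len(v[0]))
        ])
    return outputs
```

Because position t never sees keys or values beyond t, changing a later token leaves every earlier output untouched. That property is what allows the model to be trained on all positions of a sequence in parallel and then generate one token at a time.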

Limitations

This is a learning project. The model is small (124M parameters) and was pre-trained on roughly 2B tokens, far less data than production models see. It can hold basic conversations but will not match the quality of larger models.

License

MIT
