GPT-2 From Scratch (124M Parameters)

A GPT-2 Small model (124M parameters) trained entirely from scratch on a single NVIDIA Tesla T4 GPU.

Model Details

Property             Value
Architecture         GPT-2 Small (Pre-LayerNorm)
Parameters           ~124M
Layers               12
Attention Heads      12
Hidden Dimension     768
Max Sequence Length  1024
Vocabulary           GPT-2 BPE (50,257 tokens)
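
As a sanity check on the ~124M figure, the parameter count can be reproduced from the values in the table. This is a rough sketch, not the project's own code: it assumes biases on every linear layer, learned positional embeddings, a 4x feed-forward expansion, and an output head that shares the token-embedding matrix. The function name `gpt2_param_count` is made up for illustration.

```python
def gpt2_param_count(vocab_size=50257, n_ctx=1024, d_model=768, n_layer=12, ffn_mult=4):
    """Count trainable parameters for a GPT-2 style decoder (assumptions in the lead-in)."""
    token_emb = vocab_size * d_model      # also used as the output head (weight tying)
    pos_emb = n_ctx * d_model             # learned positional embeddings (assumed)
    layer_norm = 2 * d_model              # scale + shift
    # Attention: fused QKV projection plus output projection, both with biases.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Feed-forward: expand to ffn_mult * d_model, then project back.
    d_ff = ffn_mult * d_model
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    per_layer = 2 * layer_norm + attn + mlp
    # The trailing layer_norm is the final LayerNorm before the output head.
    return token_emb + pos_emb + n_layer * per_layer + layer_norm

total = gpt2_param_count()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")  # 124,439,808 parameters (~124M)
```

The number of attention heads does not appear in the formula because splitting the 768-dimensional projection into 12 heads changes the tensor shapes, not the number of weights.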

Training

Pre-training

  • Dataset: OpenWebText (~2B tokens)
  • Hardware: Single NVIDIA Tesla T4 (16GB VRAM)
  • Precision: Mixed FP16
  • Optimizer: AdamW (lr=6e-4, cosine decay)
  • Batch Size: 64 (micro-batch of 8 x 8 gradient accumulation steps)
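
The schedule and batch arithmetic above can be sketched as follows. Only the peak learning rate (6e-4), the cosine shape, and the 8 x 8 batch layout come from this card. The warmup length, the floor `min_lr`, and the total step count are placeholder assumptions, and `cosine_lr` is a hypothetical helper rather than the function used in training.

```python
import math

def cosine_lr(step, max_steps, max_lr=6e-4, min_lr=6e-5, warmup_steps=500):
    """Linear warmup to max_lr, then cosine decay down to min_lr.

    min_lr and warmup_steps are illustrative assumptions, not values from the card.
    """
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Tokens per optimizer step, assuming every sequence is a full 1024-token window:
# 8 sequences per micro-batch x 8 accumulation steps x 1024 tokens.
tokens_per_step = 8 * 8 * 1024

for step in (0, 499, 500, 10_000, 20_000):
    # max_steps=20_000 is a placeholder, not the actual run length.
    print(step, f"{cosine_lr(step, max_steps=20_000):.2e}")
```

Under the full-window assumption, each optimizer step consumes 65,536 tokens, so one pass over roughly 2B tokens would take on the order of 30,000 steps.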

Fine-tuning

  • Dataset: OpenAssistant/oasst1 (multi-turn conversations)
  • Objective: Causal LM with masked loss (only on assistant responses)
  • Final Val Loss: 2.730
  • Final Step: 21576
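
The masked-loss objective can be illustrated with a small, dependency-free sketch. It is one plausible way to restrict the loss to assistant tokens, not necessarily how the training code does it. The helper names `build_labels` and `masked_cross_entropy` and the per-token `is_assistant` flags are assumptions made for the example.

```python
import math

IGNORE_INDEX = -100  # the default ignore_index of PyTorch's cross_entropy

def build_labels(token_ids, is_assistant):
    """Next-token labels; targets that are not assistant tokens are ignored."""
    labels = []
    for t in range(len(token_ids) - 1):
        target = token_ids[t + 1]
        labels.append(target if is_assistant[t + 1] else IGNORE_INDEX)
    return labels

def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over the positions that are not ignored."""
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue
        m = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[label]
        count += 1
    return total / count if count else 0.0

# Toy example: 4-token vocabulary, two prompt tokens followed by two assistant tokens.
token_ids = [0, 1, 2, 3]
is_assistant = [False, False, True, True]
labels = build_labels(token_ids, is_assistant)
uniform_logits = [[0.0] * 4 for _ in labels]
loss = masked_cross_entropy(uniform_logits, labels)
print(labels, loss)  # labels == [-100, 2, 3]; uniform logits give ln(4), about 1.386
```

Only the two assistant targets contribute to the average, so prompt and user tokens still provide context but produce no gradient.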

Usage

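# Note: This model uses a custom architecture.
# Load it with the original training code (the model and config modules) for best results.
import torch
from model import GPT2
from config import ModelConfig

config = ModelConfig()  # configuration object from the training code
model = GPT2(config)
# Load the weights on CPU; move the model to a GPU afterwards if one is available.
checkpoint = torch.load("pytorch_model.bin", map_location="cpu")
model.load_state_dict(checkpoint)
model.eval()  # inference mode (disables dropout, if the model uses any)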
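
Once the weights are loaded, text is produced one token at a time. The card does not document the model's forward signature, so the sketch below takes a hypothetical callable `next_token_logits` that maps a list of token ids to scores for the next token. Wrapping the real model in such a callable is left as an assumption. The loop uses greedy decoding and crops the context to the 1024-token window listed under Model Details.

```python
def greedy_generate(next_token_logits, prompt_ids, max_new_tokens, block_size=1024, eos_id=None):
    """Greedy decoding: repeatedly append the highest-scoring next token.

    next_token_logits is a hypothetical callable (list of ids -> list of scores).
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        context = ids[-block_size:]  # never feed more than the context window
        scores = next_token_logits(context)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ids.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
    return ids

# Stand-in "model" over a 5-token vocabulary that always prefers (last id + 1) mod 5.
def toy_model(context):
    return [1.0 if i == (context[-1] + 1) % 5 else 0.0 for i in range(5)]

print(greedy_generate(toy_model, [0], max_new_tokens=6))  # [0, 1, 2, 3, 4, 0, 1]
```

Greedy decoding is deterministic and tends to repeat itself; sampling with a temperature or top-k cutoff is a common alternative for chat-style output.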

Built From Scratch

Every component was implemented from scratch in PyTorch:

  • Multi-head causal self-attention
  • Feed-forward networks with GELU
  • Pre-LayerNorm transformer blocks
  • Positional and token embeddings
  • Weight tying between embedding and output head
  • Full training loop with mixed precision, gradient accumulation, and checkpointing
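
To show how the listed pieces fit together, here is a compact PyTorch sketch of a Pre-LayerNorm decoder with causal attention and a tied output head. It is an independent illustration, not the repository's code: the class names (`CausalSelfAttention`, `Block`, `TinyGPT`) are invented, dropout is omitted, and `nn.GELU()` is the exact form, whereas the original GPT-2 used a tanh approximation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_head, block_size):
        super().__init__()
        assert d_model % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused query/key/value projection
        self.proj = nn.Linear(d_model, d_model)
        mask = torch.tril(torch.ones(block_size, block_size, dtype=torch.bool))
        self.register_buffer("mask", mask, persistent=False)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # (B, T, C) -> (B, n_head, T, head_dim)
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) for t in (q, k, v))
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        att = att.masked_fill(~self.mask[:T, :T], float("-inf"))  # hide future positions
        att = F.softmax(att, dim=-1)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)

class Block(nn.Module):
    def __init__(self, d_model, n_head, block_size):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_head, block_size)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # Pre-LN: normalize before each sublayer
        x = x + self.mlp(self.ln2(x))
        return x

class TinyGPT(nn.Module):
    def __init__(self, vocab_size=50257, d_model=768, n_layer=12, n_head=12, block_size=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(block_size, d_model)
        self.blocks = nn.ModuleList(Block(d_model, n_head, block_size) for _ in range(n_layer))
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.head.weight = self.tok_emb.weight  # weight tying: one shared matrix

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))

# A deliberately tiny configuration so the example runs quickly on CPU.
model = TinyGPT(vocab_size=100, d_model=32, n_layer=2, n_head=4, block_size=16)
logits = model(torch.randint(0, 100, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 100])
```

With the default arguments the same classes match the dimensions in the Model Details table. The tiny configuration is used here only to keep the demonstration fast.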

Limitations

This is a learning project. The model is small (124M parameters) and was trained on far less data than production models. It can hold basic conversations, but its output quality will not match that of larger models.

License

MIT
