Custom GPT-2 (355M) Pre-Trained from Scratch & Instruction-Tuned via SFT
This repository hosts a 355-million-parameter GPT-2-style causal language model implemented from scratch in PyTorch. The base model was pre-trained locally on OpenWebText, an open-source recreation of OpenAI's WebText dataset. The .pth file in this repository contains the supervised fine-tuned (SFT) weights, trained to follow instructions and handle conversational prompts.
The project was built following the architectural principles outlined in Sebastian Raschka's "Build a Large Language Model (From Scratch)".
Model Details
- Developed by: Carlos Garcia
- Model Type: Causal Language Model (Transformer Architecture) with Instruction Fine-Tuning (SFT)
- Language: English (en)
- Parameters: 355M (standard GPT-2 Medium size)
- Context Length: 1024 tokens
- Date of Alignment: June 18, 2026
Architectural Dimensions
| Component | Specification |
|---|---|
| Layers | 24 |
| Attention Heads | 16 |
| Embedding Dimension | 1024 |
| Vocabulary Size | 50,257 (tiktoken GPT-2 BPE) |
| Query-Key-Value Bias | Disabled (False) |
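As a sanity check on the 355M figure, the parameter count can be derived from the dimensions in the table. The sketch below is an estimate under stated assumptions, not a readout of the actual checkpoint: it assumes the standard GPT-2 block layout (learned positional embeddings, a 4x feed-forward expansion, two LayerNorms per block plus a final one, biases on every linear layer except query/key/value) and an output head that shares its weights with the token embedding. The config key names are illustrative and may not match `GPT_CONFIG_355M`.

```python
# Hypothetical config mirroring the table above; key names are illustrative.
CFG = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 1024,
    "n_heads": 16,
    "n_layers": 24,
    "qkv_bias": False,
}


def count_parameters(cfg, tied_output_head=True):
    """Estimate the parameter count of a GPT-2 style model (assumed layout)."""
    d = cfg["emb_dim"]
    hidden = 4 * d  # assumed 4x feed-forward expansion, as in GPT-2

    token_emb = cfg["vocab_size"] * d
    pos_emb = cfg["context_length"] * d

    # Attention: Q, K, V projections (bias optional) plus the output projection.
    qkv = 3 * d * d + (3 * d if cfg["qkv_bias"] else 0)
    attn_out = d * d + d
    # Feed-forward: two linear layers, each with a bias.
    feed_forward = (d * hidden + hidden) + (hidden * d + d)
    # Two LayerNorms per block, each with a scale and a shift vector.
    norms = 2 * (2 * d)
    per_block = qkv + attn_out + feed_forward + norms

    final_norm = 2 * d
    # An untied output head adds a second vocab_size x emb_dim matrix.
    out_head = 0 if tied_output_head else cfg["vocab_size"] * d

    return (token_emb + pos_emb + cfg["n_layers"] * per_block
            + final_norm + out_head)


print(f"{count_parameters(CFG):,}")                          # 354,749,440
print(f"{count_parameters(CFG, tied_output_head=False):,}")  # 406,212,608
```

Under these assumptions the tied count rounds to 355M. If the custom implementation keeps a separate output head, the state dictionary would hold roughly 406M values while still matching the GPT-2 Medium dimensions.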
Intended Use
- Primary Use: Educational experimentation, conversational AI research, and local instruction-following workflows.
- Generation Style: Fine-tuned to produce helpful responses to clear instruction prompts. Inputs must be wrapped in an explicit instruction-response delimiter template for the model to perform reliably.
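The delimiter template itself is not spelled out in this card. A minimal sketch, assuming an Alpaca-style template with `### Instruction:`, `### Input:`, and `### Response:` markers, is shown below. `format_prompt` is a hypothetical helper, and the exact wording should be checked against the fine-tuning script before use.

```python
def format_prompt(instruction, input_text=""):
    """Build an Alpaca-style prompt (assumed template, not confirmed by this card)."""
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{instruction}"
    )
    # The optional input section carries extra context for the instruction.
    if input_text:
        prompt += f"\n\n### Input:\n{input_text}"
    # The model is expected to continue the text after this marker.
    return prompt + "\n\n### Response:\n"


print(format_prompt("Rewrite the sentence in passive voice.",
                    "The chef cooked the meal."))
```

A prompt that deviates from the template seen during fine-tuning will likely degrade output quality, so the same helper should be used for both training and inference.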
Training Data & Methodology
The model's development cycle consisted of two major phases:
1. Base Pre-Training
The base model was pre-trained from scratch on OpenWebText, an open-source replica of WebText, the corpus OpenAI built from the text of pages linked in Reddit posts.
2. Instruction Tuning (SFT)
The pre-trained model was then fine-tuned with supervised fine-tuning (SFT) on the Alpaca dataset for 3 epochs.
Training Hyperparameters
Fine-tuning ran locally on a deep-learning workstation with a single NVIDIA GeForce RTX 5090.
- Optimizer: AdamW
- Weight Decay: 0.1
- Learning Rate: 0.00005 ($5 \times 10^{-5}$)
- Batch Size: 8
- Epochs: 3
- Hardware Setup: Single-node local training (RTX 5090)
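The settings above map directly onto PyTorch's `torch.optim.AdamW`. The following is a minimal sketch rather than the original training script: `SFT_HPARAMS` and `build_optimizer` are hypothetical names, and any option not listed in this card (betas, epsilon, learning-rate schedule) is left at the PyTorch default as an assumption.

```python
# Values taken from the hyperparameter list above.
SFT_HPARAMS = {
    "learning_rate": 5e-5,
    "weight_decay": 0.1,
    "batch_size": 8,
    "num_epochs": 3,
}


def build_optimizer(model, hparams=SFT_HPARAMS):
    """Create an AdamW optimizer for any torch.nn.Module with the listed settings."""
    import torch  # imported lazily so the config can be inspected without PyTorch

    return torch.optim.AdamW(
        model.parameters(),
        lr=hparams["learning_rate"],
        weight_decay=hparams["weight_decay"],
        # betas and eps are not stated in this card; PyTorch defaults are assumed.
    )
```

Calling `build_optimizer(model)` on the loaded `GPTModel` instance would return an optimizer configured with the documented learning rate and weight decay.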
How to Load and Run Inference Locally
Because this model was implemented in custom PyTorch code rather than with the Hugging Face transformers library, you must load the saved .pth state dictionary directly into an instance of the custom model class, built with the matching architecture settings:
```python
import torch
from p02_gpt_model import GPTModel, GPT_CONFIG_355M

# 1. Initialize custom model configuration
model = GPTModel(GPT_CONFIG_355M)

# 2. Map the state dictionary weights
MODEL_PATH = "../models/gsp-2/gsp2_355m_sft.pth"
model_state_dict = torch.load(MODEL_PATH, map_location="cpu", weights_only=True)
model.load_state_dict(model_state_dict)
model.eval()
```
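Once the weights are loaded, text can be generated one token at a time. The sketch below uses plain greedy decoding, which is a simple stand-in and not necessarily the sampling strategy used by the author. It assumes that calling the model on a `(batch, seq_len)` tensor of token IDs returns logits of shape `(batch, seq_len, vocab_size)`; `generate_greedy` is a hypothetical helper, and token ID 50256 is the GPT-2 `<|endoftext|>` token.

```python
import torch


def generate_greedy(model, token_ids, max_new_tokens, context_size=1024, eos_id=50256):
    """Append up to max_new_tokens greedily chosen tokens to token_ids.

    token_ids: LongTensor of shape (batch, seq_len).
    Assumes model(token_ids) returns logits of shape (batch, seq_len, vocab_size).
    """
    model.eval()
    for _ in range(max_new_tokens):
        # Keep only the most recent tokens that fit in the context window.
        window = token_ids[:, -context_size:]
        with torch.no_grad():
            logits = model(window)
        # Pick the highest-scoring token at the last position.
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        # Stop early once every sequence in the batch emits the end-of-text token.
        if eos_id is not None and (next_id == eos_id).all():
            break
        token_ids = torch.cat([token_ids, next_id], dim=1)
    return token_ids
```

To use it with real text, encode the formatted prompt with `tiktoken.get_encoding("gpt2")`, wrap the resulting ID list in `torch.tensor([...])`, pass it to `generate_greedy`, and decode the returned IDs with the same encoding's `decode` method.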