Custom GPT-2 (355M) Pre-Trained from Scratch & Instruction-Tuned via SFT

This repository hosts a custom 355-million-parameter GPT-2 style causal language model implemented from scratch in PyTorch. The base model was pre-trained locally on OpenWebText, an open-source recreation of OpenAI's WebText dataset. The .pth file in this repository contains the Supervised Fine-Tuned (SFT) weights, trained to follow instructions and handle conversational prompts.

The project was built following the architectural principles outlined in Sebastian Raschka's "Build a Large Language Model (From Scratch)".


Model Details

  • Developed by: Carlos Garcia
  • Model Type: Causal Language Model (Transformer Architecture) with Instruction Fine-Tuning (SFT)
  • Language: English (en)
  • Parameters: 355M (same scale as GPT-2 Medium)
  • Context Length: 1024 tokens
  • Date of Alignment: June 18, 2026

Architectural Dimensions

| Component | Specification |
| --- | --- |
| Layers | 24 |
| Attention Heads | 16 |
| Embedding Dimension | 1024 |
| Vocabulary Size | 50,257 (tiktoken GPT-2 BPE) |
| Query-Key-Value Bias | Disabled (False) |
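As a sanity check, the 355M figure can be reproduced from the dimensions above. The sketch below is only an estimate under stated assumptions: the dictionary is a hypothetical reconstruction of `GPT_CONFIG_355M` (its key names are not shown in this card), and the layout is assumed to be the standard GPT-2 block (a 4x feed-forward expansion, two LayerNorms per block, a final LayerNorm, and biases on the output projection and feed-forward layers).

```python
# Hypothetical config mirroring the dimensions above; key names are assumptions.
GPT_CONFIG_355M = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 1024,
    "n_heads": 16,
    "n_layers": 24,
    "qkv_bias": False,
}


def count_parameters(cfg, tie_output_head=True):
    """Count weights for an assumed standard GPT-2 layout (not read from the checkpoint)."""
    d, vocab, ctx = cfg["emb_dim"], cfg["vocab_size"], cfg["context_length"]
    qkv = 3 * d * d + (3 * d if cfg["qkv_bias"] else 0)   # query, key, value projections
    attn_out = d * d + d                                   # attention output projection
    feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)   # two linear layers, 4x expansion
    layer_norms = 2 * (2 * d)                              # scale and shift, twice per block
    block = qkv + attn_out + feed_forward + layer_norms
    embeddings = vocab * d + ctx * d                       # token plus positional embeddings
    final_norm = 2 * d
    output_head = 0 if tie_output_head else vocab * d      # an untied head has its own matrix
    return embeddings + cfg["n_layers"] * block + final_norm + output_head


print(f"{count_parameters(GPT_CONFIG_355M):,}")                         # head shares the embedding matrix
print(f"{count_parameters(GPT_CONFIG_355M, tie_output_head=False):,}")  # head stored separately
```

Under these assumptions the count with a shared output head is 354,749,440, which matches the 355M figure. If the checkpoint stores a separate output head, the file would hold 406,212,608 values instead.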

Intended Use

  • Primary Use: Educational experimentation, conversational AI research, and local instruction-following workflows.
  • Generation Style: Fine-tuned to produce helpful responses to clear instruction prompts. Inputs must be wrapped in the same instruction-response prompt template used during fine-tuning for the model to perform reliably.
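The exact delimiter frame is not spelled out in this card. As an illustration only, the sketch below builds an Alpaca-style prompt, which is an assumption based on the Alpaca dataset named under Training Data. The helper name `format_prompt` and the preamble wording are hypothetical and should be replaced with whatever template was actually used during fine-tuning.

```python
def format_prompt(instruction, input_text=""):
    """Build an Alpaca-style prompt that ends at the response delimiter (assumed template)."""
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{instruction}"
    )
    # The optional input section is only included when extra context is supplied.
    if input_text:
        prompt += f"\n\n### Input:\n{input_text}"
    return prompt + "\n\n### Response:\n"


print(format_prompt("Rewrite the sentence in passive voice.", "The chef cooks the meal."))
```

The model's answer is then whatever it generates after the final `### Response:` line.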

Training Data & Methodology

The model's development cycle consisted of two major phases:

1. Base Pre-Training

The base model was pre-trained from scratch on OpenWebText, an open-source replication of WebText, the corpus of web pages linked from Reddit that OpenAI used to train GPT-2.

2. Instruction Tuning (SFT)

The model was then fine-tuned with Supervised Fine-Tuning on the Alpaca dataset for 3 epochs.


Training Hyperparameters

Fine-tuning was run locally on a workstation with a single NVIDIA GeForce RTX 5090.

  • Optimizer: AdamW
  • Weight Decay: 0.1
  • Learning Rate: 0.00005 ($5 \times 10^{-5}$)
  • Batch Size: 8
  • Epochs: 3
  • Hardware Setup: Single-node local training (RTX 5090)
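For readers who want to reproduce a similar setup, the listed values map onto PyTorch's AdamW optimizer roughly as follows. This is a minimal sketch, not the training script: the stand-in `torch.nn.Linear` module only keeps the snippet runnable on its own, and any settings not listed above (betas, learning-rate schedule, gradient clipping) are left at PyTorch defaults as an assumption.

```python
import torch

# Stand-in module so the snippet runs on its own; in practice this would be
# the 355M model instance (how training was wired up is an assumption).
model = torch.nn.Linear(8, 8)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-5,           # Learning Rate from the list above
    weight_decay=0.1,  # Weight Decay from the list above
)

BATCH_SIZE = 8   # examples per optimizer step
NUM_EPOCHS = 3   # full passes over the fine-tuning data
```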

How to Load and Run Inference Locally

Because this model is implemented in custom PyTorch code rather than with the Hugging Face transformers library, you must instantiate the same model class with a configuration that matches the architecture settings above, then load the saved .pth state dictionary into it:

import torch
from p02_gpt_model import GPTModel, GPT_CONFIG_355M

# 1. Initialize custom model configuration
model = GPTModel(GPT_CONFIG_355M)

# 2. Load the SFT state dictionary onto the CPU and copy it into the model
MODEL_PATH = "../models/gsp-2/gsp2_355m_sft.pth"
model_state_dict = torch.load(MODEL_PATH, map_location="cpu", weights_only=True)
model.load_state_dict(model_state_dict)

# 3. Switch to inference mode (disables dropout)
model.eval()
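
Once the weights are loaded, text can be produced with a simple decoding loop. The function below is a minimal greedy-decoding sketch, not code from this repository. The name `generate_greedy` is hypothetical, and it assumes the model's forward pass takes a `(batch, seq)` tensor of token IDs and returns logits of shape `(batch, seq, vocab)`.

```python
import torch


@torch.no_grad()
def generate_greedy(model, token_ids, max_new_tokens, context_length=1024):
    """Append max_new_tokens greedily chosen tokens to token_ids (shape: batch x seq)."""
    model.eval()
    for _ in range(max_new_tokens):
        # Crop to the most recent tokens that fit in the context window.
        context = token_ids[:, -context_length:]
        logits = model(context)  # assumed shape: (batch, seq, vocab)
        # Pick the highest-scoring token at the last position.
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        token_ids = torch.cat([token_ids, next_id], dim=1)
    return token_ids
```

In practice the prompt would be encoded with the GPT-2 BPE tokenizer (for example `tiktoken.get_encoding("gpt2").encode(prompt)`), wrapped in a tensor with a batch dimension, passed through this loop, and decoded back to text. The sketch does not stop at an end-of-text token, so `max_new_tokens` is the only limit on output length.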