
GPT-2 (124M) From Scratch to RLHF Alignment

This repository contains the model weights for a custom GPT-2 (124M) model trained entirely from scratch, then aligned using Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO).

Model Details

  • Base Architecture: GPT-2 (124 Million Parameters)
  • Training Framework: Custom PyTorch implementation (inspired by Andrej Karpathy's Let's Build GPT tutorial).
  • Language: English (en)
  • Task: Summarization (summarization)
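
For reference, the 124M figure can be reproduced by counting parameters. The sketch below assumes the standard GPT-2 small hyperparameters (12 layers, 768-dimensional embeddings, a 50,257-token vocabulary, a 1,024-token context) and an output head tied to the token embedding. The card does not state this model's exact configuration, so these values are assumptions.

```python
def gpt2_param_count(n_layer=12, n_embd=768, vocab_size=50257, block_size=1024):
    """Count parameters of a GPT-2 style decoder with a tied output head.

    The defaults are the standard GPT-2 small settings, assumed here because
    the card does not list this model's exact configuration.
    """
    token_emb = vocab_size * n_embd   # also reused as the output projection
    pos_emb = block_size * n_embd
    layer_norm = 2 * n_embd           # scale and bias
    # Attention: fused QKV projection plus output projection (weights + biases)
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4 * n_embd, then project back (weights + biases)
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    per_block = 2 * layer_norm + attn + mlp
    # Final term is the layer norm applied after the last block
    return token_emb + pos_emb + n_layer * per_block + layer_norm


print(f"{gpt2_param_count():,}")  # 124,439,808
```

Implementations that pad the vocabulary for GPU efficiency (for example to 50,304 entries) would report a slightly larger count.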

Training Pipeline

The model was developed through a three-stage alignment pipeline (pretraining, SFT, and PPO), trained locally on a dual T4 GPU environment.

1. Pretraining

The raw base model was trained from a random initialization on the FineWeb-Edu dataset.

  • Tokens Trained: ~10 Billion (1 full epoch)
  • Final Validation Loss: 3.0048
  • HellaSwag Accuracy: 28.95% (Note: this architecture is expected to reach ~33% with longer training).
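
As a rough aid to reading these numbers: if the reported validation loss is the mean per-token cross-entropy in nats (the default for PyTorch's cross-entropy, but an assumption here), it converts to perplexity by exponentiation. HellaSwag offers four candidate endings per item, so random guessing scores 25%.

```python
import math


def perplexity(mean_nll_nats):
    """Convert a mean per-token negative log-likelihood (in nats) to perplexity."""
    return math.exp(mean_nll_nats)


pretrain_val_loss = 3.0048  # reported above
hellaswag_acc = 0.2895      # reported above
random_baseline = 1 / 4     # four candidate endings per HellaSwag item

print(f"perplexity: {perplexity(pretrain_val_loss):.1f}")  # about 20.2
print(f"margin over chance: {(hellaswag_acc - random_baseline) * 100:.2f} points")
```

Loss values measured on different datasets (for example, the pretraining and SFT validation sets) are not directly comparable, so this conversion is most useful within a single stage.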

2. Supervised Fine-Tuning (SFT)

To teach the model to summarize, the pretrained base model was fine-tuned on the OpenAI Summarize TL;DR dataset.

  • Final Validation Loss: 2.5321 (at step 3400)
  • Checkpoint Included: best_sft.pt

3. Reinforcement Learning (RLHF / PPO)

A reward model was trained on human preference data from the OpenAI Summarize Comparisons dataset. The SFT model was then fine-tuned with PPO to maximize the reward signal, with a penalty on KL divergence from the reference model so that the generated language does not degrade.

  • Checkpoint Included: ppo_latest.pt
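
The two ingredients of this stage can be sketched in a few lines. This is a minimal illustration in plain Python, not the repository's training code: `preference_loss` is the pairwise loss commonly used to train reward models on comparison data, and `kl_shaped_rewards` is one common way to combine a reward-model score with a per-token KL penalty. The function names, the coefficient `beta`, and the choice to add the score at the final token are all assumptions.

```python
import math


def preference_loss(score_chosen, score_rejected):
    """Pairwise loss -log(sigmoid(chosen - rejected)) for one comparison."""
    margin = score_chosen - score_rejected
    # Equivalent to log(1 + exp(-margin)), written to stay stable for large |margin|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_shaped_rewards(policy_logprobs, ref_logprobs, rm_score, beta=0.1):
    """Per-token PPO rewards: a KL penalty on every generated token,
    plus the reward-model score on the last one.

    beta is a hypothetical penalty coefficient, not a value from this card.
    """
    if len(policy_logprobs) == 0 or len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("expected two non-empty log-prob lists of equal length")
    # log pi(token) - log pi_ref(token) is a per-token estimate of the KL term
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += rm_score
    return rewards


# A tie between two summaries gives a loss of log(2)
print(preference_loss(1.0, 1.0))
# The first token drifted from the reference and is penalized; the last token carries the score
print(kl_shaped_rewards([-1.0, -0.5], [-1.2, -0.5], rm_score=2.0))
```

A larger `beta` keeps the policy closer to the reference model at the cost of a lower achievable reward.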

Eval Results

The model has been evaluated qualitatively against its SFT baseline. In these comparisons, the PPO-aligned model hallucinated less and copied less text verbatim, producing more abstractive and concise summaries.

Qualitative Example

Input Post: I have a roommate who keeps eating my food without asking. Every single time I buy groceries, half of it disappears within 48 hours. I tried talking to him politely but he just laughs it off and says he will replace it, but he never does. Should I get a mini-fridge for my room or confront him one last time more aggressively?

PPO Summary: I have a roommate who keeps eating my groceries without asking, and I don't want him to do it again. Should I confront him?

Usage & Inference

Because this model was built from scratch without relying on the high-level abstractions of the transformers library, it requires a custom inference loop. The weights are provided in raw PyTorch .pt format.
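
A custom inference loop of this kind usually amounts to repeated next-token prediction. The sketch below shows greedy decoding in plain Python; it is not the repository's code. `next_token_logits` is a hypothetical callable standing in for a forward pass of the custom PyTorch model (the checkpoint layout is not documented here, so loading is omitted), and the end-of-text id 50256 and 1,024-token context are the standard GPT-2 values, assumed rather than confirmed.

```python
from typing import Callable, List, Sequence


def greedy_generate(
    next_token_logits: Callable[[List[int]], Sequence[float]],
    prompt_ids: Sequence[int],
    max_new_tokens: int = 48,
    eos_id: int = 50256,     # standard GPT-2 end-of-text id (assumed)
    block_size: int = 1024,  # standard GPT-2 context length (assumed)
) -> List[int]:
    """Return newly generated token ids, stopping at eos_id or the length limit."""
    tokens = list(prompt_ids)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        context = tokens[-block_size:]       # crop to the model's context window
        logits = next_token_logits(context)  # one score per vocabulary entry
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if next_id == eos_id:
            break
        tokens.append(next_id)
        generated.append(next_id)
    return generated


# Toy stand-in for the model: a 5-token vocabulary where the prediction is
# always (last id + 1), capped at id 4, which plays the role of end-of-text.
def toy_logits(context):
    target = min(context[-1] + 1, 4)
    return [1.0 if i == target else 0.0 for i in range(5)]


print(greedy_generate(toy_logits, [0], eos_id=4))  # [1, 2, 3]
```

Sampling with a temperature or top-k filter would replace the argmax step; the rest of the loop stays the same.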

You can interact with the live inference API here: GPT-2 Summarizer App

