# Model Card for GPT-124M
## Overview
GPT-124M is a decoder-only transformer model based on OpenAI’s GPT-2 architecture. It is trained for text generation and other natural language processing (NLP) tasks. The model is designed for general-purpose language modeling, making it useful for applications such as text completion.
- Library: 🤗 `transformers`
- License: MIT
- Datasets: `HuggingFaceFW/fineweb-edu`
- Language: English
- Base Model: `openai-community/gpt2`
- Pipeline Tag: `text-generation`
- Developer: Samkeet Sangai
- Funded By: Samkeet Sangai
- Shared By: Samkeet Sangai
- Model Type: GPT Decoder-Only
## Model Sources
- Paper: Language Models are Unsupervised Multitask Learners
- Paper: Language Models are Few-Shot Learners
- Paper: Training Compute-Optimal Large Language Models
- Video: Andrej Karpathy, Let's reproduce GPT-2 (124M)
- Demo: GPT 124M Demo
- GitHub: SamkeetSangai/GPT_124M
## Model Details
### Model Description
GPT-124M is a lightweight generative language model trained on the `HuggingFaceFW/fineweb-edu` dataset. It can generate coherent and contextually relevant text but is not fine-tuned for instruction-following, safety, or factual accuracy.
### Training Configuration
- Block Size: 1024
- Vocabulary Size: 50304
- Number of Layers: 12
- Number of Attention Heads: 12
- Embedding Size: 768
- Hardware: 8x NVIDIA RTX 4090 GPUs
- Training Duration: 13 hours
- Dataset: `HuggingFaceFW/fineweb-edu` (10 billion tokens)
- Training Date: January 2025
- Validation Dataset: 100 million tokens of `HuggingFaceFW/fineweb-edu`
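For reference, the hyperparameters above map onto a standard `GPT2Config` roughly as follows. This is a sketch, assuming the repository uses the stock GPT-2 field names rather than a custom configuration class:

```python
from transformers import GPT2Config

# Illustrative mapping of the training configuration above onto GPT2Config;
# not necessarily the exact configuration object used during training.
config = GPT2Config(
    n_positions=1024,  # block size (maximum context length)
    vocab_size=50304,  # padded vocabulary size
    n_layer=12,        # transformer decoder blocks
    n_head=12,         # attention heads per block
    n_embd=768,        # embedding / hidden size
)
print(config)
```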
## Usage
You can use this model for text generation using the `transformers` library.
### Method 1: Using Pipeline
```python
# Import necessary modules from transformers
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

# Load tokenizer and model
model_name = "samkeet/GPT_124M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Create text generation pipeline
pipe = pipeline("text-generation", model=model_name, tokenizer=tokenizer, trust_remote_code=True, device="cpu")

# Generate text
result = pipe("Earth revolves around the", do_sample=True, max_length=40, temperature=0.9, top_p=0.5, top_k=50)
print("Pipeline Output:", result)
```
### Method 2: Direct Generation
```python
# Import necessary libraries
import torch

# Function for direct tokenization and text generation
# (reuses the tokenizer and model loaded in Method 1)
def generate_text(input_text, device="cpu"):
    tokens = tokenizer.encode(input_text, return_tensors="pt").to(device)
    model.to(device)

    # Generate output
    output = model.generate(
        tokens,
        do_sample=True,
        max_length=40,
        temperature=0.9,
        top_p=0.5,
        top_k=50,
    )

    # Decode generated text
    generated_sentence = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_sentence

# Generate text
input_text = "Earth revolves around the"
print("Direct Output:", generate_text(input_text))
```
## Fine-tuning & Downstream Use
This model can be fine-tuned for specific NLP applications like:
- Dialogue generation
- Text summarization
- Creative writing
- Code generation
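A minimal fine-tuning sketch using the Hugging Face `Trainer` is shown below. The dataset (`wikitext-2-raw-v1`) and all hyperparameters are placeholders chosen for illustration, not values recommended or used by the model author:

```python
# Minimal causal-LM fine-tuning sketch using the Hugging Face Trainer.
# The dataset and hyperparameters here are illustrative placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "samkeet/GPT_124M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2-style tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Replace with your task-specific corpus (dialogue, summaries, code, ...).
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
dataset = dataset.filter(lambda ex: ex["text"].strip() != "")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt124m-finetuned", per_device_train_batch_size=4, num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```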
## Limitations & Risks
### Out-of-Scope Use
- The model is not instruction-tuned for safety, ethics, or factual accuracy.
- It may produce biased, misleading, or unsafe outputs.
- It should not be used for tasks requiring high reliability, such as medical, legal, or financial applications.
### Bias, Risks, and Limitations
- The dataset may contain biases present in public web data.
- The model does not filter or detect offensive content.
- The model may hallucinate incorrect facts.
### Recommendations
- Always verify generated content before use.
- Implement content filtering mechanisms for deployment.
- Use in supervised environments only.
## Evaluation
### Training & Validation Loss
Validation was conducted using 100 million tokens from the `HuggingFaceFW/fineweb-edu` dataset. The training and validation loss curves indicate stable convergence with minimal overfitting: the training loss reached a minimum of 2.88, while the validation loss stabilized at 2.97.
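If you want to measure the same kind of token-level cross-entropy on your own text, a quick sketch follows. It assumes the checkpoint exposes the standard causal-LM loss interface; the example text and resulting numbers are arbitrary, not the reported 2.88/2.97 figures:

```python
# Compute mean next-token cross-entropy (and perplexity) on a sample text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "samkeet/GPT_124M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).eval()

text = "Photosynthesis converts light energy into chemical energy stored as glucose."
input_ids = tokenizer(text, return_tensors="pt").input_ids
with torch.no_grad():
    loss = model(input_ids, labels=input_ids).loss  # mean token-level cross-entropy
print(f"loss = {loss.item():.2f}, perplexity = {torch.exp(loss).item():.1f}")
```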
### Results
The model was benchmarked against OpenAI's GPT-2 Small and GPT-3 Small (both ~124M parameters). Remarkably, despite being trained on only 10 billion tokens, compared with GPT-3 Small's 300 billion tokens, GPT-124M outperformed both models on the HellaSwag evaluation. This advantage is attributed to the specialized training data (educational content), which contrasts with GPT-3 Small's broader multilingual and multi-domain training data.
According to the Chinchilla scaling laws, the compute-optimal ratio of roughly 20 training tokens per parameter implies that a 124M-parameter model ideally requires about 2.48 billion tokens (124M × 20) for training. The excess training tokens used for GPT-3 Small may therefore have yielded diminishing returns in performance.
### Key Insights from Evaluation
- Efficient Training: The model delivers strong performance relative to its training token count, reflecting efficient use of resources enabled by Distributed Data Parallel (DDP) training (a minimal DDP sketch follows this list).
- Data-Specific Advantage: Training exclusively on educational data may have given GPT-124M an edge on benchmarks such as HellaSwag.
- Scaling Considerations: GPT-3 Small, despite being trained on 300B tokens, does not exhibit proportionally better performance due to scaling limitations.
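The sketch below illustrates the DDP setup in the spirit of the 8-GPU configuration described in this card. It is not the actual training script used for GPT-124M, and the optimizer settings are arbitrary assumptions:

```python
# ddp_sketch.py: minimal DistributedDataParallel illustration.
# Launch with: torchrun --standalone --nproc_per_node=8 ddp_sketch.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import GPT2Config, GPT2LMHeadModel

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# GPT-2-small-sized model matching the configuration described above.
config = GPT2Config(n_positions=1024, vocab_size=50304, n_layer=12, n_head=12, n_embd=768)
model = GPT2LMHeadModel(config).to(local_rank)
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)  # illustrative learning rate

# One dummy optimization step: each rank processes its own batch, and DDP
# all-reduces the gradients across the 8 processes during backward().
input_ids = torch.randint(0, config.vocab_size, (4, 1024), device=f"cuda:{local_rank}")
loss = model(input_ids=input_ids, labels=input_ids).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()

dist.destroy_process_group()
```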
## Environmental Impact
- Hardware Used: 8x NVIDIA RTX 4090 GPUs
- Training Time: 13 hours (104 GPU-hours across 8 GPUs)
- Estimated Carbon Emissions: 13.48 kg CO2 eq.
- Equivalent To:
  - 54.5 km driven by an average ICE car
  - 6.75 kg of coal burned
  - 0.22 tree seedlings sequestering carbon for 10 years
## Technical Specifications
### Model Architecture
GPT-124M follows the architecture of OpenAI's GPT-2, which consists of:
- Transformer-based decoder model
- Self-attention mechanism
- Layer normalization & feed-forward networks
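These components can be confirmed by inspecting the published checkpoint. A quick sketch, assuming the repository exposes a GPT-2-style configuration:

```python
# Inspect the checkpoint's configuration and parameter count
# (assumes a GPT-2-style config with n_layer / n_head / n_embd fields).
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("samkeet/GPT_124M", trust_remote_code=True)
print(config)  # expect n_layer=12, n_head=12, n_embd=768, n_positions=1024

model = AutoModelForCausalLM.from_pretrained("samkeet/GPT_124M", trust_remote_code=True)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
```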
### Compute Infrastructure
- Hardware: 8x NVIDIA RTX 4090 GPUs
- Software: PyTorch, Hugging Face Transformers
- Precision: FP32
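Since the published weights are FP32, they can be loaded as-is; casting to half precision at load time is an optional optimization for GPU inference that the card does not prescribe. A sketch using the standard `torch_dtype` argument:

```python
# Load in the trained FP32 precision (default), or optionally cast to FP16 for
# faster GPU inference; the half-precision variant is an unvalidated optimization.
import torch
from transformers import AutoModelForCausalLM

model_fp32 = AutoModelForCausalLM.from_pretrained("samkeet/GPT_124M", trust_remote_code=True)
model_fp16 = AutoModelForCausalLM.from_pretrained(
    "samkeet/GPT_124M", trust_remote_code=True, torch_dtype=torch.float16
).to("cuda")
```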
## Citation
If you use this model, please cite:
```bibtex
@article{gpt124m,
  title={GPT-124M: A Compact Transformer Model for NLP},
  author={Samkeet Sangai},
  year={2024},
  url={https://huggingface.co/samkeet/GPT_124M}
}
```
## Contact
For inquiries, contact Samkeet Sangai.