Model Card for GPT-124M

Overview

GPT-124M is a decoder-only transformer model based on OpenAI’s GPT-2 architecture. It is designed for general-purpose language modeling and text generation, making it useful for applications such as text completion and other natural language processing (NLP) tasks.

  • Library: 🤗 transformers
  • License: MIT
  • Datasets: HuggingFaceFW/fineweb-edu
  • Language: English
  • Base Model: openai-community/gpt2
  • Pipeline Tag: text-generation
  • Developer: Samkeet Sangai
  • Funded By: Samkeet Sangai
  • Shared By: Samkeet Sangai
  • Model Type: GPT Decoder-Only

Model Details

Model Description

GPT-124M is a lightweight generative language model fine-tuned on the fineweb-edu dataset. It can generate coherent and contextually relevant text but is not fine-tuned for instruction-following, safety, or factual accuracy.

Training Configuration

  • Block Size: 1024
  • Vocabulary Size: 50304
  • Number of Layers: 12
  • Number of Attention Heads: 12
  • Embedding Size: 768
  • Hardware: 8x NVIDIA RTX 4090 GPUs
  • Training Duration: 13 hours
  • Dataset: fineweb-edu (10 billion tokens)
  • Training Date: January 2025
  • Validation Dataset: 100 million tokens of HuggingFaceFW/fineweb-edu
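
For reference, the hyperparameters above correspond roughly to the following GPT-2-style configuration in 🤗 transformers. This is an illustrative sketch, not the actual training code, which lives in the repository as custom model code:

# Illustrative sketch only: the hyperparameters above expressed as a GPT2Config.
# The actual checkpoint ships custom code; this is not the training implementation.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=50304,   # padded vocabulary size
    n_positions=1024,   # block size (context length)
    n_embd=768,         # embedding size
    n_layer=12,         # number of layers
    n_head=12,          # number of attention heads
)

model = GPT2LMHeadModel(config)
print(f"Parameters: {model.num_parameters() / 1e6:.1f}M")  # roughly 124M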

Usage

You can use this model for text generation with the 🤗 transformers library.

Method 1: Using Pipeline

# Import necessary modules from transformers
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

# Load tokenizer and model
model_name = "samkeet/GPT_124M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Create text generation pipeline using the loaded model and tokenizer
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, trust_remote_code=True, device="cpu")

# Generate text
result = pipe("Earth revolves around the", do_sample=True, max_length=40, temperature=0.9, top_p=0.5, top_k=50)
print("Pipeline Output:", result)

Method 2: Direct Generation

# Import necessary libraries
import torch

# Function for direct tokenization and text generation
# (reuses the model and tokenizer loaded in Method 1)
def generate_text(input_text, device='cpu'):
    tokens = tokenizer.encode(input_text, return_tensors='pt').to(device)
    model.to(device)
    
    # Generate output
    output = model.generate(
        tokens, 
        do_sample=True, 
        max_length=40, 
        temperature=0.9,
        top_p=0.5,
        top_k=50,
    )
    
    # Decode the first generated sequence
    generated_sentence = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_sentence

# Generate text
input_text = "Earth revolves around the"
print("Direct Output:", generate_text(input_text))

Fine-tuning & Downstream Use

This model can be fine-tuned for specific NLP applications such as the ones below; a minimal fine-tuning sketch follows the list.

  • Dialogue generation
  • Text summarization
  • Creative writing
  • Code generation
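
As a starting point, a conventional causal-LM fine-tuning loop with the 🤗 Trainer might look like the sketch below; the dataset, batch size, and learning rate are illustrative placeholders rather than tuned recommendations:

# Hedged sketch: standard causal-LM fine-tuning with the Hugging Face Trainer.
# Dataset choice and hyperparameters are placeholders, not recommendations.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "samkeet/GPT_124M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# GPT-2-style tokenizers usually have no pad token; reuse EOS for padding
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Example corpus; replace with your task-specific text dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="gpt124m-finetuned",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=5e-5,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()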

Limitations & Risks

Out-of-Scope Use

  • The model is not instruction-tuned for safety, ethics, or factual accuracy.
  • It may produce biased, misleading, or unsafe outputs.
  • It should not be used for tasks requiring high reliability, such as medical, legal, or financial applications.

Bias, Risks, and Limitations

  • The dataset may contain biases present in public web data.
  • The model does not filter or detect offensive content.
  • The model may hallucinate incorrect facts.

Recommendations

  • Always verify generated content before use.
  • Implement content filtering mechanisms for deployment (a minimal filter sketch follows this list).
  • Use in supervised environments only.
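
For example, a very small post-generation filter might look like the sketch below; the blocklist terms are hypothetical placeholders, and this is not a substitute for a proper content-moderation system:

# Minimal post-generation filter sketch. The blocklist is a placeholder and
# is NOT a production-grade safety mechanism.
BLOCKLIST = {"example_banned_term", "another_banned_term"}  # hypothetical terms

def filter_generation(text: str) -> str:
    lowered = text.lower()
    if any(term in lowered for term in BLOCKLIST):
        return "[generation withheld by content filter]"
    return text

# Usage with the pipeline from the Usage section:
# print(filter_generation(pipe("Earth revolves around the")[0]["generated_text"]))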

Evaluation

Training & Validation Loss

Validation was conducted using 100 million tokens from the HuggingFaceFW/fineweb-edu dataset. The training and validation loss curves indicate stable convergence with minimal overfitting: the training loss reached a minimum of 2.88, while the validation loss stabilized at 2.97.

Results

The model was benchmarked against OpenAI’s GPT-2 Small and GPT-3 Small (both ~124M parameters). Remarkably, despite being trained on only 10 billion tokens versus GPT-3 Small’s 300 billion, GPT-124M outperforms both models on the HellaSwag evaluation. This advantage is attributed to the specialized training data (educational content), in contrast to GPT-3 Small’s broader multilingual and multi-domain training data.

According to the Chinchilla scaling laws, the compute-optimal ratio of roughly 20 training tokens per parameter implies that a 124M-parameter model needs about 2.48 billion training tokens. The excess training tokens used for GPT-3 Small may therefore have yielded diminishing returns.
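
As a quick arithmetic check, applying that roughly 20-tokens-per-parameter heuristic to this model's size:

$$
D_{\text{opt}} \approx 20 \cdot N = 20 \times 1.24 \times 10^{8} \approx 2.48 \times 10^{9} \ \text{tokens}
$$

By that measure, GPT-124M's 10-billion-token budget is about 4x the Chinchilla-optimal amount for its size, while GPT-3 Small's 300 billion tokens is over 100x.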

Key Insights from Evaluation

  • Efficient Training: The model performs well relative to its modest training token budget, and Distributed Data Parallel (DDP) training across the 8 GPUs kept the run to 13 hours of wall-clock time.
  • Data-Specific Advantage: Training exclusively on educational data may have given GPT-124M an edge in evaluation metrics like HellaSwag.
  • Scaling Considerations: Despite being trained on 300B tokens, GPT-3 Small does not exhibit proportionally better performance, consistent with diminishing returns far beyond the compute-optimal token budget.

Environmental Impact

  • Hardware Used: 8x NVIDIA RTX 4090 GPUs
  • Training Time: 13 hours on 8 GPUs (104 GPU-hours)
  • Estimated Carbon Emissions: 13.48 kg CO2 eq. (see the back-of-envelope check after this list)
  • Equivalent to:
    • 54.5 km driven by an average ICE car
    • 6.75 kg of coal burned
    • 0.22 tree seedlings sequestering carbon for 10 years
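
The emissions figure above is consistent with the usual GPU-hours x power x grid-carbon-intensity estimate; the sketch below reproduces it under assumed values for per-GPU power draw and grid carbon intensity (neither is stated in this card):

# Back-of-envelope check of the reported estimate. Both constants below are
# assumptions, not measured values from the training run.
gpu_hours = 104              # 8 GPUs x 13 hours (from this card)
gpu_power_kw = 0.45          # assumed ~450 W draw per RTX 4090
grid_kg_co2_per_kwh = 0.288  # assumed grid carbon intensity

energy_kwh = gpu_hours * gpu_power_kw            # ~46.8 kWh
emissions_kg = energy_kwh * grid_kg_co2_per_kwh  # ~13.5 kg CO2 eq.
print(f"{energy_kwh:.1f} kWh -> {emissions_kg:.2f} kg CO2 eq.")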

Technical Specifications

Model Architecture

GPT-124M follows the architecture of OpenAI's GPT-2, which consists of:

  • Transformer-based decoder model
  • Self-attention mechanism
  • Layer normalization & feed-forward networks
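
To inspect these settings directly from the published checkpoint, something along these lines should work (assuming the repository's custom code exposes a standard configuration object):

# Load the checkpoint and print its architectural configuration
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("samkeet/GPT_124M", trust_remote_code=True)
print(model.config)                                        # layers, heads, embedding size, context length
print(f"Parameters: {model.num_parameters() / 1e6:.1f}M")  # ~124M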

Compute Infrastructure

  • Hardware: 8x NVIDIA RTX 4090 GPUs
  • Software: PyTorch, Hugging Face Transformers
  • Precision: FP32

Citation

If you use this model, please cite:

@article{gpt124m,
  title={GPT-124M: A Compact Transformer Model for NLP},
  author={Samkeet Sangai},
  year={2024},
  url={https://huggingface.co/samkeet/GPT_124M}
}

Contact

For inquiries, contact Samkeet Sangai.
