slimGPT: 124M-Parameter GPT-Style Language Model

slimGPT is a 124-million-parameter autoregressive language model built from scratch using a clean, modular PyTorch codebase. It follows the GPT-2 small architecture and was trained entirely on consumer-accessible hardware, demonstrating that capable language model training is achievable without large-scale infrastructure.


Model Details

Property          Value
Architecture      GPT-2 style (decoder-only Transformer)
Parameters        ~124 million
Layers            12
Attention Heads   12
Embedding Dim     768
Context Length    1024 tokens
Vocabulary        GPT-2 BPE tokenizer (50,257 tokens)
Training Iters    5,000
Best Val Loss     3.3079
License           MIT
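
As a rough cross-check of the ~124 million figure, the parameter count can be derived from the values in the table. This is a sketch under assumptions the card does not state: biases on every linear layer and LayerNorm, a 4x feed-forward expansion as in the original GPT-2, and an output head that shares the token embedding matrix. The function name `gpt2_param_count` is illustrative.

```python
def gpt2_param_count(n_layer=12, n_embd=768, vocab_size=50257, block_size=1024):
    """Count parameters of a GPT-2 style model with a tied output head.

    Assumes biases everywhere and a 4x MLP expansion (original GPT-2 layout).
    """
    wte = vocab_size * n_embd   # token embeddings, reused as the output projection
    wpe = block_size * n_embd   # learned positional embeddings
    ln = 2 * n_embd             # LayerNorm weight + bias

    # Attention: fused QKV projection plus output projection
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Feed-forward: expand to 4 * n_embd, then project back
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)

    block = ln + attn + ln + mlp
    return wte + wpe + n_layer * block + ln  # trailing ln is the final LayerNorm


total = gpt2_param_count()
print(f"{total:,}")  # 124,439,808
```

Under these assumptions the token embedding matrix alone accounts for about 38.6M of the total, which is why tying it to the output projection matters at this scale.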

Training Infrastructure

The model was trained on a single-GPU cloud instance with the following specifications:

Component       Specification
OS              Debian GNU/Linux 12 (Bookworm)
CPU             Intel Xeon @ 2.20 GHz (4 vCPUs, 2 physical cores, 2 threads/core)
RAM             16 GiB
Storage         60 GB NVMe
GPU             NVIDIA L4
VRAM            24 GB
NVIDIA Driver   550.54.15

Training was completed without any distributed setup; a single NVIDIA L4 GPU was sufficient for the full training run.


Architecture Overview

slimGPT follows the standard GPT-2 decoder-only Transformer architecture:

  • Token + positional embeddings: learned embeddings over the GPT-2 BPE vocabulary with 1024-token positional encodings
  • 12 Transformer blocks: each with multi-head causal self-attention (12 heads) and a position-wise feed-forward network
  • Pre-norm design: LayerNorm applied before attention and MLP sub-layers
  • Weight tying: input embedding and output projection weights are tied
  • Causal masking: autoregressive, left-to-right generation
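
The pre-norm block described above can be sketched in PyTorch roughly as follows. This is an illustrative sketch, not the slimGPT source: the class name `Block`, the fused `qkv` projection, the GELU activation, and the 4x MLP width are assumptions based on the standard GPT-2 layout, and `F.scaled_dot_product_attention` requires PyTorch 2.0 or later.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Block(nn.Module):
    """One pre-norm Transformer block: LayerNorm comes before each sub-layer."""

    def __init__(self, n_embd: int = 768, n_head: int = 12):
        super().__init__()
        assert n_embd % n_head == 0, "embedding dim must divide evenly across heads"
        self.n_head = n_head
        self.ln_1 = nn.LayerNorm(n_embd)
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # fused query/key/value projection
        self.proj = nn.Linear(n_embd, n_embd)      # attention output projection
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(                  # position-wise feed-forward network
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def attention(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # Reshape each tensor to (B, n_head, T, head_dim)
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
                   for t in (q, k, v))
        # is_causal=True masks out attention to future positions
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attention(self.ln_1(x))  # pre-norm residual attention
        x = x + self.mlp(self.ln_2(x))        # pre-norm residual MLP
        return x
```

Because of the causal mask, changing the last token of an input leaves the block's outputs at all earlier positions unchanged, which is the property that makes left-to-right generation possible.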

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("samueljayasingh/slimGPT")
model = AutoModelForCausalLM.from_pretrained("samueljayasingh/slimGPT")

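# Tokenize the prompt; returns input_ids and attention_mask as PyTorch tensors
inputs = tokenizer("The meaning of life is", return_tensors="pt")
# Sample up to 100 new tokens using temperature and nucleus (top-p) sampling
output = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))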

Pipeline API

from transformers import pipeline

generator = pipeline("text-generation", model="samueljayasingh/slimGPT")
result = generator("Once upon a time,", max_new_tokens=80, do_sample=True)
print(result[0]["generated_text"])

Serving with vLLM

pip install vllm
vllm serve "samueljayasingh/slimGPT"

curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "samueljayasingh/slimGPT",
    "prompt": "The future of AI is",
    "max_tokens": 100,
    "temperature": 0.7
  }'
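
The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call above; the helper names `build_completion_request` and `complete` are illustrative, and it assumes the server is running at the default address with the OpenAI-compatible completions response shape (`choices[0].text`).

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"   # same address as the curl example
MODEL = "samueljayasingh/slimGPT"


def build_completion_request(prompt, max_tokens=100, temperature=0.7):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (requires a running server)."""
    req = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the server running, `complete("The future of AI is")` returns the continuation as a string.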

Intended Use

This model is intended for:

  • Research and experimentation: studying language model behavior, attention patterns, and generation dynamics at the 124M scale
  • Educational purposes: understanding GPT-style architectures by working with a fully transparent, from-scratch implementation
  • Prototyping: lightweight text generation for downstream tasks, fine-tuning experiments, or benchmarking

Out-of-Scope Use

  • Production or safety-critical applications
  • Tasks requiring factual accuracy or up-to-date knowledge
  • Any use that relies on instruction-following or alignment: this is a base language model with no RLHF or instruction tuning

Limitations

  • Trained for only 5,000 iterations: the model can produce coherent text continuations but has not reached the quality of a fully trained GPT-2
  • No fine-tuning or alignment: outputs are raw continuations and may be incoherent, biased, or off-topic
  • English-only: trained on English text; performance on other languages has not been evaluated
  • Context window of 1024 tokens: longer inputs must be truncated to fit
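
One way to handle the 1024-token limit when generating is to trim the prompt so that the prompt plus the new tokens fit inside the window. The sketch below keeps the most recent tokens; this is one possible policy rather than anything built into the model, and the helper name `fit_to_context` is hypothetical.

```python
def fit_to_context(token_ids, max_new_tokens, block_size=1024):
    """Keep the most recent prompt tokens so prompt + generation fits the window.

    token_ids: list of prompt token ids
    max_new_tokens: number of tokens to be generated afterwards
    block_size: the model's context length (1024 for slimGPT)
    """
    if not 0 < max_new_tokens < block_size:
        raise ValueError("max_new_tokens must be between 1 and block_size - 1")
    budget = block_size - max_new_tokens  # room left for the prompt
    # Negative slicing returns the whole list when it is already short enough
    return token_ids[-budget:]


prompt_ids = list(range(2000))            # stand-in for a long tokenized document
trimmed = fit_to_context(prompt_ids, max_new_tokens=100)
print(len(trimmed))  # 924
```

Dropping the oldest tokens preserves the text immediately before the continuation point, which usually matters most for a base model doing plain continuation.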

Training Details

The model was trained using a clean, readable PyTorch implementation with the following highlights:

  • Optimizer: AdamW with cosine learning rate decay and linear warmup
  • Tokenizer: GPT-2 BPE (via tiktoken)
  • Data: OpenWebText-style dataset sampled in token chunks of length 1024
  • Mixed precision: torch.autocast with bfloat16 on the NVIDIA L4 GPU
  • Gradient clipping: Applied to stabilize training
  • Checkpointing: Best model saved based on validation loss
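
The learning rate schedule named in the list above (linear warmup followed by cosine decay) can be sketched as a plain function of the step number. The specific values here (`max_lr`, `min_lr`, `warmup_steps`) are placeholders chosen for illustration, not the settings used to train slimGPT; only the 5,000-step horizon comes from this card.

```python
import math


def get_lr(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=100, max_steps=5000):
    """Linear warmup to max_lr, then cosine decay down to min_lr.

    All default values except max_steps are illustrative assumptions.
    """
    if step < warmup_steps:
        # Linear ramp; (step + 1) avoids a learning rate of exactly zero at step 0
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    # Fraction of the decay phase completed, in [0, 1)
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return min_lr + coeff * (max_lr - min_lr)


for step in (0, 100, 2550, 5000):
    print(step, f"{get_lr(step):.2e}")
```

In a training loop, the returned value would typically be written into each of the optimizer's parameter groups before the optimizer step.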

Training Runtime

  • Hardware: NVIDIA L4 (24 GB VRAM), 4 vCPUs, 16 GiB RAM
  • Training iterations: 5,000
  • Total training time: ~18 hours
  • Average time per iteration: ~13 seconds
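
The per-iteration figure follows directly from the totals above, as this quick check shows. It uses only the rounded numbers reported in this section, so the result is approximate.

```python
total_hours = 18      # reported total training time (approximate)
iterations = 5_000    # reported number of training iterations

seconds_per_iter = total_hours * 3600 / iterations
print(f"{seconds_per_iter:.2f} s/iter")  # 12.96 s/iter
```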

Evaluation

Metric           Value
Best Val Loss    3.3079
Training Iters   5,000

Perplexity can be approximated as exp(3.3079) ≈ 27.3. For reference, a fully trained GPT-2 small achieves a perplexity of roughly 18–22 on OpenWebText; slimGPT sits in a reasonable range for its training budget.
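
The conversion between loss and perplexity can be reproduced in a few lines. This sketch assumes the validation loss is the mean per-token cross-entropy in nats, which is what PyTorch's cross-entropy loss reports by default; the reference range of 18 to 22 is taken from the paragraph above.

```python
import math

val_loss = 3.3079                      # best validation loss reported above
perplexity = math.exp(val_loss)
print(f"{perplexity:.1f}")             # 27.3

# The same relation inverted: losses corresponding to the reference perplexities
low, high = math.log(18), math.log(22)
print(f"{low:.2f} to {high:.2f}")      # 2.89 to 3.09
```

Read this way, the gap to the reference range corresponds to roughly 0.2 to 0.4 nats of validation loss.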

Eval Summary

[Figure: Training loss curve]

[Figure: Perplexity comparison]


Citation

If you use this model in your work, please credit:

@misc{slimgpt2026,
  author       = {Samuel Jayasingh},
  title        = {slimGPT: A 124M GPT-2-style language model trained from scratch},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/samueljayasingh/slimGPT}}
}

Credits

Inspired by Andrej Karpathy's "Let's reproduce GPT-2 (124M)" tutorial: https://www.youtube.com/watch?v=l8pRSuU81PU

Special thanks to Andrej Karpathy for making modern LLM training and implementation accessible through open educational content.


License

This model is released under the MIT License.
