slimGPT — 124M Parameter GPT-Style Language Model

slimGPT is a 124-million-parameter autoregressive language model built from scratch in a modular PyTorch codebase. It follows the GPT-2 small architecture and was trained on a single cloud NVIDIA L4 GPU in about 18 hours, showing that a working language model of this size can be trained without multi-GPU infrastructure.

Model Details

Property	Value
Architecture	GPT-2 style (decoder-only Transformer)
Parameters	~124 million
Layers	12
Attention Heads	12
Embedding Dim	768
Context Length	1024 tokens
Vocabulary	GPT-2 BPE tokenizer (50,257 tokens)
Training Iters	5,000
Best Val Loss	3.3079
License	MIT
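
The ~124 million figure can be reproduced from the values in the table with some arithmetic. The sketch below assumes details this card does not state: a 4x MLP expansion (3072 hidden units), bias terms on every linear layer and LayerNorm, and a final LayerNorm, as in the original GPT-2 small. The function name is illustrative and not part of the slimGPT codebase.

```python
def gpt2_param_count(n_layer=12, n_embd=768, vocab_size=50257, block_size=1024):
    """Count parameters of a GPT-2-style model with tied input/output embeddings.

    Assumptions (not stated in the card): 4x MLP expansion, biases on all
    linear layers and LayerNorms, and a final LayerNorm after the last block.
    """
    token_emb = vocab_size * n_embd   # shared with the output projection (weight tying)
    pos_emb = block_size * n_embd     # learned positional embeddings
    layer_norm = 2 * n_embd           # weight + bias
    # Attention: fused QKV projection plus output projection, each with bias.
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4 * n_embd, then project back, each with bias.
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = 2 * layer_norm + attn + mlp
    return token_emb + pos_emb + n_layer * block + layer_norm


print(f"{gpt2_param_count():,}")  # 124,439,808
```

Because the embedding matrix is shared with the output projection, it is counted once; an untied output layer would add roughly 38.6 million more parameters.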

Training Infrastructure

The model was trained on a single-GPU cloud instance with the following specifications:

Component	Specification
OS	Debian GNU/Linux 12 (Bookworm)
CPU	Intel Xeon @ 2.20 GHz (4 vCPUs, 2 physical cores, 2 threads/core)
RAM	16 GiB
Storage	60 GB NVMe
GPU	NVIDIA L4
VRAM	24 GB
NVIDIA Driver	550.54.15

Training was completed without any distributed setup; a single NVIDIA L4 GPU was sufficient for the full training run.

Architecture Overview

slimGPT follows the standard GPT-2 decoder-only Transformer architecture:

Token + positional embeddings — learned embeddings over the GPT-2 BPE vocabulary with 1024-token positional encodings
12 Transformer blocks — each with multi-head causal self-attention (12 heads) and a position-wise feed-forward network
Pre-norm design — LayerNorm applied before attention and MLP sub-layers
Weight tying — input embedding and output projection weights are tied
Causal masking — autoregressive, left-to-right generation
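
The pre-norm and causal-masking points above can be illustrated with a single Transformer block. This is a minimal sketch, not the slimGPT source: the class and attribute names are made up for the example, the 4x MLP width and GELU activation are assumptions borrowed from GPT-2, and it relies on `torch.nn.functional.scaled_dot_product_attention`, which requires PyTorch 2.0 or later.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Block(nn.Module):
    """One pre-norm Transformer block: x + attn(ln(x)), then x + mlp(ln(x))."""

    def __init__(self, n_embd=768, n_head=12):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.ln_1 = nn.LayerNorm(n_embd)
        self.qkv = nn.Linear(n_embd, 3 * n_embd)
        self.attn_proj = nn.Linear(n_embd, n_embd)
        self.ln_2 = nn.LayerNorm(n_embd)
        # 4x hidden width and GELU are assumptions, not confirmed by the card.
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def attention(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # Reshape to (B, n_head, T, head_dim) so each head attends independently.
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
                   for t in (q, k, v))
        # is_causal=True prevents each position from attending to later ones.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.attn_proj(y.transpose(1, 2).contiguous().view(B, T, C))

    def forward(self, x):
        x = x + self.attention(self.ln_1(x))  # LayerNorm before attention
        x = x + self.mlp(self.ln_2(x))        # LayerNorm before the MLP
        return x


torch.manual_seed(0)
block = Block(n_embd=48, n_head=4).eval()  # small sizes for a quick check
x = torch.randn(1, 8, 48)
with torch.no_grad():
    y = block(x)
print(y.shape)  # torch.Size([1, 8, 48])
```

A quick way to confirm the causal mask is to perturb the last input position and check that the outputs at all earlier positions are unchanged.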

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("samueljayasingh/slimGPT")
model = AutoModelForCausalLM.from_pretrained("samueljayasingh/slimGPT")

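# Tokenize the prompt; passing the full encoding also supplies the attention mask.
inputs = tokenizer("The meaning of life is", return_tensors="pt")
# Sample up to 100 new tokens using temperature and nucleus (top-p) sampling.
output = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))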

Pipeline API

from transformers import pipeline

generator = pipeline("text-generation", model="samueljayasingh/slimGPT")
result = generator("Once upon a time,", max_new_tokens=80, do_sample=True)
print(result[0]["generated_text"])

Serving with vLLM

pip install vllm
vllm serve "samueljayasingh/slimGPT"

curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "samueljayasingh/slimGPT",
    "prompt": "The future of AI is",
    "max_tokens": 100,
    "temperature": 0.7
  }'
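
The same request can be issued from Python. The following is a minimal sketch using only the standard library, assuming the server started above is listening on localhost:8000. The helper names are illustrative, and the response is assumed to follow the OpenAI-style completions shape (`choices[0].text`) that the vLLM server emulates.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/completions"  # same endpoint as the curl call


def build_request(prompt, max_tokens=100, temperature=0.7):
    """Build (but do not send) a POST request for the completions endpoint."""
    payload = {
        "model": "samueljayasingh/slimGPT",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # Assumed response shape: {"choices": [{"text": "..."}], ...}
    return body["choices"][0]["text"]


# With the server running:
# print(complete("The future of AI is"))
```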

Intended Use

This model is intended for:

Research and experimentation — studying language model behavior, attention patterns, and generation dynamics at the 124M scale
Educational purposes — understanding GPT-style architectures by working with a fully transparent, from-scratch implementation
Prototyping — lightweight text generation for downstream tasks, fine-tuning experiments, or benchmarking

Out-of-Scope Use

Production or safety-critical applications
Tasks requiring factual accuracy or up-to-date knowledge
Any use that relies on instruction-following or alignment — this is a base language model with no RLHF or instruction tuning

Limitations

Trained for only 5,000 iterations — the model is capable of coherent text continuation but has not converged to the quality of fully trained GPT-2
No fine-tuning or alignment — outputs are raw continuations and may be incoherent, biased, or off-topic
English-only — trained on English text; performance on other languages is not evaluated
Context window of 1024 tokens — longer documents are truncated

Training Details

The model was trained using a clean, readable PyTorch implementation with the following highlights:

Optimizer: AdamW with cosine learning rate decay and linear warmup
Tokenizer: GPT-2 BPE (via tiktoken)
Data: OpenWebText-style dataset sampled in token chunks of length 1024
Mixed precision: torch.autocast with bfloat16 on the NVIDIA L4 GPU
Gradient clipping: Applied to stabilize training
Checkpointing: Best model saved based on validation loss
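
The learning rate schedule named above (linear warmup followed by cosine decay) can be sketched as a pure function of the iteration number. The peak and minimum learning rates and the warmup length below are placeholder values chosen for illustration, not the settings used for slimGPT; only the 5,000-iteration horizon comes from this card.

```python
import math


def get_lr(it, max_lr=6e-4, min_lr=6e-5, warmup_iters=100, max_iters=5000):
    """Linear warmup to max_lr, then cosine decay to min_lr.

    max_lr, min_lr, and warmup_iters are hypothetical example values.
    """
    if it < warmup_iters:
        # Ramp linearly from max_lr / warmup_iters up to max_lr.
        return max_lr * (it + 1) / warmup_iters
    if it >= max_iters:
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # falls from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)


for it in (0, 99, 100, 2550, 4999):
    print(it, f"{get_lr(it):.2e}")
```

In a PyTorch training loop, a function like this is typically applied by writing its value into each optimizer parameter group's `"lr"` entry before the optimizer step.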

Training Runtime

Hardware: NVIDIA L4 (24 GB VRAM), 4 vCPUs, 16 GiB RAM
Training iterations: 5,000
Total training time: ~18 hours
Average time per iteration: ~13 seconds
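
The two timing figures agree with each other, as a quick calculation shows. Both inputs are the approximate values quoted above.

```python
iterations = 5_000
seconds_per_iteration = 13  # approximate average quoted above

total_hours = iterations * seconds_per_iteration / 3600
print(f"{total_hours:.1f} hours")  # 18.1 hours
```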

Evaluation

Metric	Value
Best Val Loss	3.3079
Training Iters	5,000

Assuming the validation loss is the mean per-token cross-entropy in nats, the corresponding perplexity is exp(3.3079) ≈ 27.3. For reference, a fully trained GPT-2 small achieves a perplexity of roughly 18–22 on OpenWebText, so slimGPT remains above that range, as expected for a 5,000-iteration run.
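
The conversion between loss and perplexity is a one-liner in either direction. The sketch below assumes the reported loss is the mean cross-entropy in nats per token, which is what PyTorch's cross-entropy loss returns by default.

```python
import math

val_loss = 3.3079  # best validation loss reported above
perplexity = math.exp(val_loss)
print(f"{perplexity:.1f}")  # 27.3

# Inverse direction: the loss that corresponds to a given perplexity.
for ppl in (18, 22):
    print(ppl, f"{math.log(ppl):.2f}")  # 18 -> 2.89, 22 -> 3.09
```

Read this way, the 18–22 reference range corresponds to a validation loss of roughly 2.89 to 3.09.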

[Figure: training loss]

[Figure: perplexity comparison]

Citation

If you use this model in your work, please credit:

@misc{slimgpt2026,
  author       = {Samuel Jayasingh},
  title        = {slimGPT: A 124M GPT-2-style language model trained from scratch},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/samueljayasingh/slimGPT}}
}

Credits

Inspired by Andrej Karpathy's "Let's reproduce GPT-2 (124M)" tutorial: https://www.youtube.com/watch?v=l8pRSuU81PU

Special thanks to Andrej Karpathy for making modern LLM training and implementation accessible through open educational content.

License

This model is released under the MIT License.

Safetensors checkpoint: model size 0.1B params, tensor type F32