slimGPT: 124M-Parameter GPT-Style Language Model
slimGPT is a 124-million-parameter autoregressive language model built from scratch in a clean, modular PyTorch codebase. It follows the GPT-2 small architecture and was trained entirely on consumer-accessible hardware (a single NVIDIA L4 GPU), showing that a capable small language model can be trained without large-scale infrastructure.
Model Details
| Property | Value |
|---|---|
| Architecture | GPT-2 style (decoder-only Transformer) |
| Parameters | ~124 million |
| Layers | 12 |
| Attention Heads | 12 |
| Embedding Dim | 768 |
| Context Length | 1024 tokens |
| Vocabulary | GPT-2 BPE tokenizer (50,257 tokens) |
| Training Iters | 5,000 |
| Best Val Loss | 3.3079 |
| License | MIT |
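The parameter count can be reproduced from the values in the table. The sketch below is a rough tally, assuming the standard GPT-2 layout: biases on every linear layer and LayerNorm, a 4x MLP expansion, and tied input/output embeddings so the output projection adds no parameters. The function name and the mlp_ratio argument are illustrative and are not taken from the slimGPT codebase.

```python
def gpt2_param_count(n_layer=12, n_embd=768, block_size=1024,
                     vocab_size=50257, mlp_ratio=4):
    """Count parameters of a GPT-2 style model with tied embeddings."""
    wte = vocab_size * n_embd            # token embeddings (shared with output head)
    wpe = block_size * n_embd            # positional embeddings
    ln = 2 * n_embd                      # LayerNorm scale + bias

    # Attention: fused q/k/v projection plus output projection, both with bias
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)

    # Feed-forward: expand to mlp_ratio * n_embd, then project back (assumed 4x)
    hidden = mlp_ratio * n_embd
    mlp = (n_embd * hidden + hidden) + (hidden * n_embd + n_embd)

    block = ln + attn + ln + mlp         # one pre-norm Transformer block
    return wte + wpe + n_layer * block + ln  # trailing ln is the final LayerNorm


total = gpt2_param_count()
print(f"{total:,} parameters")
```

With the default arguments this gives 124,439,808, which matches the ~124 million listed above.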
Training Infrastructure
The model was trained on a single-GPU cloud instance with the following specifications:
| Component | Specification |
|---|---|
| OS | Debian GNU/Linux 12 (Bookworm) |
| CPU | Intel Xeon @ 2.20 GHz (4 vCPUs, 2 physical cores, 2 threads/core) |
| RAM | 16 GiB |
| Storage | 60 GB NVMe |
| GPU | NVIDIA L4 |
| VRAM | 24 GB |
| NVIDIA Driver | 550.54.15 |
Training was completed without any distributed setup; a single NVIDIA L4 GPU was sufficient for the full training run.
Architecture Overview
slimGPT follows the standard GPT-2 decoder-only Transformer architecture:
- Token + positional embeddings: learned embeddings over the GPT-2 BPE vocabulary, with positional embeddings for 1024 token positions
- 12 Transformer blocks: each with multi-head causal self-attention (12 heads) and a position-wise feed-forward network
- Pre-norm design: LayerNorm applied before the attention and MLP sub-layers
- Weight tying: input embedding and output projection weights are tied
- Causal masking: autoregressive, left-to-right generation
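To make the causal masking item concrete, here is a minimal framework-free sketch of how attention weights are computed when each position may only look at itself and earlier positions. It is an illustration of the idea in plain Python, not the slimGPT implementation (which is in PyTorch); the function name is hypothetical.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix with future positions masked.

    scores[i][j] is the raw attention score of query position i for key
    position j. Position i is only allowed to attend to positions 0..i.
    """
    seq_len = len(scores)
    weights = []
    for i in range(seq_len):
        visible = scores[i][: i + 1]          # drop scores for future positions
        peak = max(visible)                   # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        row = [e / total for e in exps]
        row += [0.0] * (seq_len - i - 1)      # masked positions get zero weight
        weights.append(row)
    return weights


# With equal scores, position i spreads its weight evenly over positions 0..i.
uniform = causal_attention_weights([[0.0] * 4 for _ in range(4)])
for row in uniform:
    print(row)
```

Because row i never depends on scores for positions after i, the model can be trained on all positions of a sequence in parallel while still generating left to right at inference time.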
Usage
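from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("samueljayasingh/slimGPT")
model = AutoModelForCausalLM.from_pretrained("samueljayasingh/slimGPT")

# Tokenize the prompt; passing the attention mask along avoids a generate() warning
inputs = tokenizer("The meaning of life is", return_tensors="pt")
# Sample up to 100 new tokens with temperature and nucleus (top-p) sampling
output = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8, top_p=0.9, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))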
Pipeline API
from transformers import pipeline
generator = pipeline("text-generation", model="samueljayasingh/slimGPT")
result = generator("Once upon a time,", max_new_tokens=80, do_sample=True)
print(result[0]["generated_text"])
Serving with vLLM
pip install vllm
vllm serve "samueljayasingh/slimGPT"
curl -X POST "http://localhost:8000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "samueljayasingh/slimGPT",
"prompt": "The future of AI is",
"max_tokens": 100,
"temperature": 0.7
}'
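The same request can be sent from Python using only the standard library. This is a sketch that assumes the vLLM server from the command above is listening on localhost:8000 and returns the OpenAI-style completions format; the helper names build_completion_request and complete are illustrative.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000"):
    """Build a POST request mirroring the curl example above."""
    payload = {
        "model": "samueljayasingh/slimGPT",
        "prompt": prompt,
        "max_tokens": 100,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt):
    """Send the request and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt)) as resp:
        body = json.load(resp)
    # OpenAI-style responses put the continuation under choices[0].text
    return body["choices"][0]["text"]
```

With the server running, complete("The future of AI is") should return the sampled continuation as a string.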
Intended Use
This model is intended for:
- Research and experimentation: studying language model behavior, attention patterns, and generation dynamics at the 124M scale
- Educational purposes: understanding GPT-style architectures by working with a fully transparent, from-scratch implementation
- Prototyping: lightweight text generation for downstream tasks, fine-tuning experiments, or benchmarking
Out-of-Scope Use
- Production or safety-critical applications
- Tasks requiring factual accuracy or up-to-date knowledge
- Any use that relies on instruction-following or alignment: this is a base language model with no RLHF or instruction tuning
Limitations
- Trained for only 5,000 iterations: the model can produce coherent text continuations but has not converged to the quality of a fully trained GPT-2
- No fine-tuning or alignment: outputs are raw continuations and may be incoherent, biased, or off-topic
- English-only: trained on English text; performance on other languages has not been evaluated
- Context window of 1024 tokens: longer documents are truncated
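Because of the 1024-token limit, a long prompt has to be cropped before generation so that the prompt plus the new tokens still fit in the window. The sketch below keeps the most recent tokens, which is one common choice; the card does not say which side slimGPT's own pipeline truncates, so treat that as an assumption. The helper name crop_prompt is hypothetical.

```python
BLOCK_SIZE = 1024  # context length from the model card


def crop_prompt(token_ids, max_new_tokens=100, block_size=BLOCK_SIZE):
    """Keep the most recent tokens so prompt + generation fits in the context."""
    keep = block_size - max_new_tokens
    if keep <= 0:
        raise ValueError("max_new_tokens must be smaller than block_size")
    # Negative slicing returns the whole list when it is already short enough
    return token_ids[-keep:]


long_prompt = list(range(2000))        # stand-in for real token ids
cropped = crop_prompt(long_prompt)
print(len(cropped), cropped[0], cropped[-1])
```

Cropping from the left preserves the text closest to where generation starts, at the cost of dropping the beginning of the document.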
Training Details
The model was trained using a clean, readable PyTorch implementation with the following highlights:
- Optimizer: AdamW with cosine learning rate decay and linear warmup
- Tokenizer: GPT-2 BPE (via tiktoken)
- Data: OpenWebText-style dataset sampled in token chunks of length 1024
- Mixed precision: torch.autocast with bfloat16 on the NVIDIA L4 GPU
- Gradient clipping: Applied to stabilize training
- Checkpointing: Best model saved based on validation loss
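The learning rate schedule named in the list (linear warmup followed by cosine decay) can be sketched as a pure function of the iteration number. Only max_iters=5000 comes from this card; max_lr, min_lr, and warmup_iters are placeholder values chosen for illustration, not slimGPT's actual hyperparameters.

```python
import math


def get_lr(it, max_lr=6e-4, min_lr=6e-5, warmup_iters=100, max_iters=5000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if it < warmup_iters:
        # Ramp up linearly; (it + 1) avoids a zero learning rate at step 0
        return max_lr * (it + 1) / warmup_iters
    if it >= max_iters:
        return min_lr
    # Fraction of the decay phase completed, in [0, 1)
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)


for step in (0, 99, 100, 2550, 5000):
    print(step, get_lr(step))
```

In a training loop, the returned value would typically be written into each optimizer parameter group's lr entry before the optimizer step.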
Training Runtime
- Hardware: NVIDIA L4 (24 GB VRAM), 4 vCPUs, 16 GiB RAM
- Training iterations: 5,000
- Total training time: ~18 hours
- Average time per iteration: ~13 seconds
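The per-iteration figure follows directly from the total runtime, and the same arithmetic gives a rough estimate for runs of a different length. A small sketch, with illustrative helper names:

```python
def seconds_per_iteration(total_hours, iterations):
    """Average wall-clock seconds per training iteration."""
    return total_hours * 3600 / iterations


def remaining_hours(done, total, sec_per_iter):
    """Estimated hours left, assuming a constant time per iteration."""
    return (total - done) * sec_per_iter / 3600


sec = seconds_per_iteration(18, 5000)
print(f"{sec:.2f} s/iter")                                  # 12.96 s/iter
print(f"{remaining_hours(2500, 5000, sec):.1f} h left at the halfway point")
```

The estimate assumes throughput stays constant, which ignores evaluation and checkpointing pauses.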
Evaluation
| Metric | Value |
|---|---|
| Best Val Loss | 3.3079 |
| Training Iters | 5,000 |
Perplexity can be approximated as exp(3.3079) ≈ 27.3. For reference, a fully trained GPT-2 small achieves a perplexity of roughly 18 to 22 on OpenWebText; slimGPT sits in a reasonable range for its training budget.
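The conversion from loss to perplexity is a single exponential, assuming the reported validation loss is the mean cross-entropy per token in nats (the usual PyTorch convention). A short sketch that also shows the inverse, which is handy for comparing against published perplexities:

```python
import math


def perplexity(loss_nats):
    """Token-level perplexity from mean cross-entropy loss in nats."""
    return math.exp(loss_nats)


def loss_from_perplexity(ppl):
    """Inverse mapping: the loss that corresponds to a given perplexity."""
    return math.log(ppl)


best_val_loss = 3.3079  # from the evaluation table
print(f"perplexity: {perplexity(best_val_loss):.1f}")   # about 27.3
print(f"loss at perplexity 22: {loss_from_perplexity(22):.3f}")
```

Because the mapping is exponential, small differences in loss translate into noticeably larger differences in perplexity.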
[Figure: Training loss]
[Figure: Perplexity comparison]
Citation
If you use this model in your work, please credit:
@misc{slimgpt2026,
author = {Samuel Jayasingh},
title = {slimGPT: A 124M GPT-2-style language model trained from scratch},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/samueljayasingh/slimGPT}}
}
Credits
Inspired by Andrej Karpathy's "Let's reproduce GPT-2 (124M)" tutorial: https://www.youtube.com/watch?v=l8pRSuU81PU
Special thanks to Andrej Karpathy for making modern LLM training and implementation accessible through open educational content.
License
This model is released under the MIT License.


