pile-13m-base
A 13.1M-parameter decoder-only transformer I pretrained from random initialization on a subset of the Pile (uncopyrighted). Trained end-to-end on a single Apple M2 MacBook Air (8GB unified memory) using PyTorch's MPS backend: no CUDA, no cloud.
This repository contains the trained weights only (a raw state_dict). It is a
base (pretraining-only) model: a raw next-token text continuer. It has not been
instruction-tuned or aligned; it does not follow instructions or hold a conversation.
Intended as a learning artifact and a controlled, fully documented baseline.
Architecture
| Property | Value |
|---|---|
| Parameters | 13,142,656 |
| Type | Decoder-only transformer |
| Transformer blocks | 1 |
| Embedding dimension | 128 |
| Attention heads | 8 |
| Context length | 128 tokens |
| Vocabulary | 50,304 (GPT-2 BPE / tiktoken) |
The embedding/output table (50,304 × 128) accounts for most of the parameter count.
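As a quick check on that statement, the sketch below works out the size of one vocabulary-sized table from the values in the architecture table. Whether the input embedding and the output projection are stored as one shared table or as two separate ones is not stated in this card, so both cases are shown as assumptions; the variable names are illustrative.

```python
# Values taken from the architecture table above.
TOTAL_PARAMS = 13_142_656
VOCAB_SIZE = 50_304
EMBED_DIM = 128

# One vocabulary-sized table (token embedding OR output projection).
table_params = VOCAB_SIZE * EMBED_DIM

# Share of the total if the tables are tied (one copy) or untied (two copies).
# Which case applies is an assumption; the card does not state it.
tied_share = table_params / TOTAL_PARAMS
untied_share = 2 * table_params / TOTAL_PARAMS

# What is left in the untied case: the transformer block, norms, biases, etc.
remaining_if_untied = TOTAL_PARAMS - 2 * table_params

print(f"one table:           {table_params:,} parameters")
print(f"share if tied:       {tied_share:.1%}")
print(f"share if untied:     {untied_share:.1%}")
print(f"left over if untied: {remaining_if_untied:,}")
```

A single table is a little under half of the total, while two tables would cover about 98% of it, so the statement above fits the untied reading best.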
Training
| Property | Value |
|---|---|
| Data | Pile-uncopyrighted, top 50,000 documents |
| Tokens seen | ~100M (≈1.5 passes over the data) |
| Steps | 200,000 |
| Batch / context | 32 sequences × 128 tokens |
| Hardware | Apple M2, 8GB, MPS backend |
| Wall-clock | ~5h 8m |
| Throughput | ~5,800 tokens/s sustained |
| Final train loss | 4.555 |
| Final dev loss | 4.859 (perplexity ≈ 129) |
The train/dev gap held at ~0.30 through the second half of training, which suggests the model was generalizing rather than memorizing at this data scale.
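Several figures in the training table can be cross-checked against each other. The sketch below is only arithmetic on the reported values; it assumes perplexity is exp(loss) with the loss in nats, which is the usual cross-entropy convention in PyTorch but is not stated in this card. All variable names are illustrative.

```python
import math

# Reported values from the training table above.
throughput_tok_per_s = 5_800
wall_clock_s = 5 * 3600 + 8 * 60   # 5h 8m
steps = 200_000
batch_sequences = 32
context_len = 128
train_loss = 4.555
dev_loss = 4.859

# Tokens processed = sustained throughput x wall-clock time.
tokens_from_throughput = throughput_tok_per_s * wall_clock_s

# Tokens implied by the step count at the stated batch shape.
tokens_per_step = batch_sequences * context_len
tokens_from_steps = steps * tokens_per_step

# Perplexity, assuming the loss is mean cross-entropy in nats.
dev_perplexity = math.exp(dev_loss)

# Generalization gap between dev and train loss.
gap = dev_loss - train_loss

print(f"tokens (throughput x time): {tokens_from_throughput:,}")
print(f"tokens (steps x batch):     {tokens_from_steps:,}")
print(f"dev perplexity:             {dev_perplexity:.1f}")
print(f"dev - train gap:            {gap:.3f}")
```

Throughput times wall-clock gives about 107M tokens, in line with the ~100M reported, and exp(4.859) rounds to 129. The step count multiplied by a full 32 × 128 batch gives about 819M tokens, which does not match; the card does not explain the difference, so the step count and batch shape are best read as reported rather than derived.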
Intended use & limitations
Intended: education, experimentation, and use as a known baseline for evaluation harnesses.
Known limitations:
- No long-range coherence. With 128-token context and short training windows, output holds local grammar for ~10 tokens then drifts in topic.
- Capacity-bound. At 13M parameters the loss plateaus near 4.5–4.6; lower loss needs more parameters or longer context, not more steps.
- Training-data contamination in output. The Pile contains source code and LaTeX; the model occasionally emits code/markup fragments mid-prose.
- No alignment. No SFT/RLHF, no safety tuning. Not for production use.
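The 128-token context limit also shapes how a sampling loop has to be written: at each step only the most recent 128 token ids can be fed to the model. A minimal sketch of that windowing follows. The `generate` function and the `next_token` callable are hypothetical stand-ins for whatever model wrapper is used; they are not part of this repository.

```python
from typing import Callable, List

CONTEXT_LEN = 128  # context length from the architecture table


def generate(
    prompt_ids: List[int],
    next_token: Callable[[List[int]], int],
    max_new_tokens: int,
) -> List[int]:
    """Append max_new_tokens ids, showing the model at most CONTEXT_LEN ids per step."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        # Keep only the most recent tokens; anything older is invisible to the model.
        window = ids[-CONTEXT_LEN:]
        ids.append(next_token(window))
    return ids
```

Here `next_token` would wrap a forward pass plus a sampling rule (greedy, top-k, and so on); the windowing is the only part this sketch is meant to illustrate.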
How to load
This file is a raw state_dict and is not runnable on its own; it needs a matching
model class. Instantiate a decoder-only transformer with the architecture values in the
table above, then load model_state_dict from pile-13m-base.pt. Compatible
architecture code (MIT-licensed): https://github.com/FareedKhan-dev/train-llm-from-scratch
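Before wiring the weights into a model class, it can help to list the tensors in the checkpoint and compare their shapes with the architecture table. The sketch below assumes the file is a PyTorch checkpoint that is either the state_dict itself or a dict holding it under the model_state_dict key mentioned above. The helper names `load_state_dict_from_file` and `summarize` are illustrative and do not come from the linked repository.

```python
import math


def load_state_dict_from_file(path="pile-13m-base.pt"):
    """Load the checkpoint on CPU and return the name -> tensor mapping."""
    import torch  # imported here so summarize() can be used without PyTorch

    # weights_only=True restricts unpickling to tensors and basic containers
    # (available in recent PyTorch releases).
    ckpt = torch.load(path, map_location="cpu", weights_only=True)
    # Assumption: weights sit under "model_state_dict"; otherwise the file
    # is treated as the state_dict itself.
    if isinstance(ckpt, dict) and "model_state_dict" in ckpt:
        return ckpt["model_state_dict"]
    return ckpt


def summarize(state):
    """Return (rows, total), where each row is (name, shape, element_count)."""
    rows = [(name, tuple(t.shape), math.prod(t.shape)) for name, t in state.items()]
    total = sum(count for _, _, count in rows)
    return rows, total


# Usage (requires PyTorch and the checkpoint file):
#   rows, total = summarize(load_state_dict_from_file())
#   for name, shape, count in rows:
#       print(f"{name:40s} {str(shape):20s} {count:>12,}")
#   print(f"total: {total:,}")
```

The tensor names printed this way show which parameter names the model class has to expose for load_state_dict to succeed. If the checkpoint stores non-trainable buffers (a cached attention mask, for example), they are counted too, so the total may come out higher than the reported 13,142,656.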
Provenance
Weights are the output of my own training run on the hardware described above.
Data lineage, methodology, loss trajectory, and limitations are reported from that run.
Tokenizer: GPT-2 BPE (r50k_base) via tiktoken. The Pile-uncopyrighted subset carries
its own terms.
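The tokenizer and the vocabulary row differ in size: r50k_base defines 50,257 tokens, while the model's table has 50,304 rows. Rounding the vocabulary up to a multiple of 64 is a common efficiency trick and yields exactly that number. That this is the reason here is an assumption, and `pad_vocab` is a hypothetical helper used only to show the arithmetic.

```python
def pad_vocab(n_tokens: int, multiple: int = 64) -> int:
    """Round n_tokens up to the nearest multiple (ceiling division)."""
    return -(-n_tokens // multiple) * multiple


TOKENIZER_VOCAB = 50_257                   # size of tiktoken's r50k_base
MODEL_VOCAB = pad_vocab(TOKENIZER_VOCAB)   # rows in the model's table

# Ids in this range would have no tokenizer entry behind them.
UNUSED_IDS = range(TOKENIZER_VOCAB, MODEL_VOCAB)

print(MODEL_VOCAB, len(UNUSED_IDS))
```

If that reading is right, the 47 highest ids cannot be decoded by the tokenizer, so a sampling loop may want to exclude them before decoding.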