pile-13m-base

A 13.1M-parameter decoder-only transformer I pretrained from random initialization on a subset of Pile-uncopyrighted. Trained end-to-end on a single Apple M2 MacBook Air (8 GB unified memory) using PyTorch's MPS backend: no CUDA, no cloud.

This repository contains the trained weights only (a raw state_dict). It is a base (pretraining-only) model: a raw next-token text continuer. It has not been instruction-tuned or aligned, so it does not follow instructions or hold a conversation. It is intended as a learning artifact and a controlled, fully documented baseline.

Architecture

Property	Value
Parameters	13,142,656
Type	Decoder-only transformer
Transformer blocks	1
Embedding dimension	128
Attention heads	8
Context length	128 tokens
Vocabulary	50,304 (GPT-2 BPE / tiktoken)

The embedding/output table (50,304 × 128) accounts for most of the parameter count.
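
As a rough check on that claim, the arithmetic can be sketched as below. The snippet assumes the input embedding and the output projection are stored as two separate (untied) 50,304 × 128 tables. The card does not state this explicitly, but a single table would cover only about half of the 13,142,656 total. Variable names are illustrative.

```python
# Back-of-the-envelope parameter accounting from the architecture table.
vocab_size = 50_304
embed_dim = 128
total_params = 13_142_656

# Size of one vocabulary-sized table (embedding or output projection).
table_params = vocab_size * embed_dim

# Assumption: embedding and output projection are untied, i.e. two tables.
vocab_tables = 2 * table_params

# Whatever is left covers the transformer block, positions, norms, biases.
remainder = total_params - vocab_tables
share = vocab_tables / total_params

print(f"one table:       {table_params:,}")
print(f"two tables:      {vocab_tables:,} ({share:.1%} of total)")
print(f"everything else: {remainder:,}")
```

Under that assumption the two vocabulary-sized tables hold 12,877,824 parameters, roughly 98% of the model, and about 265K parameters remain for everything else.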

Training

Property	Value
Data	Pile-uncopyrighted, top 50,000 documents
Tokens seen	~100M (≈1.5 passes over the data)
Steps	200,000
Batch / context	32 sequences × 128 tokens
Hardware	Apple M2, 8GB, MPS backend
Wall-clock	~5h 8m
Throughput	~5,800 tokens/s sustained
Final train loss	4.555
Final dev loss	4.859 (perplexity ≈ 129)

The train/dev gap held at ~0.30 through the second half of training, which suggests the model generalized rather than memorized at this data scale.
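
The figures in the table can be cross-checked with a few lines of arithmetic. This is a minimal sketch that assumes the reported losses are mean per-token cross-entropy in nats (the PyTorch default), so that perplexity is exp(loss). It also estimates tokens processed from the sustained throughput and wall-clock time.

```python
import math

train_loss = 4.555
dev_loss = 4.859

# Assumption: losses are mean cross-entropy per token, natural log.
perplexity = math.exp(dev_loss)
gap = dev_loss - train_loss

# Tokens processed, estimated from sustained throughput and wall-clock time.
wall_clock_s = 5 * 3600 + 8 * 60      # ~5h 8m
tokens_est = 5_800 * wall_clock_s     # ~5,800 tokens/s

print(f"dev perplexity:   {perplexity:.1f}")
print(f"train/dev gap:    {gap:.3f}")
print(f"estimated tokens: {tokens_est / 1e6:.0f}M")
```

The perplexity rounds to 129 and the gap to 0.30, matching the reported values. The throughput estimate comes to about 107M tokens, in line with the ~100M figure.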

Intended use & limitations

Intended: education, experimentation, a known baseline for evaluation harnesses.

Known limitations:

No long-range coherence. With 128-token context and short training windows, output holds local grammar for ~10 tokens then drifts in topic.
Capacity-bound. At 13M parameters the loss plateaus near 4.5–4.6; lower loss needs more parameters or longer context, not more steps.
Training-data contamination in output. The Pile contains source code and LaTeX; the model occasionally emits code/markup fragments mid-prose.
No alignment. No SFT/RLHF, no safety tuning. Not for production use.

How to load

This file is a raw state_dict and is not runnable on its own — it needs a matching model class. Instantiate a decoder-only transformer with the architecture values in the table above, then load model_state_dict from pile-13m-base.pt. Compatible architecture code (MIT-licensed): https://github.com/FareedKhan-dev/train-llm-from-scratch
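
A minimal loading sketch, under stated assumptions: the helper names below are hypothetical, and the code handles both a checkpoint dict that nests the weights under a model_state_dict key and a bare parameter mapping, since the exact layout is not spelled out here. The model class itself is not provided and must come from compatible architecture code.

```python
def unwrap_state_dict(checkpoint):
    """Return the parameter mapping, nested under 'model_state_dict' or not."""
    if isinstance(checkpoint, dict) and "model_state_dict" in checkpoint:
        return checkpoint["model_state_dict"]
    return checkpoint


def count_parameters(state_dict):
    """Sum element counts over all tensors (may include non-trainable buffers)."""
    return sum(tensor.numel() for tensor in state_dict.values())


def load_weights(path="pile-13m-base.pt"):
    """Load the checkpoint on CPU and return its parameter mapping."""
    # Imported here so the helpers above work without PyTorch installed.
    import torch

    # weights_only=True restricts unpickling to tensors and plain containers;
    # it may need to be dropped if the file stores other Python objects.
    checkpoint = torch.load(path, map_location="cpu", weights_only=True)
    return unwrap_state_dict(checkpoint)


# Typical use (needs the weights file and a matching model class):
#   state_dict = load_weights()
#   for name, tensor in state_dict.items():
#       print(name, tuple(tensor.shape))
#   model.load_state_dict(state_dict)   # `model` is your own instance
```

Printing the tensor names and shapes first is a quick way to confirm that the model class you instantiate matches the checkpoint before calling load_state_dict.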

Provenance

Weights are the output of my own training run on the hardware described above. Data lineage, methodology, loss trajectory, and limitations are reported from that run. Tokenizer: GPT-2 BPE (r50k_base) via tiktoken. The Pile-uncopyrighted subset carries its own terms.


Suyash11/pile-13m-base