pile-13m-base

A 13.1M-parameter decoder-only transformer I pretrained from random initialization on a subset of the Pile (uncopyrighted). Trained end-to-end on a single Apple M2 MacBook Air (8GB unified memory) using PyTorch's MPS backend, with no CUDA and no cloud compute.

This repository contains the trained weights only (a raw state_dict). It is a base (pretraining-only) model: a raw next-token text continuer. It has not been instruction-tuned or aligned, so it does not follow instructions or hold a conversation. Intended as a learning artifact and a controlled, fully documented baseline.

Architecture

| Property | Value |
| --- | --- |
| Parameters | 13,142,656 |
| Type | Decoder-only transformer |
| Transformer blocks | 1 |
| Embedding dimension | 128 |
| Attention heads | 8 |
| Context length | 128 tokens |
| Vocabulary | 50,304 (GPT-2 BPE / tiktoken) |

The embedding/output table (50,304 × 128) accounts for most of the parameter count.
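As a sanity check on the table, the reported total can be reproduced with a few lines of arithmetic. The layer layout below is an assumption, not something this card states: an untied input embedding and output head (the head with a bias), learned positional embeddings, bias-free query/key/value projections, a 4x-wide MLP, two LayerNorms inside the block, and one final LayerNorm. Under that assumed layout the sum matches the 13,142,656 in the table, and the two vocabulary-sized tables hold roughly 98% of it.

```python
# Parameter count under an ASSUMED layer layout (not confirmed by this card).
VOCAB, D_MODEL, CONTEXT, N_BLOCKS = 50_304, 128, 128, 1


def count_parameters(vocab=VOCAB, d=D_MODEL, context=CONTEXT, n_blocks=N_BLOCKS):
    """Return a name -> parameter-count breakdown for the assumed layout."""
    block = (
        3 * d * d              # query, key, value projections, no bias (assumed)
        + d * d + d            # attention output projection + bias
        + d * 4 * d + 4 * d    # MLP up-projection + bias (4x width assumed)
        + 4 * d * d + d        # MLP down-projection + bias
        + 2 * 2 * d            # two LayerNorms, weight + bias each
    )
    return {
        "token_embedding": vocab * d,
        "position_embedding": context * d,   # learned positions (assumed)
        "blocks": n_blocks * block,
        "final_layernorm": 2 * d,
        "output_head": vocab * d + vocab,    # untied head with bias (assumed)
    }


parts = count_parameters()
total = sum(parts.values())
vocab_share = (parts["token_embedding"] + parts["output_head"]) / total

for name, n in parts.items():
    print(f"{name:>20}: {n:>10,}")
print(f"{'total':>20}: {total:>10,}")
print(f"vocabulary-sized tables: {vocab_share:.1%} of all parameters")
```

If the actual checkpoint ties the embedding and output weights, or uses a different bias convention, the breakdown would differ even if the total happened to agree, so the tensor shapes in the file remain the authority.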

Training

| Property | Value |
| --- | --- |
| Data | Pile-uncopyrighted, top 50,000 documents |
| Tokens seen | ~100M (≈1.5 passes over the data) |
| Steps | 200,000 |
| Batch / context | 32 sequences × 128 tokens |
| Hardware | Apple M2, 8GB, MPS backend |
| Wall-clock | ~5h 8m |
| Throughput | ~5,800 tokens/s sustained |
| Final train loss | 4.555 |
| Final dev loss | 4.859 (perplexity ≈ 129) |
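The rows above can be cross-checked against each other using only the reported figures. In this sketch, throughput times wall-clock lands near the ~100M tokens listed, while steps times batch times context comes out roughly eight times larger. The card does not say which row is counted differently (micro-batches, accumulation, or something else), so the script only surfaces the numbers and draws no conclusion.

```python
# Cross-check the training table using only the figures it reports.
STEPS = 200_000
BATCH_SEQUENCES = 32
CONTEXT_TOKENS = 128
THROUGHPUT_TOK_PER_S = 5_800
WALL_CLOCK_S = 5 * 3600 + 8 * 60   # 5h 8m expressed in seconds

# Token count implied by the step and batch rows.
tokens_from_steps = STEPS * BATCH_SEQUENCES * CONTEXT_TOKENS

# Token count implied by the throughput and wall-clock rows.
tokens_from_throughput = THROUGHPUT_TOK_PER_S * WALL_CLOCK_S

# Tokens each step would need to process for the two views to agree.
tokens_per_step_implied = tokens_from_throughput / STEPS

print(f"steps x batch x context : {tokens_from_steps:,}")
print(f"throughput x wall-clock : {tokens_from_throughput:,}")
print(f"implied tokens per step : {tokens_per_step_implied:.0f}")
```

When comparing against this model, the token count backed by throughput and wall-clock is the safer figure to quote until the step and batch rows are checked against the training log.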

The train/dev gap held at ~0.30 through the second half of training, which suggests the model generalized rather than memorized at this data scale.
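For readers who want to reproduce the derived numbers, perplexity is the exponential of the mean cross-entropy loss. The sketch below assumes the losses are reported in nats (natural-log cross-entropy), which is consistent with the table pairing a dev loss of 4.859 with a perplexity of about 129. The uniform-guess line is added only as a reference point for scale.

```python
import math

TRAIN_LOSS = 4.555   # final train loss from the table, assumed to be in nats
DEV_LOSS = 4.859     # final dev loss from the table, assumed to be in nats
VOCAB = 50_304       # vocabulary size from the architecture table


def perplexity(loss_nats: float) -> float:
    """Perplexity is exp(mean cross-entropy) when the loss is in nats."""
    return math.exp(loss_nats)


gap = DEV_LOSS - TRAIN_LOSS
uniform_loss = math.log(VOCAB)   # loss of a model that guesses uniformly

print(f"dev perplexity   : {perplexity(DEV_LOSS):.1f}")
print(f"train perplexity : {perplexity(TRAIN_LOSS):.1f}")
print(f"train/dev gap    : {gap:.3f} nats")
print(f"uniform baseline : {uniform_loss:.2f} nats")
```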

Intended use & limitations

Intended: education, experimentation, and a known baseline for evaluation harnesses.

Known limitations:

  • No long-range coherence. With 128-token context and short training windows, output holds local grammar for ~10 tokens then drifts in topic.
  • Capacity-bound. At 13M parameters the loss plateaus near 4.5–4.6; lower loss needs more parameters or longer context, not more steps.
  • Training-data contamination in output. The Pile contains source code and LaTeX; the model occasionally emits code/markup fragments mid-prose.
  • No alignment. No SFT/RLHF, no safety tuning. Not for production use.
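The 128-token context has a practical consequence when generating text: anything older than the last 128 tokens has to be cropped before each forward pass. A minimal sketch of such a loop follows. Here `logits_fn` is a hypothetical stand-in for the loaded model (any callable mapping a list of token ids to one logit per vocabulary entry), and the toy function at the bottom exists only to make the example runnable. None of this is the author's generation code.

```python
import math
import random
from typing import Callable, List, Sequence

CONTEXT_LEN = 128  # context length from the architecture table


def sample_next(logits: Sequence[float], temperature: float, rng: random.Random) -> int:
    """Pick one token id; temperature <= 0 means greedy argmax."""
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                                  # subtract max for stability
    weights = [math.exp(x - peak) for x in scaled]      # unnormalized softmax
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def continue_tokens(
    logits_fn: Callable[[List[int]], Sequence[float]],  # hypothetical model wrapper
    prompt: Sequence[int],
    n_new: int,
    temperature: float = 0.8,
    seed: int = 0,
) -> List[int]:
    """Append n_new sampled tokens, feeding only the last CONTEXT_LEN each step."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(n_new):
        window = tokens[-CONTEXT_LEN:]                  # crop to the model's context
        tokens.append(sample_next(logits_fn(window), temperature, rng))
    return tokens


def toy_logits(window: List[int]) -> List[float]:
    """Toy 5-token 'model' that prefers the successor of the last token."""
    return [5.0 if i == (window[-1] + 1) % 5 else 0.0 for i in range(5)]


print(continue_tokens(toy_logits, [0], n_new=4, temperature=0.0))
```

With the real weights, `logits_fn` would wrap a forward pass and return the logits at the final position.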

How to load

This file is a raw state_dict and is not runnable on its own; it needs a matching model class. Instantiate a decoder-only transformer with the architecture values in the table above, then load model_state_dict from pile-13m-base.pt. Compatible architecture code (MIT-licensed): https://github.com/FareedKhan-dev/train-llm-from-scratch
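Before wiring up a model class, it can help to list the tensor names and shapes in the checkpoint, since those are what the class has to match. The sketch below is one way to do that. It assumes the weights are either the top-level object or nested under a `model_state_dict` key, and it handles both because the exact nesting is not spelled out here. The helper names are illustrative.

```python
import math
import os
from typing import Dict, Mapping, Tuple

Shape = Tuple[int, ...]
CHECKPOINT_PATH = "pile-13m-base.pt"


def unwrap_state_dict(checkpoint: Mapping) -> Mapping:
    """Return the weights mapping, whether or not it is nested under a key."""
    # ASSUMPTION: weights may sit under "model_state_dict"; if that key is
    # absent, the loaded object is treated as the state_dict itself.
    return checkpoint.get("model_state_dict", checkpoint)


def load_shapes(path: str = CHECKPOINT_PATH) -> Dict[str, Shape]:
    """Read the checkpoint on CPU and return {tensor name: shape}."""
    import torch  # imported lazily so the other helpers work without PyTorch

    checkpoint = torch.load(path, map_location="cpu", weights_only=True)
    state = unwrap_state_dict(checkpoint)
    return {name: tuple(tensor.shape) for name, tensor in state.items()}


def total_parameters(shapes: Mapping[str, Shape]) -> int:
    """Sum element counts; note that buffers stored in the file are counted too."""
    return sum(math.prod(shape) for shape in shapes.values())


# Only runs when the weights file is present in the working directory.
if os.path.exists(CHECKPOINT_PATH):
    shapes = load_shapes()
    for name, shape in shapes.items():
        print(f"{name:<50} {shape}")
    print(f"total elements: {total_parameters(shapes):,}")
```

Once a model class produces the same names and shapes, PyTorch's `model.load_state_dict(...)` with the unwrapped mapping should accept the weights. A mismatch in names or shapes raises an error that lists the offending keys.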

Provenance

Weights are the output of my own training run on the hardware described above. Data lineage, methodology, loss trajectory, and limitations are reported from that run. Tokenizer: GPT-2 BPE (r50k_base) via tiktoken. The Pile-uncopyrighted subset carries its own terms.
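One detail worth knowing when pairing the tokenizer with the weights: the GPT-2 / r50k_base vocabulary has 50,257 entries, while the architecture table lists 50,304. A likely explanation, assumed here rather than stated in this card, is that the vocabulary dimension was padded up to a multiple of 64. If so, the top 47 ids have embedding rows but never occur in tokenizer output, and a sampler should avoid emitting them because the tokenizer has no text for them.

```python
GPT2_BPE_VOCAB = 50_257   # number of tokens in the GPT-2 / r50k_base vocabulary
MODEL_VOCAB = 50_304      # vocabulary size from the architecture table


def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the next multiple (ceiling division, then scale back)."""
    return -(-n // multiple) * multiple


# ASSUMPTION: the model's vocabulary is the tokenizer's, padded to 64.
padded = pad_to_multiple(GPT2_BPE_VOCAB)

# Ids the model can score but the tokenizer never produces.
unused_ids = range(GPT2_BPE_VOCAB, MODEL_VOCAB)

print(f"padded vocabulary : {padded:,}")
print(f"matches the table : {padded == MODEL_VOCAB}")
print(f"unused token ids  : {len(unused_ids)}")
```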

