pile-30m-base

A 29M-parameter decoder-only Transformer trained from scratch on a ~30M-token subset of The Pile. Built end-to-end on an Apple M2 MacBook Air (8GB unified memory) using PyTorch MPS — no CUDA, no pretrained weights.

This is a base / pretraining checkpoint, not instruction-tuned. Expect locally fluent English that loses coherence after a sentence or two — typical for this scale and token budget.

Architecture

Component	Value
Parameters	29,027,968
Embedding dim (`n_embed`)	256
Attention heads (`n_head`)	8
Transformer blocks (`N_BLOCKS`)	4
Context length	256
Vocab size	50,304 (GPT-2 tokenizer, padded)
Positional embeddings	Learned

Custom implementation (not a Hugging Face transformers class). Token + learned-position embeddings → 4 pre-norm Transformer blocks → final LayerNorm → untied LM head (output weights are not shared with the token embedding).
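The reported parameter count can be reproduced from the table with a little arithmetic. The sketch below is one layer layout that adds up to 29,027,968. It assumes a 4x MLP expansion, bias-free query/key/value projections, biases on the remaining linear layers (including the LM head), and an LM head with its own weight matrix. Those details are assumptions chosen to match the total, not taken from the model code.

```python
# Hypothetical parameter breakdown; bias choices and the 4x MLP width are assumptions.
n_embed, n_blocks, context_length, vocab_size = 256, 4, 256, 50_304

def count_parameters() -> int:
    tok_emb = vocab_size * n_embed
    pos_emb = context_length * n_embed
    layer_norm = 2 * n_embed                       # scale + shift
    qkv = 3 * n_embed * n_embed                    # assumed bias-free
    attn_out = n_embed * n_embed + n_embed         # weight + bias
    mlp_up = n_embed * 4 * n_embed + 4 * n_embed   # weight + bias
    mlp_down = 4 * n_embed * n_embed + n_embed     # weight + bias
    block = 2 * layer_norm + qkv + attn_out + mlp_up + mlp_down
    lm_head = n_embed * vocab_size + vocab_size    # separate weights + bias
    return tok_emb + pos_emb + n_blocks * block + layer_norm + lm_head

total = count_parameters()
print(f"{total:,}")  # 29,027,968
```

Under these assumptions the two vocabulary-sized matrices (token embedding and LM head) account for roughly 25.8M of the 29M parameters, leaving about 3.2M in the four blocks.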

Training

Setting	Value
Optimizer	AdamW
LR	5e-4, decayed to 5e-5 at step 11,000
Steps	15,000
Batch size	16
Train context	256
Grad clip	1.0 (max norm)
Tokens seen	~61M (15,000 steps × 16 × 256 = 61.44M; ≈2 tokens/param)
Hardware	Apple M2, MPS backend
Wall time	~3.5 hours
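To make the schedule and the token count concrete, here is a small sketch. It reads the LR entry as a single drop from 5e-4 to 5e-5 at step 11,000, which is an assumption: the table does not say whether the decay is abrupt or gradual. Tokens seen is derived as steps × batch size × context. The function name `lr_at` is illustrative.

```python
STEPS, BATCH_SIZE, CONTEXT = 15_000, 16, 256
BASE_LR, FINAL_LR, DROP_STEP = 5e-4, 5e-5, 11_000

def lr_at(step: int) -> float:
    """Learning rate at a given step, assuming one abrupt drop at DROP_STEP."""
    return BASE_LR if step < DROP_STEP else FINAL_LR

tokens_per_step = BATCH_SIZE * CONTEXT   # 4,096 tokens per optimizer step
tokens_seen = STEPS * tokens_per_step    # 61,440,000 tokens in total

print(lr_at(0), lr_at(DROP_STEP), f"{tokens_seen:,}")
```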

Results: final train loss 3.918, dev loss 4.006 (perplexity ≈ 55).
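Perplexity is the exponential of the mean per-token cross-entropy, so the figure above can be checked directly from the losses. A quick check, assuming the reported losses are in nats (natural-log cross-entropy, as PyTorch's cross-entropy loss produces by default):

```python
import math

train_loss, dev_loss = 3.918, 4.006

def perplexity(loss_nats: float) -> float:
    """Convert a mean cross-entropy loss in nats to perplexity."""
    return math.exp(loss_nats)

print(round(perplexity(dev_loss), 1))    # about 54.9
print(round(perplexity(train_loss), 1))  # about 50.3
```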

Note: at ~2 tokens/parameter this is well under the Chinchilla-optimal ~20:1 ratio, so the model is data-limited rather than capacity-limited — dev loss had largely flattened by the end of training.
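For a sense of scale, the gap to the ~20:1 rule of thumb can be worked out from the numbers in this card. This is back-of-the-envelope arithmetic that keeps the same batch size and context length, not a training recommendation:

```python
PARAMS = 29_027_968
TOKENS_PER_STEP = 16 * 256                 # batch size x context
TOKENS_SEEN = 15_000 * TOKENS_PER_STEP     # 61,440,000

ratio = TOKENS_SEEN / PARAMS               # about 2.1 tokens per parameter
target_tokens = 20 * PARAMS                # tokens needed for a 20:1 ratio
target_steps = -(-target_tokens // TOKENS_PER_STEP)  # ceiling division

print(f"{ratio:.2f} tokens/param")
print(f"20:1 needs {target_tokens:,} tokens, about {target_steps:,} steps")
```

That target is roughly 9.4 times the token count actually used for this checkpoint.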

Checkpoint format

The .pt file is a dict, not a bare state_dict:

{
    "model_state_dict": ...,
    "optimizer_state_dict": ...,
    "losses": [...],
    "train_loss": 3.918,
    "dev_loss": 4.006,
    "steps": 15000,
    "device": "mps",
    "pytorch_version": "...",
    "cuda_version": None
}
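Because the file bundles optimizer state and a loss history alongside the weights, it can help to look at the metadata before building the model. A minimal sketch follows; the helper name `summarize_checkpoint` is made up for illustration, and it relies only on the keys listed above.

```python
def summarize_checkpoint(ckpt: dict) -> dict:
    """Return the small metadata fields, leaving the heavy entries out."""
    heavy = {"model_state_dict", "optimizer_state_dict", "losses"}
    summary = {k: v for k, v in ckpt.items() if k not in heavy}
    summary["n_logged_losses"] = len(ckpt.get("losses", []))
    summary["n_weight_tensors"] = len(ckpt.get("model_state_dict", {}))
    return summary

# Stand-in dict with the documented keys; with the real file you would pass
# torch.load(ckpt_path, map_location="cpu") instead.
example = {
    "model_state_dict": {}, "optimizer_state_dict": {}, "losses": [],
    "train_loss": 3.918, "dev_loss": 4.006, "steps": 15000,
    "device": "mps", "cuda_version": None,
}
print(summarize_checkpoint(example))
```

For inference only `model_state_dict` is needed; the optimizer state matters only when resuming training.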

Usage

This model needs its original architecture code to load. The model definition (Transformer + its Block) is not bundled here — clone it from the training repo, or drop in modeling.py if included in this repo.

import tiktoken
import torch
from huggingface_hub import hf_hub_download

# The model definition lives in the training repo, not in this one;
# adjust the import path to match your checkout.
from src.models.transformer import Transformer

# Download the checkpoint and load it on CPU.
ckpt_path = hf_hub_download("Suyash11/pile-30m-base", "pile-30m-base.pt")
ckpt = torch.load(ckpt_path, map_location="cpu")

# Hyperparameters must match the Architecture table.
model = Transformer(
    n_head=8, n_embed=256, context_length=256,
    vocab_size=50304, N_BLOCKS=4,
)
model.load_state_dict(ckpt["model_state_dict"])  # weights sit under this key
model.eval()

# Generate: tokenize with the GPT-2 tokenizer, sample without tracking gradients.
enc = tiktoken.get_encoding("gpt2")
idx = torch.tensor([enc.encode("The meaning of life is")], dtype=torch.long)
with torch.no_grad():
    out = model.generate(idx, max_new_tokens=50)
print(enc.decode(out[0].tolist()))
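One thing to watch with learned positional embeddings: the model has no position entries beyond index 255, so a prompt longer than 256 tokens has to be trimmed before it reaches the model. Whether `generate` already does this is not stated here, so the sketch below is a defensive helper (the name `crop_to_context` is hypothetical) that keeps the most recent tokens.

```python
CONTEXT_LENGTH = 256  # from the Architecture table

def crop_to_context(token_ids: list[int], context_length: int = CONTEXT_LENGTH) -> list[int]:
    """Keep only the last `context_length` tokens of a prompt."""
    if context_length <= 0:
        raise ValueError("context_length must be positive")
    return token_ids[-context_length:]

long_prompt = list(range(300))          # stand-in for 300 token ids
cropped = crop_to_context(long_prompt)
print(len(cropped), cropped[0], cropped[-1])  # 256 44 299
```

With a tensor prompt of shape (batch, time), the equivalent is `idx[:, -256:]` applied before calling the model.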

Limitations

Tiny model, tiny token budget — not useful for downstream tasks as-is.
Base checkpoint: no instruction tuning, alignment, or safety filtering.
Trained on a small Pile subset; inherits that data's biases and artifacts.
Intended for education / research into from-scratch LLM training.

License

Apache-2.0.
