pile-30m-base

A 29M-parameter decoder-only Transformer trained from scratch on a ~30M-token subset of The Pile. Built end-to-end on an Apple M2 MacBook Air (8 GB unified memory) using PyTorch MPS: no CUDA, no pretrained weights.

This is a base / pretraining checkpoint, not instruction-tuned. Expect locally fluent English that loses coherence after a sentence or two, which is typical for this scale and token budget.

Architecture

| Component | Value |
|---|---|
| Parameters | 29,027,968 |
| Embedding dim (n_embed) | 256 |
| Attention heads (n_head) | 8 |
| Transformer blocks (N_BLOCKS) | 4 |
| Context length | 256 |
| Vocab size | 50,304 (GPT-2 tokenizer, padded) |
| Positional embeddings | Learned |

Custom implementation (not a transformers class). Token + learned-position embeddings → 4 pre-norm Transformer blocks → final LayerNorm → untied LM head (weights not shared with the token embedding).
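The parameter total can be sanity-checked from the table above. The sketch below is a guess at the layer layout, not something taken from the model code: it assumes a 4x MLP expansion, bias-free query/key/value projections, and biases on every other linear layer including the LM head. The function name `count_params` is hypothetical. Under those assumptions the count matches the reported figure.

```python
def count_params(n_embed=256, n_blocks=4, context_length=256,
                 vocab_size=50304, mlp_ratio=4):
    """Count parameters for an assumed GPT-style layout (see assumptions above)."""
    d = n_embed
    hidden = mlp_ratio * d                      # assumed 4x MLP expansion

    tok_emb = vocab_size * d                    # token embedding table
    pos_emb = context_length * d                # learned positional embeddings
    layer_norm = 2 * d                          # scale + shift

    qkv = 3 * d * d                             # assumed: no bias on Q/K/V
    attn_proj = d * d + d                       # attention output projection + bias
    mlp = (d * hidden + hidden) + (hidden * d + d)
    block = 2 * layer_norm + qkv + attn_proj + mlp

    lm_head = d * vocab_size + vocab_size       # untied head, assumed to have a bias
    return tok_emb + pos_emb + n_blocks * block + layer_norm + lm_head


total = count_params()
print(f"{total:,}")  # 29,027,968
```

Under the same assumptions, the token embedding and LM head together hold about 89% of the parameters; the four Transformer blocks account for only about 3.2M.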

Training

| Setting | Value |
|---|---|
| Optimizer | AdamW |
| LR | 5e-4, decayed to 5e-5 at step 11,000 |
| Steps | 15,000 |
| Batch size | 16 |
| Train context | 256 |
| Grad clip | 1.0 (max norm) |
| Tokens seen | ~61M (≈2 tokens/param) |
| Hardware | Apple M2, MPS backend |
| Wall time | ~3.5 hours |
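The settings give two learning rates and a step number, but not the shape of the decay. Here is a minimal sketch, assuming a single step drop at step 11,000; the actual schedule may well be gradual, and the function name `lr_at` is hypothetical.

```python
BASE_LR = 5e-4        # learning rate at the start of training
DECAYED_LR = 5e-5     # learning rate after the decay point
DECAY_STEP = 11_000   # step at which the lower rate applies
TOTAL_STEPS = 15_000


def lr_at(step: int) -> float:
    """Learning rate for a 0-indexed step, under an ASSUMED step decay."""
    if not 0 <= step < TOTAL_STEPS:
        raise ValueError(f"step must be in [0, {TOTAL_STEPS}), got {step}")
    return BASE_LR if step < DECAY_STEP else DECAYED_LR


for s in (0, 10_999, 11_000, 14_999):
    print(s, lr_at(s))
```

In a PyTorch loop this could be applied by writing `lr_at(step)` into each entry of `optimizer.param_groups` before calling `optimizer.step()`.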

Results: final train loss 3.918, dev loss 4.006 (dev perplexity ≈ 55).
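Perplexity is the exponential of the mean per-token cross-entropy. A quick check of the reported figure, assuming the losses are in nats (the default for PyTorch's cross-entropy loss):

```python
import math

train_loss = 3.918
dev_loss = 4.006

# Perplexity = exp(mean cross-entropy in nats per token)
train_ppl = math.exp(train_loss)
dev_ppl = math.exp(dev_loss)

print(f"train perplexity: {train_ppl:.1f}")  # 50.3
print(f"dev perplexity:   {dev_ppl:.1f}")    # 54.9
```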

Note: at ~2 tokens/parameter this is well under the Chinchilla-optimal ~20:1 ratio, so the model is data-limited rather than capacity-limited. Dev loss had largely flattened by the end of training.
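The token budget and its gap to the 20:1 rule of thumb follow from the training settings. This is a small worked calculation, assuming every step used a full batch of 16 sequences of 256 tokens:

```python
PARAMS = 29_027_968
STEPS = 15_000
BATCH_SIZE = 16
CONTEXT = 256
CHINCHILLA_RATIO = 20  # rule-of-thumb tokens per parameter

# Tokens processed over the whole run (assumes full batches at every step)
tokens_seen = STEPS * BATCH_SIZE * CONTEXT
tokens_per_param = tokens_seen / PARAMS

# Token budget the 20:1 rule of thumb would suggest for this model size
target_tokens = CHINCHILLA_RATIO * PARAMS
shortfall = target_tokens / tokens_seen

print(f"tokens seen:      {tokens_seen:,}")          # 61,440,000
print(f"tokens per param: {tokens_per_param:.2f}")   # 2.12
print(f"20:1 target:      {target_tokens:,}")        # 580,559,360
print(f"shortfall factor: {shortfall:.1f}x")         # 9.4x
```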

Checkpoint format

The .pt file is a dict, not a bare state_dict:

{
    "model_state_dict": ...,
    "optimizer_state_dict": ...,
    "losses": [...],
    "train_loss": 3.918,
    "dev_loss": 4.006,
    "steps": 15000,
    "device": "mps",
    "pytorch_version": "...",
    "cuda_version": None
}
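Because the weights are wrapped in a dict, code that expects a bare state_dict has to unwrap them first. A minimal sketch of a helper that accepts either layout follows; the name `extract_model_state` is hypothetical, and the key name comes from the listing above.

```python
from typing import Any, Mapping


def extract_model_state(ckpt: Mapping[str, Any]) -> Mapping[str, Any]:
    """Return model weights from a full training checkpoint or a bare state_dict."""
    if "model_state_dict" in ckpt:
        # Full training checkpoint: weights live under their own key
        return ckpt["model_state_dict"]
    # Otherwise assume the mapping already is a state_dict
    return ckpt


# Stand-in dicts for illustration; real values would be tensors from torch.load(...)
full_ckpt = {"model_state_dict": {"w": 1}, "steps": 15000}
bare_state = {"w": 1}

print(extract_model_state(full_ckpt))   # {'w': 1}
print(extract_model_state(bare_state))  # {'w': 1}
```

With a real checkpoint, the result can be passed straight to `model.load_state_dict(...)`.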

Usage

This model needs its original architecture code to load. The model definition (Transformer + its Block) is not bundled here; clone it from the training repo, or use modeling.py if this repo includes it.

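import tiktoken
import torch
from huggingface_hub import hf_hub_download

# Model definition from the training repo (not bundled with the checkpoint);
# adjust the import path to wherever the class lives in your copy
from src.models.transformer import Transformer

# Download the checkpoint and load it on CPU
ckpt_path = hf_hub_download("Suyash11/pile-30m-base", "pile-30m-base.pt")
ckpt = torch.load(ckpt_path, map_location="cpu")

# Rebuild the model with the training hyperparameters, then load the weights
model = Transformer(
    n_head=8, n_embed=256, context_length=256,
    vocab_size=50304, N_BLOCKS=4,
)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

# Generate: tokenize with the GPT-2 tokenizer, run the model, decode
enc = tiktoken.get_encoding("gpt2")
idx = torch.tensor([enc.encode("The meaning of life is")], dtype=torch.long)
with torch.no_grad():  # inference only, no gradients needed
    out = model.generate(idx, max_new_tokens=50)
print(enc.decode(out[0].tolist()))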
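With learned positional embeddings the model has no position vector beyond index 255, so any sequence fed to it must fit in 256 tokens. Whether `model.generate` crops its input internally is not stated here. A minimal sketch of cropping a long prompt by hand, keeping the most recent tokens (the helper name `crop_to_context` is hypothetical):

```python
CONTEXT_LENGTH = 256  # from the architecture table


def crop_to_context(token_ids, context_length=CONTEXT_LENGTH):
    """Keep only the last `context_length` token ids of a sequence."""
    return list(token_ids)[-context_length:]


# Stand-in token ids for illustration; real ids would come from enc.encode(...)
long_prompt = list(range(300))
cropped = crop_to_context(long_prompt)
print(len(cropped), cropped[0])  # 256 44
```

Short prompts pass through unchanged, so the helper can be applied unconditionally before building the input tensor.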

Limitations

  • Tiny model, tiny token budget: not useful for downstream tasks as-is.
  • Base checkpoint: no instruction tuning, alignment, or safety filtering.
  • Trained on a small Pile subset; inherits that data's biases and artifacts.
  • Intended for education / research into from-scratch LLM training.

License

Apache-2.0.
