pile-30m-base
A 29M-parameter decoder-only Transformer trained from scratch on a ~30M-token subset of The Pile. Built end-to-end on an Apple M2 MacBook Air (8 GB unified memory) using PyTorch MPS: no CUDA, no pretrained weights.
This is a base / pretraining checkpoint, not instruction-tuned. Expect locally fluent English that loses coherence after a sentence or two, which is typical for this scale and token budget.
Architecture
| Component | Value |
|---|---|
| Parameters | 29,027,968 |
| Embedding dim (n_embed) | 256 |
| Attention heads (n_head) | 8 |
| Transformer blocks (N_BLOCKS) | 4 |
| Context length | 256 |
| Vocab size | 50,304 (GPT-2 tokenizer, padded) |
| Positional embeddings | Learned |
Custom implementation (not a transformers class). Token + learned-position
embeddings → 4 pre-norm Transformer blocks → final LayerNorm → untied LM head (weights not shared with the token embedding).
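As a sanity check on the parameter count, here is one breakdown that reproduces the stated total of 29,027,968. The layer details are assumptions, not confirmed by this card: bias-free Q/K/V projections, a biased attention output projection, a 4x-wide MLP with biases, and an untied LM head with a bias. Other layouts could in principle give the same number.

```python
# Hyperparameters taken from the Architecture table.
vocab, d, ctx, n_blocks = 50_304, 256, 256, 4

def linear(n_in, n_out, bias=True):
    """Parameter count of a dense layer."""
    return n_in * n_out + (n_out if bias else 0)

layer_norm = 2 * d                      # scale + shift
tok_emb = vocab * d                     # token embedding table
pos_emb = ctx * d                       # learned positional embeddings

# ASSUMPTION: Q, K, V without bias; output projection with bias.
attn = 3 * linear(d, d, bias=False) + linear(d, d)
# ASSUMPTION: MLP expands to 4*d and projects back, both with bias.
mlp = linear(d, 4 * d) + linear(4 * d, d)
block = 2 * layer_norm + attn + mlp     # pre-norm: one LayerNorm before each sublayer

# ASSUMPTION: untied LM head with bias.
lm_head = linear(d, vocab)

total = tok_emb + pos_emb + n_blocks * block + layer_norm + lm_head
embedding_share = (tok_emb + pos_emb + lm_head) / total

print(f"per block: {block:,}")
print(f"total:     {total:,}")
print(f"embeddings + LM head share: {embedding_share:.1%}")
```

Under these assumptions the embeddings and LM head hold roughly 89% of the parameters, so the four Transformer blocks themselves account for only about 3.2M.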
Training
| Setting | Value |
|---|---|
| Optimizer | AdamW |
| LR | 5e-4, decayed to 5e-5 at step 11,000 |
| Steps | 15,000 |
| Batch size | 16 |
| Train context | 256 |
| Grad clip | 1.0 (max norm) |
| Tokens seen | ~61M (≈2 tokens/param) |
| Hardware | Apple M2, MPS backend |
| Wall time | ~3.5 hours |
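The optimizer settings in the table can be sketched in PyTorch as follows. This is not the training script: the `nn.Linear` model and MSE loss are stand-ins so the snippet runs on its own, and the LR row is read as a single 10x drop at step 11,000, which is an assumption (the actual decay could have been gradual).

```python
import torch
from torch import nn

# Stand-in model so the sketch is self-contained; the real run used the custom Transformer.
model = nn.Linear(8, 8)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
# ASSUMPTION: one step drop, 5e-4 -> 5e-5, at training step 11,000.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[11_000], gamma=0.1
)

def train_step(x, y):
    """One optimizer update with gradient-norm clipping at 1.0."""
    optimizer.zero_grad(set_to_none=True)
    loss = nn.functional.mse_loss(model(x), y)  # placeholder; an LM would use cross-entropy
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()  # advance the schedule once per training step
    return loss.item()

loss = train_step(torch.randn(16, 8), torch.randn(16, 8))
print(f"loss={loss:.3f}, lr={scheduler.get_last_lr()[0]:.1e}")
```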
Results: final train loss 3.918, dev loss 4.006 (perplexity ≈ 55).
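The perplexity figure follows from the dev loss, assuming the reported losses are mean per-token cross-entropy in nats (PyTorch's default):

```python
import math

train_loss = 3.918
dev_loss = 4.006

# Perplexity is exp(mean cross-entropy) when the loss is measured in nats.
train_ppl = math.exp(train_loss)
dev_ppl = math.exp(dev_loss)

print(f"train perplexity: {train_ppl:.1f}")
print(f"dev perplexity:   {dev_ppl:.1f}")
```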
Note: at ~2 tokens/parameter this is well under the Chinchilla-optimal ~20:1 ratio, so the model is data-limited rather than capacity-limited; dev loss had largely flattened by the end of training.
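The token arithmetic behind this note can be checked directly. The sketch assumes every step consumes one full batch of 16 sequences of 256 tokens, with no gradient accumulation:

```python
steps = 15_000
batch_size = 16
context = 256
params = 29_027_968

# Tokens processed over the whole run.
tokens_seen = steps * batch_size * context
tokens_per_param = tokens_seen / params

# Token budget at the ~20:1 Chinchilla rule of thumb, for comparison.
chinchilla_tokens = 20 * params

print(f"tokens seen:       {tokens_seen:,}")
print(f"tokens per param:  {tokens_per_param:.2f}")
print(f"20:1 budget:       {chinchilla_tokens:,}")
print(f"fraction of 20:1:  {tokens_seen / chinchilla_tokens:.1%}")
```

If the training subset really is about 30M tokens, ~61M tokens seen also means roughly two passes over the same data.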
Checkpoint format
The .pt file is a dict, not a bare state_dict:
```python
{
    "model_state_dict": ...,
    "optimizer_state_dict": ...,
    "losses": [...],
    "train_loss": 3.918,
    "dev_loss": 4.006,
    "steps": 15000,
    "device": "mps",
    "pytorch_version": "...",
    "cuda_version": None,
}
```
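Because the weights sit next to training metadata, it can help to look at the small fields before touching the tensors. The helper below is a hypothetical convenience, not part of the repo; the `example` dict is a stand-in with the same keys, and in practice `ckpt` would come from `torch.load(...)`.

```python
def summarize_checkpoint(ckpt):
    """Return the small metadata fields, leaving out tensors and the loss history."""
    bulky = {"model_state_dict", "optimizer_state_dict", "losses"}
    summary = {k: v for k, v in ckpt.items() if k not in bulky}
    summary["num_logged_losses"] = len(ckpt.get("losses", []))
    return summary

# Stand-in checkpoint with placeholder values for the bulky entries.
example = {
    "model_state_dict": {},       # placeholder; really a dict of tensors
    "optimizer_state_dict": {},   # placeholder; really the AdamW state
    "losses": [],                 # placeholder; really the logged losses
    "train_loss": 3.918,
    "dev_loss": 4.006,
    "steps": 15000,
    "device": "mps",
    "pytorch_version": "...",
    "cuda_version": None,
}

summary = summarize_checkpoint(example)
print(summary)
```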
Usage
This model needs its original architecture code to load. The model definition
(Transformer + its Block) is not bundled here; clone it from the
training repo, or use modeling.py if it is included in this repo.
```python
import tiktoken
import torch
from huggingface_hub import hf_hub_download

# The model class lives in the training repo, not in this one; adjust the
# import path to wherever you placed the architecture code.
from src.models.transformer import Transformer

# Download the checkpoint and load it on CPU.
ckpt_path = hf_hub_download("Suyash11/pile-30m-base", "pile-30m-base.pt")
ckpt = torch.load(ckpt_path, map_location="cpu")

# Rebuild the architecture with the hyperparameters used in training.
model = Transformer(
    n_head=8, n_embed=256, context_length=256,
    vocab_size=50304, N_BLOCKS=4,
)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

# Generate (tokenize with the GPT-2 tokenizer).
enc = tiktoken.get_encoding("gpt2")
idx = torch.tensor([enc.encode("The meaning of life is")], dtype=torch.long)
with torch.no_grad():
    out = model.generate(idx, max_new_tokens=50)
print(enc.decode(out[0].tolist()))
```
Limitations
- Tiny model, tiny token budget: not useful for downstream tasks as-is.
- Base checkpoint: no instruction tuning, alignment, or safety filtering.
- Trained on a small Pile subset; inherits that data's biases and artifacts.
- Intended for education / research into from-scratch LLM training.
License
Apache-2.0.