TinyLM Checkpoints: Full A/B/C/D Ablation (HPC re-run)
All four trained checkpoints from the TinyLM 275M architecture ablation, re-run on Northeastern Explorer HPC (A100-40GB). Each arm trained 23k steps on 8B unique FineWeb-Edu tokens (~3 epochs, ~24B processed).
For the model card, full eval results, and recommended usage, see Shiv-22/tinylm (Run D, the best-performing arm).
Ablation matrix
| Arm | Attention | Optimizer | File | Headline avg |
|---|---|---|---|---|
| A | Standard MHA | AdamW | run_A/step_22999.pt | 43.62% |
| B | MLA | AdamW | run_B/step_22999.pt | 44.11% |
| C | Standard MHA | Muon | run_C/step_22999.pt | 44.64% |
| D | MLA | Muon | run_D/step_22999.pt | 45.14% |
The four arms differ only in attention class (MHA vs MLA) and matrix optimizer (AdamW vs Muon). All other settings (data, schedule, batch size, model dimensions, tokenizer) are identical.
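Because the arms cross two attention classes with two optimizers, the headline averages can be differenced to separate the contribution of each factor. The sketch below does that arithmetic using only the four numbers from the ablation matrix; the variable names are illustrative and not part of the TinyLM codebase.

```python
# Headline averages (%) copied from the ablation matrix.
scores = {
    ("MHA", "AdamW"): 43.62,  # arm A
    ("MLA", "AdamW"): 44.11,  # arm B
    ("MHA", "Muon"): 44.64,   # arm C
    ("MLA", "Muon"): 45.14,   # arm D
}

# Gain from swapping MHA -> MLA, measured under each optimizer.
mla_gain = {
    opt: round(scores[("MLA", opt)] - scores[("MHA", opt)], 2)
    for opt in ("AdamW", "Muon")
}

# Gain from swapping AdamW -> Muon, measured under each attention class.
muon_gain = {
    attn: round(scores[(attn, "Muon")] - scores[(attn, "AdamW")], 2)
    for attn in ("MHA", "MLA")
}

# Combined gain of arm D over the baseline arm A.
total_gain = round(scores[("MLA", "Muon")] - scores[("MHA", "AdamW")], 2)

print("MLA gain by optimizer:", mla_gain)
print("Muon gain by attention:", muon_gain)
print("D over A:", total_gain)
```

On these numbers the two factors look close to additive: MLA adds about 0.5 points and Muon about 1.0 point whichever setting the other factor takes. Only one headline number is reported per arm here, so differences of a few hundredths of a point should be read cautiously.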
Full breakdown: https://github.com/shivnarainms22/TinyLM/blob/main/results/hpc_rerun_ablation.md
Loading a specific arm
```python
import torch
from huggingface_hub import hf_hub_download
from tinylm.model import TinyLM, ModelConfig

arm = "run_D"  # or run_A, run_B, run_C
ckpt_path = hf_hub_download(
    repo_id="Shiv-22/tinylm-checkpoints-v2",
    filename=f"{arm}/step_22999.pt",
)
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=True)

model = TinyLM(ModelConfig(**ckpt["config"]))
state = ckpt["model"]
# Checkpoints saved from a torch.compile-wrapped model prefix keys with "_orig_mod.".
if any(k.startswith("_orig_mod.") for k in state):
    state = {k.removeprefix("_orig_mod."): v for k, v in state.items()}
model.load_state_dict(state)
model.eval()
```
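The `_orig_mod.` handling in the loader is needed because `torch.compile` wraps a module and keeps the original under an `_orig_mod` attribute, so a state dict saved from the compiled wrapper carries that prefix on its keys. Below is a minimal sketch of the same key rewrite as a standalone function, using a plain dictionary so it runs without PyTorch. The name `strip_compile_prefix` and the example keys are hypothetical and do not come from the TinyLM repo.

```python
def strip_compile_prefix(state, prefix="_orig_mod."):
    """Return a copy of `state` with `prefix` removed from keys that start with it."""
    # str.removeprefix (Python 3.9+) leaves keys without the prefix unchanged.
    return {key.removeprefix(prefix): value for key, value in state.items()}


# Made-up key names standing in for a checkpoint saved from a compiled model.
compiled_state = {
    "_orig_mod.embed.weight": "tensor-0",
    "_orig_mod.blocks.0.attn.weight": "tensor-1",
}
clean_state = strip_compile_prefix(compiled_state)
print(sorted(clean_state))
```

Because `removeprefix` is a no-op on keys that lack the prefix, the function could be applied unconditionally. The `any(...)` guard in the loader just avoids rebuilding the dictionary when no key needs rewriting.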
The TinyLM model class lives in the source repo:
github.com/shivnarainms22/TinyLM.
v1 contrast
The earlier RunPod-era TinyLM (1B unique tokens looped ~21×, single arm) is
preserved at Shiv-22/tinylm-checkpoints.
The data fix (1B×21 → 8B unique, this re-run) was worth +3.97 avg pts over
that v1 on the same MLA+Muon arm, roughly 2.6× the architecture+optimizer
ablation gain.
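The 2.6× figure can be checked from numbers already given. The sketch below assumes "ablation gain" means arm D minus arm A in the ablation matrix, which is an interpretation because the text does not spell it out.

```python
data_fix_gain = 3.97           # v2 over v1 on the MLA+Muon arm, as stated in the text
ablation_gain = 45.14 - 43.62  # arm D minus arm A, from the ablation matrix

ratio = data_fix_gain / ablation_gain
print(f"ablation gain: {ablation_gain:.2f} pts, ratio: {ratio:.2f}x")
```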
License
Apache 2.0. Inherits the permissive terms of modded-nanogpt (MIT) for the codebase and FineWeb-Edu (ODC-By) for the training data.