A 781M-parameter GPT-2-shape language model trained on a subset of Dolma v1_7. This is the mhc-lite variant of a 3-way comparison (residual / mhc / mhc-lite).
Tokenizer: tiktoken gpt2 (vocab 50304).

The recipe follows "Ablate and Rescue" (arXiv 2603.14833) with one deviation: the effective batch size is reduced from the paper's 0.5M tokens/step to 65,536 tokens/step so that training fits on a single GPU.

| | This run | Paper |
|---|---|---|
| Tokens / step | 65,536 | 524,288 |
| Steps | 10,000 | 10,000+ |
| Tokens seen | ~655M | ~3.18B |
| LR / min_lr | 3e-4 / 3e-5 | 3e-4 / 3e-5 |
| Warmup / decay | 200 / 10K (cosine) | 200 / 10K (cosine) |
| Weight decay | 0.1 | 0.1 |
| β1, β2 | 0.9, 0.95 | 0.9, 0.95 |
| Grad clip | 1.0 | 1.0 |
| Bias | True | True |
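
The learning-rate rows above can be read as a warmup-then-cosine schedule. The following is a minimal sketch of such a schedule, not the training code from this repo: the function name `get_lr` and the linear shape of the warmup are assumptions, since the table gives only the step counts and the peak and minimum rates.

```python
import math

# Values taken from the table above.
LEARNING_RATE = 3e-4   # peak LR
MIN_LR = 3e-5          # floor reached at the end of the decay
WARMUP_STEPS = 200
DECAY_STEPS = 10_000

def get_lr(step: int) -> float:
    """Return the learning rate for a given optimizer step.

    Assumption: linear warmup from 0; the table only says "200" warmup steps.
    """
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    if step > DECAY_STEPS:
        return MIN_LR
    # Cosine decay from LEARNING_RATE down to MIN_LR between warmup and DECAY_STEPS.
    ratio = (step - WARMUP_STEPS) / (DECAY_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))
    return MIN_LR + coeff * (LEARNING_RATE - MIN_LR)
```

Under these assumptions the rate peaks at step 200 and reaches `MIN_LR` exactly at step 10,000, which is also the last step of this run.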
This is not a `transformers`-native model. The `state_dict` targets the nanoGPT-style `GPT(GPTConfig)` class in this repo. To use it:
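```python
import json
import sys

import torch
from huggingface_hub import snapshot_download
from safetensors.torch import load_file

# Download the repo (model.py, config.json, model.safetensors) to the local cache.
local = snapshot_download(repo_id="Realmbird/mhc-781m-mhc-lite")

# Make the repo's model.py importable.
sys.path.insert(0, local)
from model import GPT, GPTConfig

# Build the model from the saved config, then load the weights.
with open(f"{local}/config.json") as f:
    cfg = GPTConfig(**json.load(f))
model = GPT(cfg)
sd = load_file(f"{local}/model.safetensors")
model.load_state_dict(sd)
model.eval()
```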
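
If `config.json` ever carries keys that `GPTConfig` does not accept, `GPTConfig(**json.load(f))` would raise a `TypeError`. The sketch below shows one way to guard against that, assuming `GPTConfig` is a dataclass as in nanoGPT. The `GPTConfig` defined here is a hypothetical stand-in with illustrative field names, and `config_from_dict` is a helper introduced for this example; neither is taken from the repo.

```python
import dataclasses
import json

@dataclasses.dataclass
class GPTConfig:
    # Hypothetical stand-in: the real class and its fields live in the repo's model.py.
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    vocab_size: int = 50304
    bias: bool = True

def config_from_dict(config_cls, raw: dict):
    """Build config_cls from raw, ignoring keys that are not dataclass fields.

    Returns the config and the sorted list of ignored keys.
    """
    known = {f.name for f in dataclasses.fields(config_cls)}
    ignored = sorted(set(raw) - known)
    kwargs = {k: v for k, v in raw.items() if k in known}
    return config_cls(**kwargs), ignored

# Example input with one key ("model_type") that the stand-in config does not define.
raw = json.loads('{"n_layer": 2, "vocab_size": 50304, "model_type": "gpt"}')
cfg, ignored = config_from_dict(GPTConfig, raw)
```

Checking the returned list of ignored keys makes a mismatch between `config.json` and the class visible instead of failing at construction time.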