mhc-781m-mhc-lite

A 781M-parameter GPT-2-shape language model trained on a subset of Dolma v1_7. This is the mhc-lite variant of a 3-way comparison (residual / mhc / mhc-lite).

Training method: mHC-lite (lightweight HC variant from the mhc-lite paper)
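Architecture: ≈ 781M parameters (GPT-2 Large shape: 36 layers × 1280 hidden width; 4 streams when HC is enabled)
Tokenizer: GPT-2 BPE (tiktoken gpt2; embedding table sized 50304, i.e. the 50,257-token vocabulary padded to a multiple of 64)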
Sequence length: 1024
Optimizer: AdamW (paper used AdamW + Muon; here AdamW only)
Best validation loss: 3.6180 at iter 10000
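
As a rough sanity check on the ≈ 781M figure, the parameter count of a plain GPT-2 Large shape can be worked out by hand. The sketch below assumes a standard nanoGPT-style block (4× MLP expansion, biases on, output head tied to the token embedding); none of these details are confirmed by this card beyond "GPT-2 Large shape" and "Bias: True", and the helper name `gpt2_param_count` is hypothetical.

```python
def gpt2_param_count(n_layer=36, d=1280, vocab=50304, block_size=1024):
    """Parameter count for a plain GPT-2-shape model (assumed layout, no HC)."""
    # Per block: attention (qkv + output proj) and MLP (4x expansion) weights
    weights = 12 * d * d
    # Per block: biases for qkv (3d), attn proj (d), fc (4d), mlp proj (d),
    # plus two LayerNorms with weight and bias (4d)
    biases_and_norms = 13 * d
    blocks = n_layer * (weights + biases_and_norms)
    # Token embedding (assumed tied with the output head) + position embedding
    embeddings = vocab * d + block_size * d
    final_norm = 2 * d
    return blocks + embeddings + final_norm

baseline = gpt2_param_count()
print(f"{baseline:,}")  # 774,090,240
```

Under these assumptions the plain residual backbone comes to about 774M parameters. The remaining ~7M up to ≈ 781M would presumably belong to the mHC-lite stream-mixing parameters, though the card does not give that breakdown.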

Training recipe

The recipe follows "Ablate and Rescue" (arXiv 2603.14833) with one deviation: the effective batch size is reduced from the paper's 524,288 tokens/step (0.5M) to 65,536 tokens/step, an 8× reduction, so that training fits on a single GPU.

Setting	This run	Paper
Tokens / step	65,536	524,288
Steps	10,000	10,000+
Tokens seen	~655M	~3.18B
LR / min_lr	3e-4 / 3e-5	3e-4 / 3e-5
Warmup / decay	200 / 10K (cosine)	200 / 10K (cosine)
Weight decay	0.1	0.1
β1, β2	0.9, 0.95	0.9, 0.95
Grad clip	1.0	1.0
Bias	True	True
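
The numbers in the "this run" column can be reproduced with a few lines of arithmetic. The sketch below also shows one way the warmup and cosine decay could be implemented; the linear warmup shape and the function name `get_lr` follow common nanoGPT practice and are assumptions, not something this card states.

```python
import math

# Values taken from the "this run" column of the table above
TOKENS_PER_STEP = 65_536
SEQ_LEN = 1024
STEPS = 10_000
LR, MIN_LR = 3e-4, 3e-5
WARMUP_STEPS, DECAY_STEPS = 200, 10_000

sequences_per_step = TOKENS_PER_STEP // SEQ_LEN   # 64 sequences of 1024 tokens
tokens_seen = TOKENS_PER_STEP * STEPS             # 655,360,000 (~655M)
batch_ratio = 524_288 // TOKENS_PER_STEP          # paper batch is 8x larger

def get_lr(step):
    """Linear warmup (assumed), then cosine decay from LR down to MIN_LR."""
    if step < WARMUP_STEPS:
        return LR * (step + 1) / (WARMUP_STEPS + 1)
    if step > DECAY_STEPS:
        return MIN_LR
    progress = (step - WARMUP_STEPS) / (DECAY_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return MIN_LR + coeff * (LR - MIN_LR)
```

With these settings the learning rate peaks at 3e-4 at step 200 and reaches 3e-5 at step 10,000, which is also the final training step of this run.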

Load and run

This is not a transformers-native model. The state_dict targets the nanoGPT-style GPT(GPTConfig) class in this repo. To use it:

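from huggingface_hub import snapshot_download
import sys, json, torch
from safetensors.torch import load_file

# Download the full repo snapshot (weights, config.json, model code)
local = snapshot_download(repo_id="Realmbird/mhc-781m-mhc-lite")
# Make the repo's model module importable
sys.path.insert(0, local)
from model import GPT, GPTConfig

# Rebuild the architecture from the stored config
with open(f"{local}/config.json") as f:
    cfg = GPTConfig(**json.load(f))
model = GPT(cfg)
# Load the safetensors weights into the freshly built model
sd = load_file(f"{local}/model.safetensors")
model.load_state_dict(sd)
model.eval()  # inference mode (disables dropout)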
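
Once the model is loaded, a continuation can be produced with a simple greedy loop. This is a minimal sketch that assumes the repo's `GPT.forward(idx)` returns a `(logits, loss)` tuple as in upstream nanoGPT; that signature is not confirmed by this card, and `greedy_continue` is a hypothetical helper, not part of the repo.

```python
import torch

@torch.no_grad()
def greedy_continue(model, token_ids, max_new_tokens=20, block_size=1024):
    """Greedy decoding; assumes model(idx) returns (logits, loss) like nanoGPT."""
    device = next(model.parameters()).device
    idx = torch.tensor([token_ids], dtype=torch.long, device=device)
    for _ in range(max_new_tokens):
        # Crop the context to the model's sequence length (1024 for this model)
        window = idx[:, -block_size:]
        logits, _ = model(window)
        # Take the most likely token at the last position
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        idx = torch.cat([idx, next_id], dim=1)
    return idx[0].tolist()

# Usage with the GPT-2 BPE tokenizer (requires `pip install tiktoken`):
#   import tiktoken
#   enc = tiktoken.get_encoding("gpt2")
#   ids = enc.encode("The capital of France is")
#   print(enc.decode(greedy_continue(model, ids)))
```

Greedy decoding is used here only because it is deterministic and short; if the repo's `GPT` class ships its own `generate` method, that is likely the better entry point.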

Companion variants
