mhc-781m-mhc-lite

A 781M-parameter GPT-2-shape language model trained on a subset of Dolma v1_7. This is the mhc-lite variant of a 3-way comparison (residual / mhc / mhc-lite).

  • Training method: mHC-lite (lightweight HC variant from the mhc-lite paper)
  • Architecture: ≈ 781M parameters (GPT-2 Large shape: 36 layers × 1280 hidden width; 4 streams when HC is enabled)
  • Tokenizer: GPT-2 BPE (tiktoken gpt2, vocab 50304)
  • Sequence length: 1024
  • Optimizer: AdamW (paper used AdamW + Muon; here AdamW only)
  • Best validation loss: 3.6180 at iter 10000
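
As a rough cross-check on the parameter count, the sketch below counts the weights of a plain GPT-2-style decoder with the shape listed above (36 layers, width 1280, vocab 50304, context 1024, biases on). It assumes tied input/output embeddings, a 4× MLP expansion, and learned position embeddings, none of which are stated in this card, and it ignores any hyper-connection parameters. `gpt2_param_count` is a hypothetical helper, not part of this repo.

```python
def gpt2_param_count(n_layer=36, d_model=1280, vocab=50304, block_size=1024, bias=True):
    """Approximate parameter count for a plain GPT-2-style decoder.

    Assumptions (not confirmed by the model card): tied token/output
    embeddings, 4x MLP expansion, learned position embeddings, and no
    hyper-connection parameters.
    """
    d = d_model
    b = 1 if bias else 0
    # Attention: fused QKV projection plus output projection.
    attn = (d * 3 * d + b * 3 * d) + (d * d + b * d)
    # MLP: up-projection to 4d, then down-projection back to d.
    mlp = (d * 4 * d + b * 4 * d) + (4 * d * d + b * d)
    # Two LayerNorms per block, each with a weight and an optional bias.
    norms = 2 * (d + b * d)
    per_layer = attn + mlp + norms
    # Token embeddings (shared with the output head) plus position embeddings.
    embeddings = vocab * d + block_size * d
    final_norm = d + b * d
    return n_layer * per_layer + embeddings + final_norm


n = gpt2_param_count()
print(f"{n:,} parameters (~{n / 1e6:.0f}M)")
```

Under these assumptions the estimate comes to about 774M, so the remaining gap to ≈ 781M would have to come from parts this sketch does not model.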

Training recipe

The recipe follows "Ablate and Rescue" (arxiv 2603.14833) with one deviation: the effective batch size is reduced from the paper's 0.5M tokens/step to 65,536 tokens/step so that training fits on a single GPU.

Setting            This run             Paper
Tokens / step      65,536               524,288
Steps              10,000               10,000+
Tokens seen        ~655M                ~3.18B
LR / min_lr        3e-4 / 3e-5          3e-4 / 3e-5
Warmup / decay     200 / 10K (cosine)   200 / 10K (cosine)
Weight decay       0.1                  0.1
β1, β2             0.9, 0.95            0.9, 0.95
Grad clip          1.0                  1.0
Bias               True                 True
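
The "this run" numbers can be reproduced with a few lines of arithmetic, and the warmup/decay row can be read as a learning-rate schedule. The sketch below assumes a nanoGPT-style schedule (linear warmup to the peak LR, then cosine decay down to min_lr); the exact warmup formula used for this run is not stated here, and `lr_at` is a hypothetical helper name.

```python
import math

SEQ_LEN = 1024
TOKENS_PER_STEP = 65_536
PAPER_TOKENS_PER_STEP = 524_288
STEPS = 10_000

# Batch arithmetic for this run, assuming every sequence is a full 1024 tokens.
seqs_per_step = TOKENS_PER_STEP // SEQ_LEN                  # sequences per optimizer step
tokens_seen = TOKENS_PER_STEP * STEPS                       # total training tokens
batch_reduction = PAPER_TOKENS_PER_STEP // TOKENS_PER_STEP  # how much smaller than the paper's batch


def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup=200, decay_steps=10_000):
    """Linear warmup followed by cosine decay (assumed form of the schedule)."""
    if step < warmup:
        # Ramp linearly so the peak LR is reached on the last warmup step.
        return max_lr * (step + 1) / warmup
    if step >= decay_steps:
        return min_lr
    progress = (step - warmup) / (decay_steps - warmup)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return min_lr + cosine * (max_lr - min_lr)


print(seqs_per_step, f"{tokens_seen:,}", batch_reduction)
for s in (0, 200, 5_100, 10_000):
    print(s, f"{lr_at(s):.2e}")
```

The batch is 64 sequences per step, 8× smaller than the paper's, which gives the ~655M tokens seen over 10,000 steps.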

Load and run

This is not a transformers-native model. The state_dict targets the nanoGPT-style GPT(GPTConfig) class in this repo. To use it:

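import json
import sys

import torch
from huggingface_hub import snapshot_download
from safetensors.torch import load_file

# Download the repo files (including config.json and model.safetensors) to the local HF cache.
local = snapshot_download(repo_id="Realmbird/mhc-781m-mhc-lite")

# Make the repo's model module importable, then pull in the nanoGPT-style classes.
sys.path.insert(0, local)
from model import GPT, GPTConfig

# Build the model from the saved config and load the weights.
with open(f"{local}/config.json") as f:
    cfg = GPTConfig(**json.load(f))
model = GPT(cfg)
sd = load_file(f"{local}/model.safetensors")  # tensors are loaded on CPU by default
model.load_state_dict(sd)
model.eval()  # switch to evaluation mode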
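
One fragile spot in the snippet above is `GPTConfig(**json.load(f))`, which raises a `TypeError` if config.json carries any key that GPTConfig does not declare. A minimal defensive sketch follows, assuming GPTConfig is a dataclass as in nanoGPT (not confirmed by this card). `build_config` and the stand-in `DemoConfig` are hypothetical names used only for illustration.

```python
import dataclasses


def build_config(config_cls, raw):
    """Instantiate a dataclass config from a dict, ignoring undeclared keys."""
    known = {f.name for f in dataclasses.fields(config_cls)}
    extra = sorted(set(raw) - known)
    if extra:
        print(f"ignoring unknown config keys: {extra}")
    return config_cls(**{k: v for k, v in raw.items() if k in known})


# Stand-in for GPTConfig so the example runs without the repo's model code.
@dataclasses.dataclass
class DemoConfig:
    n_layer: int = 12
    n_embd: int = 768


# "model_type" plays the role of an extra key the config class does not know.
cfg = build_config(DemoConfig, {"n_layer": 36, "n_embd": 1280, "model_type": "gpt"})
print(cfg)
```

With the real class, the equivalent call would be `cfg = build_config(GPTConfig, json.load(f))`.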

Companion variants

The other two runs in the 3-way comparison are the residual and mhc variants.
