This is a continued pre-training (CPT) checkpoint, trained on 1B additional tokens with a next-token-prediction objective on top of `mrinaalarora/mrinaal-124m-base`.
The 1B-token CPT mix is:
| share | component | dataset | train tokens | validation tokens |
|---|---|---|---|---|
| 50% | fineweb-edu-dedup | HuggingFaceTB/smollm-corpus, subset fineweb-edu-dedup | 500M | 10M |
| 30% | dclm-baseline-1.0 | mlfoundations/dclm-baseline-1.0 | 300M | 6M |
| 15% | finemath-4plus | HuggingFaceTB/finemath, subset finemath-4plus | 150M | 3M |
| 5% | cosmopedia-v2 | HuggingFaceTB/smollm-corpus, subset cosmopedia-v2 | 50M | 1M |
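
The per-component token counts in the table follow directly from the shares. As a quick sanity check, here is a minimal sketch that recomputes them; the helper name `token_budget` is hypothetical, and the 20M validation total is taken as the sum of the validation column rather than from any separate statement in this card.

```python
# Shares and component names are copied from the table above. The 1B train
# total is stated in this card; the 20M validation total is the sum of the
# table's validation column (10M + 6M + 3M + 1M).
MIX = {
    "fineweb-edu-dedup": 50,
    "dclm-baseline-1.0": 30,
    "finemath-4plus": 15,
    "cosmopedia-v2": 5,
}


def token_budget(total_tokens: int, mix: dict) -> dict:
    """Split total_tokens across components by integer percentage share."""
    if sum(mix.values()) != 100:
        raise ValueError("shares must sum to 100%")
    return {name: total_tokens * pct // 100 for name, pct in mix.items()}


train = token_budget(1_000_000_000, MIX)
val = token_budget(20_000_000, MIX)
for name in MIX:
    print(f"{name:>18}: {train[name]:>13,} train | {val[name]:>10,} val")
```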
The model is a 124M-parameter decoder-only causal language model, continued-pretrained from `mrinaalarora/mrinaal-124m-base`.
This v2 checkpoint adds 1B more tokens of next-token-prediction training on a mixed data recipe.
| param | value |
|---|---|
| parameters | ~124M |
| layers | 12 |
| hidden size | 768 |
| attention heads | 12 |
| context length | 1024 tokens |
| vocab size | 50257 |
| positional encoding | RoPE |
| norm | RMSNorm |
| activation | SwiGLU |
| tokenizer | GPT-2 tokenizer |
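
The ~124M figure can be roughly reproduced from the table. The sketch below is only an estimate under assumptions that this card does not state: no bias terms, a SwiGLU hidden width of 2048 (a hypothetical value, about 8/3 of the hidden size), one weight vector per RMSNorm with two norms per layer plus a final norm, and no learned parameters for RoPE. The input embedding and output head are counted once because the original model class ties them.

```python
def estimate_params(vocab=50257, d_model=768, n_layer=12, ffn_hidden=2048):
    """Rough parameter count; ffn_hidden and the no-bias layout are assumptions."""
    embed = vocab * d_model                # token embedding, shared with lm_head
    attn = 4 * d_model * d_model           # Q, K, V and output projections
    mlp = 3 * d_model * ffn_hidden         # SwiGLU: gate, up and down projections
    norms = (2 * n_layer + 1) * d_model    # RMSNorm weights (assumed layout)
    return embed + n_layer * (attn + mlp) + norms


total = estimate_params()
print(f"estimated parameters: {total / 1e6:.1f}M")
```

If the real feed-forward width or norm layout differs, the total shifts accordingly; the checkpoint itself is the authoritative source.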
- Base checkpoint: `mrinaalarora/mrinaal-124m-base/model_best.safetensors`
- Data mix: 50% `fineweb-edu-dedup`, 30% `dclm-baseline-1.0`, 15% `finemath-4plus`, 5% `cosmopedia-v2`
- Training data: `/vol/datasets/cpt_mix_gpt2_1b_train`
- Validation data: `/vol/datasets/cpt_mix_gpt2_20m_val`
- `model.safetensors`: best continued-pretraining checkpoint
- `run_summary.json`: full training run metadata
- `last.pt` was not uploaded; this repo intentionally publishes the best checkpoint only.

```python
import torch
from safetensors.torch import load_file

# Load the raw weights as a dict mapping parameter names to tensors.
state_dict = load_file("model.safetensors")
```
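To use the original model class, clone the training repo and run:

```python
from safetensors.torch import load_file

from first_llm_pretrain.model import DecoderOnlyTransformer, ModelConfig

# Hyperparameters must match the checkpoint's architecture.
config = ModelConfig(
    vocab_size=50257,
    block_size=1024,
    n_layer=12,
    n_head=12,
    n_embd=768,
)
model = DecoderOnlyTransformer(config)
# strict=False tolerates the lm_head.weight key that is absent from the file.
model.load_state_dict(load_file("model.safetensors"), strict=False)
model.eval()  # switch to inference mode
```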
`strict=False` is used because the safetensors conversion removes the duplicate `lm_head.weight` tensor and keeps `token_embedding.weight`; the original model class ties those weights.
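
One drawback of `strict=False` is that it also hides any other missing or unexpected keys. A possible alternative, sketched below, is to re-add the dropped key as an alias of the embedding before loading. The helper `restore_tied_head` is hypothetical and not part of the training repo; it assumes `lm_head.weight` is the only key the conversion removed. The demo uses plain lists as stand-ins so that it runs without the checkpoint.

```python
from typing import Any, Dict


def restore_tied_head(
    state_dict: Dict[str, Any],
    source_key: str = "token_embedding.weight",
    tied_key: str = "lm_head.weight",
) -> Dict[str, Any]:
    """Return a shallow copy in which tied_key aliases source_key if absent."""
    restored = dict(state_dict)
    if tied_key not in restored:
        restored[tied_key] = restored[source_key]
    return restored


# Stand-in values; with the real checkpoint these would be the tensors
# returned by safetensors.torch.load_file("model.safetensors").
demo = {"token_embedding.weight": [[0.1, 0.2], [0.3, 0.4]]}
full = restore_tied_head(demo)
print(sorted(full))
```

With the real state dict, passing the result to `model.load_state_dict(..., strict=True)` should then report any remaining mismatch instead of silently ignoring it, provided the assumption about the dropped key holds.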