This is a continued-pretraining checkpoint, trained on 1B additional tokens with a next-token-prediction objective on top of mrinaalarora/mrinaal-124m-base.

The 1B-token CPT mix is:

| share | component | dataset | train tokens | validation tokens |
|---|---|---|---|---|
| 50% | fineweb-edu-dedup | HuggingFaceTB/smollm-corpus, subset fineweb-edu-dedup | 500M | 10M |
| 30% | dclm-baseline-1.0 | mlfoundations/dclm-baseline-1.0 | 300M | 6M |
| 15% | finemath-4plus | HuggingFaceTB/finemath, subset finemath-4plus | 150M | 3M |
| 5% | cosmopedia-v2 | HuggingFaceTB/smollm-corpus, subset cosmopedia-v2 | 50M | 1M |
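The per-component token counts follow directly from the shares. The sketch below recomputes them from the 1B-token training budget and the 20M-token validation budget; the names `MIX_PERCENT` and `token_budget` are illustrative and do not come from the training repo.

```python
# Shares from the mix table, as integer percentages to avoid float rounding.
MIX_PERCENT = {
    "fineweb-edu-dedup": 50,
    "dclm-baseline-1.0": 30,
    "finemath-4plus": 15,
    "cosmopedia-v2": 5,
}

TRAIN_TOKENS = 1_000_000_000
VAL_TOKENS = 20_000_000


def token_budget(total_tokens, mix_percent):
    """Split a total token budget across components by percentage share."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("mix shares must sum to 100%")
    return {name: total_tokens * pct // 100 for name, pct in mix_percent.items()}


train_budget = token_budget(TRAIN_TOKENS, MIX_PERCENT)
val_budget = token_budget(VAL_TOKENS, MIX_PERCENT)

for name in MIX_PERCENT:
    print(f"{name:>20}: train={train_budget[name]:>13,} val={val_budget[name]:>11,}")
```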

mrinaal-124m-base-v2

A 124M-parameter decoder-only causal language model, produced by continued pretraining from mrinaalarora/mrinaal-124m-base. This v2 checkpoint adds 1B more next-token-prediction tokens drawn from a mixed data recipe.

model config

| param | value |
|---|---|
| parameters | ~124M |
| layers | 12 |
| hidden size | 768 |
| attention heads | 12 |
| context length | 1024 tokens |
| vocab size | 50257 |
| positional encoding | RoPE |
| norm | RMSNorm |
| activation | SwiGLU |
| tokenizer | GPT-2 tokenizer |
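As a rough sanity check on the ~124M figure, the sketch below counts parameters from the config values. Several details are assumptions not stated in this card: a SwiGLU hidden width of 2048, no bias terms, two RMSNorm layers per block plus one final norm, and an output head tied to the token embedding so it is counted once. RoPE is taken to add no learned parameters.

```python
# Values from the model config table.
VOCAB_SIZE = 50257
N_LAYER = 12
N_EMBD = 768

# ASSUMPTION: SwiGLU hidden width is not given in the card; 2048 is a guess.
SWIGLU_HIDDEN = 2048


def estimate_params(vocab_size, n_layer, n_embd, swiglu_hidden):
    """Estimate the parameter count of a bias-free, weight-tied decoder."""
    embedding = vocab_size * n_embd            # shared with the output head
    attention = 4 * n_embd * n_embd            # Q, K, V and output projections
    mlp = 3 * n_embd * swiglu_hidden           # gate, up and down projections
    norms_per_block = 2 * n_embd               # one RMSNorm weight vector each
    per_block = attention + mlp + norms_per_block
    final_norm = n_embd
    return embedding + n_layer * per_block + final_norm


total = estimate_params(VOCAB_SIZE, N_LAYER, N_EMBD, SWIGLU_HIDDEN)
print(f"estimated parameters: {total / 1e6:.1f}M")
```

Under these assumptions the estimate lands at about 123.6M, which is consistent with the ~124M in the table; a different SwiGLU width would shift it.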

training

  • base checkpoint: mrinaalarora/mrinaal-124m-base/model_best.safetensors
  • continued pretraining data: 1B GPT-2 tokens
  • validation data: 20M GPT-2 tokens
  • mix: 50% fineweb-edu-dedup, 30% dclm-baseline-1.0, 15% finemath-4plus, 5% cosmopedia-v2
  • dataset dir: /vol/datasets/cpt_mix_gpt2_1b_train
  • validation dir: /vol/datasets/cpt_mix_gpt2_20m_val
  • optimizer steps: 100000
  • train loss at saved checkpoint: 3.50816011428833
  • validation loss at saved checkpoint: 3.3754064226150513
  • hardware: NVIDIA H100
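The reported losses can be converted to perplexity for easier comparison with other models. This sketch assumes the losses are mean per-token cross-entropy in nats (the PyTorch default), which the card does not state explicitly.

```python
import math

# Loss values copied from the training summary above.
TRAIN_LOSS = 3.50816011428833
VAL_LOSS = 3.3754064226150513


def perplexity(mean_nll_nats):
    """Perplexity is the exponential of the mean negative log-likelihood."""
    return math.exp(mean_nll_nats)


train_ppl = perplexity(TRAIN_LOSS)
val_ppl = perplexity(VAL_LOSS)
print(f"train perplexity: {train_ppl:.1f}")
print(f"validation perplexity: {val_ppl:.1f}")
```

With those inputs, the validation perplexity comes out at roughly 29 and the train perplexity at roughly 33.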

files

  • model.safetensors — best continued-pretraining checkpoint
  • run_summary.json — full training run metadata
  • last.pt was not uploaded; this repo intentionally publishes the best checkpoint only.

loading

from safetensors.torch import load_file

# load_file returns a plain dict mapping tensor names to torch.Tensor objects.
state_dict = load_file("model.safetensors")

To use the weights with the original model class, clone the training repo and run:

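from safetensors.torch import load_file

from first_llm_pretrain.model import DecoderOnlyTransformer, ModelConfig

# These values mirror the model config section above.
config = ModelConfig(
    vocab_size=50257,
    block_size=1024,
    n_layer=12,
    n_head=12,
    n_embd=768,
)
model = DecoderOnlyTransformer(config)
# strict=False because the checkpoint has no lm_head.weight entry.
model.load_state_dict(load_file("model.safetensors"), strict=False)
model.eval()  # switch to inference mode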

strict=False is required because the safetensors conversion drops the duplicate lm_head.weight tensor and keeps only token_embedding.weight. The original model class ties those two weights, so the output head shares the values loaded into the embedding.
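Because strict=False also hides any other key mismatch, it can be worth checking what load_state_dict reports. The sketch below uses a toy stand-in model, not the real DecoderOnlyTransformer; `TinyTiedLM` and `load_allowing_tied_head` are hypothetical names, and only the tensor names token_embedding.weight and lm_head.weight are taken from this card.

```python
import torch
from torch import nn


class TinyTiedLM(nn.Module):
    """Toy stand-in (ASSUMPTION): an embedding and an output head that share weights."""

    def __init__(self, vocab_size=16, n_embd=8):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
        self.lm_head.weight = self.token_embedding.weight  # weight tying


def load_allowing_tied_head(model, state_dict, allowed_missing=("lm_head.weight",)):
    """Load non-strictly, but fail on any mismatch other than the tied head."""
    result = model.load_state_dict(state_dict, strict=False)
    missing = sorted(set(result.missing_keys) - set(allowed_missing))
    if missing or result.unexpected_keys:
        raise RuntimeError(
            f"unexpected mismatch: missing={missing}, "
            f"unexpected={list(result.unexpected_keys)}"
        )
    return result


# Simulate the converted checkpoint: same tensors, minus the duplicate head.
source = TinyTiedLM()
checkpoint = {
    name: tensor.clone()
    for name, tensor in source.state_dict().items()
    if name != "lm_head.weight"
}

target = TinyTiedLM()
result = load_allowing_tied_head(target, checkpoint)
print("missing keys:", result.missing_keys)
print("head still tied:", target.lm_head.weight is target.token_embedding.weight)
```

With the real model, the same helper would be called with `load_file("model.safetensors")` as the state dict, assuming the class exposes the same two tensor names.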
