Open LM 3B — Mid-Trained (Knowledge Cutoff May 2013)

Mid-training continuation of the Apple Open LM 3B oracle model with knowledge cutoff May 2013, from the TiC-LM (Time-Continual Language Modeling) / Chrononauts project.

The mid-training stage re-exposes the model to pre-cutoff facts drawn from peS2o, Wikipedia, and DCLM to consolidate (rather than extend) the model's knowledge. No post-cutoff text is included.

Trained with LLaMA-Factory (finetuning_type: full, DeepSpeed ZeRO-2).

Model Details

| Property | Value |
|---|---|
| Base model | dogtooth/open-lm-3b-201305 |
| Architecture | LLaMA-style with QK norm (OpenLMForCausalLM, custom code) |
| Parameters | ~2.8B |
| Knowledge cutoff | May 2013 |
| Vocab size | 50,432 |
| Context length | 2,048 |
| Mid-train framework | LLaMA-Factory (full FT, DeepSpeed ZeRO-2) |
| Mid-train data | peS2o + Wikipedia + DCLM, pre-cutoff only |

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# trust_remote_code=True is needed because the architecture
# (OpenLMForCausalLM) ships as custom code with the repo.
model = AutoModelForCausalLM.from_pretrained(
    "dogtooth/open-lm-3b-201305-midtrain",
    dtype=torch.bfloat16,  # weights are stored in BF16
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "dogtooth/open-lm-3b-201305-midtrain", trust_remote_code=True
)
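Since the model is mid-trained with the pre-training objective (stage: pt, template: empty), it is used as a plain-text completion model rather than a chat model. The helper below is a minimal sketch of greedy completion; the function name `complete` and its defaults are illustrative and not part of the repository, and it assumes the custom model class supports the standard `generate` interface.

```python
def complete(prompt, model, tokenizer, max_new_tokens=40):
    """Greedy plain-text continuation (hypothetical helper, not from the repo)."""
    # Tokenize and move the tensors to the device holding the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # do_sample=False selects greedy decoding, which keeps runs repeatable.
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    # The returned sequence contains the prompt followed by the continuation.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With `model` and `tokenizer` loaded as above, `print(complete("The capital of France is", model, tokenizer))` prints the prompt together with its continuation. Keep the prompt plus `max_new_tokens` within the 2,048-token context length.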

Repository Contents

  • Final model weights at the repo root (model-*.safetensors)
  • Intermediate checkpoints in checkpoint-14000/, checkpoint-16000/, checkpoint-16034/ (HF-format weights only; DeepSpeed optimizer shards omitted)
  • trainer_state.json, trainer_log.jsonl, all_results.json, train_results.json

Citation

@article{jain2024ticlm,
  title={Time-Continual Learning from a Streaming Language Model},
  author={Jain, Ameya and Ramesh, Aakanksha and Li, Tianjian and others},
  journal={arXiv preprint arXiv:2410.14660},
  year={2024}
}

Mid-Training Data Recipe (201305 cutoff)

Three pre-cutoff text sources are concatenated (no upsampling) and packed into 2,048-token sequences; the model is trained on the result for one epoch.

| Source | Time filter | Documents | Est. tokens |
|---|---|---|---|
| peS2o (academic abstracts/full text) | published before May 2013 | 1,859,534 | ~1.0 B |
| Wikipedia (English) | first-revision date before May 2013 | 3,966,112 | ~3.5 B |
| DCLM (Common Crawl, filtered) | none (assumed pre-cutoff web text) | 3,218,997 | ~4.5 B |
| Total | | ~9.0 M docs | ~9.0 B |

Token estimates assume ~4 characters per token. Ratios measured with the OpenLM tokenizer are ~0.21–0.23 tokens/char (about 4.3–4.8 characters per token), so the table, which reports the 4-character approximation, somewhat overstates the token counts. See the project repo for the per-cutoff data prep code (prepare_midtrain_data.py) and the slice statistics (stats.json).
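The arithmetic behind the table totals can be reproduced in a few lines. This is a rough sketch that uses only the rounded figures from the table, so the results are indicative rather than exact; the variable names are illustrative and do not come from prepare_midtrain_data.py.

```python
# Document counts and rounded token estimates copied from the table above.
SOURCES = {
    "peS2o": (1_859_534, 1.0e9),
    "Wikipedia": (3_966_112, 3.5e9),
    "DCLM": (3_218_997, 4.5e9),
}
CHARS_PER_TOKEN_ASSUMED = 4.0            # approximation used for the table
MEASURED_TOKENS_PER_CHAR = (0.21, 0.23)  # measured range quoted in the card

total_docs = sum(docs for docs, _ in SOURCES.values())
total_est_tokens = sum(tokens for _, tokens in SOURCES.values())

# Character count implied by the 4-chars-per-token table estimate.
implied_chars = total_est_tokens * CHARS_PER_TOKEN_ASSUMED

# Re-estimate tokens from those characters using the measured ratios.
low_tokens = implied_chars * MEASURED_TOKENS_PER_CHAR[0]
high_tokens = implied_chars * MEASURED_TOKENS_PER_CHAR[1]

print(f"documents: {total_docs:,}")
print(f"table estimate: {total_est_tokens / 1e9:.1f} B tokens")
print(f"measured-ratio range: {low_tokens / 1e9:.2f} to {high_tokens / 1e9:.2f} B tokens")
```

The document total is 9,044,643, which the table rounds to ~9.0 M. Under the measured ratios, the same text corresponds to fewer tokens than the 4-character approximation suggests.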

LLaMA-Factory dataset wiring

dataset: midtrain_pes2o_pre201305,midtrain_wiki_pre201305,midtrain_dclm
template: empty
cutoff_len: 2048
mix_strategy: concat

Per-source files (relative to the dataset root):

  • midtrain/pes2o_slices/pes2o_pre201305_1b.jsonl
  • midtrain/wiki_slices/wiki_pre201305.jsonl
  • midtrain/dclm_4_5b.jsonl

All three are jsonl with a single text column.
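For readers reproducing the setup, the sketch below shows one way the three files could be registered in LLaMA-Factory's `dataset_info.json` and how a line can be checked against the single-column format. The dataset names and paths come from this card; the entry layout (`file_name` plus a `columns` mapping from `prompt` to `text`) follows LLaMA-Factory's convention for pre-training corpora and is an assumption about this project, whose actual `dataset_info.json` is not shown here. The helper names are hypothetical.

```python
import json

# Names and paths from the card; the entry layout is an assumption.
SOURCES = {
    "midtrain_pes2o_pre201305": "midtrain/pes2o_slices/pes2o_pre201305_1b.jsonl",
    "midtrain_wiki_pre201305": "midtrain/wiki_slices/wiki_pre201305.jsonl",
    "midtrain_dclm": "midtrain/dclm_4_5b.jsonl",
}

def build_dataset_info(sources):
    """Map each dataset name to a pre-training style entry reading `text`."""
    return {
        name: {"file_name": path, "columns": {"prompt": "text"}}
        for name, path in sources.items()
    }

def is_valid_record(line):
    """True if a jsonl line is an object with exactly one string field, `text`."""
    record = json.loads(line)
    return (
        isinstance(record, dict)
        and set(record) == {"text"}
        and isinstance(record["text"], str)
    )

dataset_info = build_dataset_info(SOURCES)
print(json.dumps(dataset_info, indent=2))
```

The keys of `dataset_info` match the names listed in the `dataset:` setting, in the same order in which they are concatenated.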

Training hyperparameters

| Hyperparameter | Value |
|---|---|
| Framework | LLaMA-Factory (stage: pt, finetuning_type: full) |
| Distributed strategy | DeepSpeed ZeRO-2 |
| Precision | bf16 |
| GPUs | 4 × H200 |
| Per-device batch | 64 |
| Gradient accumulation | 1 |
| Effective batch (tokens) | 4 × 64 × 2048 = 524,288 / step |
| Learning rate | 5.0e-5, cosine schedule, 3% warmup |
| Epochs | 1.0 |
| Total optimizer steps | 16,034 |
| Tokens consumed | ~8.4 B (≈ 1 pass over the corpus) |
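The batch and token figures above follow directly from the listed hyperparameters, and the learning-rate schedule can be sketched in the same way. The snippet below is illustrative only: it assumes every packed sequence holds the full 2,048 tokens, that warmup steps are rounded up from 3% of the total, and that the cosine curve decays to zero at the last step. None of these details are confirmed by the card, and `lr_at` is a hypothetical helper.

```python
import math

NUM_GPUS = 4
PER_DEVICE_BATCH = 64
GRAD_ACCUM = 1
SEQ_LEN = 2048
TOTAL_STEPS = 16_034
PEAK_LR = 5.0e-5
WARMUP_RATIO = 0.03

# Tokens processed per optimizer step, assuming fully packed sequences.
tokens_per_step = NUM_GPUS * PER_DEVICE_BATCH * GRAD_ACCUM * SEQ_LEN
tokens_consumed = tokens_per_step * TOTAL_STEPS

# Assumption: warmup length is 3% of total steps, rounded up.
warmup_steps = math.ceil(WARMUP_RATIO * TOTAL_STEPS)

def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, TOTAL_STEPS - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(f"tokens per step: {tokens_per_step:,}")
print(f"tokens consumed: {tokens_consumed / 1e9:.2f} B")
print(f"warmup steps: {warmup_steps}")
```

Under these assumptions the run processes 524,288 tokens per step and about 8.41 B tokens over 16,034 steps, in line with the ~8.4 B reported in the table.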

Why mid-train?

The mid-training stage re-exposes the model to pre-cutoff facts drawn from peS2o, Wikipedia, and DCLM to consolidate (rather than extend) the model's knowledge. No post-cutoff text is included, so the knowledge cutoff date is preserved while the representation of pre-cutoff knowledge is strengthened.
