Open LM 3B — Mid-Trained (Knowledge Cutoff May 2013)

This model is a mid-training continuation of the Apple Open LM 3B oracle model with a knowledge cutoff of May 2013, from the TiC-LM (Time-Continual Language Modeling) / Chrononauts project.

The mid-training stage re-exposes the model to pre-cutoff facts drawn from peS2o, Wikipedia, and DCLM to consolidate (rather than extend) the model's knowledge. No post-cutoff text is included.

Trained with LLaMA-Factory (finetuning_type: full, DeepSpeed ZeRO-2).

Model Details

Property	Value
Base model	`dogtooth/open-lm-3b-201305`
Architecture	LLaMA-style with QK norm (`OpenLMForCausalLM`, custom code)
Parameters	~2.8B
Knowledge cutoff	May 2013
Vocab size	50,432
Context length	2,048
Mid-train framework	LLaMA-Factory (full FT, DeepSpeed ZeRO-2)
Mid-train data	peS2o + Wikipedia + DCLM, pre-cutoff only

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "dogtooth/open-lm-3b-201305-midtrain",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "dogtooth/open-lm-3b-201305-midtrain", trust_remote_code=True
)
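
The snippet above only loads the weights. The following is a minimal completion sketch, assuming the custom `OpenLMForCausalLM` code supports the standard `generate()` method and accepts the tokenizer's default outputs. Plain-text prompts are used because the training config lists `template: empty`, so no chat template is assumed. The helpers `max_prompt_tokens` and `complete` are hypothetical names introduced for illustration and are not part of the repository.

```python
MODEL_ID = "dogtooth/open-lm-3b-201305-midtrain"
CONTEXT_LENGTH = 2048  # context length listed under Model Details


def max_prompt_tokens(max_new_tokens: int, context_length: int = CONTEXT_LENGTH) -> int:
    """Largest prompt length that still leaves room for max_new_tokens."""
    if not 0 < max_new_tokens < context_length:
        raise ValueError("max_new_tokens must be between 1 and context_length - 1")
    return context_length - max_new_tokens


def complete(prompt: str, max_new_tokens: int = 64) -> str:
    """Greedy continuation of a plain-text prompt (hypothetical helper)."""
    # Imported lazily so max_prompt_tokens works without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
    )
    # Truncate the prompt so prompt + generated tokens fit in the context window.
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=max_prompt_tokens(max_new_tokens),
    ).to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )
    # Decode only the tokens generated after the prompt.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `complete("The most recent Summer Olympics were held in")` downloads the weights on first use. Given the May 2013 cutoff, prompts like this are one way to spot-check which facts the model treats as current.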

Repository Contents

Final model weights at the repo root (model-*.safetensors)
Intermediate checkpoints in checkpoint-14000/, checkpoint-16000/, checkpoint-16034/ (HF-format weights only; DeepSpeed optimizer shards omitted)
Training logs and metrics: trainer_state.json, trainer_log.jsonl, all_results.json, train_results.json
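
The logged training loss can be recovered from `trainer_state.json`. This is a minimal sketch, assuming the file follows the usual Hugging Face `Trainer` layout with a `log_history` list of per-step dictionaries; that layout has not been checked against this repository's file. The function names are illustrative.

```python
import json
from typing import List, Tuple


def loss_curve(trainer_state: dict) -> List[Tuple[int, float]]:
    """Return (step, loss) pairs, skipping entries that carry no training loss."""
    points = []
    for entry in trainer_state.get("log_history", []):
        # Evaluation and end-of-run summary entries have no "loss" key.
        if "loss" in entry and "step" in entry:
            points.append((int(entry["step"]), float(entry["loss"])))
    return points


def load_loss_curve(path: str) -> List[Tuple[int, float]]:
    """Read a trainer_state.json file and extract its loss curve."""
    with open(path, encoding="utf-8") as f:
        return loss_curve(json.load(f))
```

The same function can be pointed at the `trainer_state.json` inside a checkpoint directory, if one is present there, to compare intermediate checkpoints.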

Citation

@article{jain2024ticlm,
  title={Time-Continual Learning from a Streaming Language Model},
  author={Jain, Ameya and Ramesh, Aakanksha and Li, Tianjian and others},
  journal={arXiv preprint arXiv:2410.14660},
  year={2024}
}

Mid-Training Data Recipe (201305 cutoff)

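Three text sources are concatenated without upsampling, packed into 2,048-token sequences, and used to train the model for one epoch. peS2o and Wikipedia are filtered by date; DCLM is not filtered and is assumed to be pre-cutoff.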

Source	Time filter	Documents	Est. tokens
peS2o (academic abstracts/full text)	published before May 2013	1,859,534	~1.0 B
Wikipedia (English)	first-revision date before May 2013	3,966,112	~3.5 B
DCLM (Common Crawl, filtered)	none (assumed pre-cutoff web text)	3,218,997	~4.5 B
Total		~9.0 M docs	~9.0 B
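
The date filters in the "Time filter" column can be pictured as follows. This is a sketch only, not the project's `prepare_midtrain_data.py`: it assumes "before May 2013" means strictly before 2013-05-01, and the `date` field name and ISO date format are hypothetical.

```python
from datetime import date
from typing import Dict, Iterable, Iterator

# Assumption: "before May 2013" is read as strictly before 2013-05-01.
CUTOFF = date(2013, 5, 1)


def is_pre_cutoff(doc_date: date, cutoff: date = CUTOFF) -> bool:
    """True if the document date falls strictly before the cutoff."""
    return doc_date < cutoff


def filter_pre_cutoff(
    docs: Iterable[Dict[str, str]],
    date_field: str = "date",  # hypothetical field holding an ISO "YYYY-MM-DD" date
    cutoff: date = CUTOFF,
) -> Iterator[Dict[str, str]]:
    """Yield only the documents dated before the cutoff."""
    for doc in docs:
        if is_pre_cutoff(date.fromisoformat(doc[date_field]), cutoff):
            yield doc
```

For Wikipedia the date would be the first-revision date and for peS2o the publication date, as the table states. DCLM has no date to filter on.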

Token estimates assume ~4 characters per token (0.25 tokens/char). Ratios measured with the OpenLM tokenizer are lower, at ~0.21–0.23 tokens/char, so the measured counts are roughly 8–16% below the table values; the table reports the 4-character approximation throughout. See the project repo for the per-cutoff data prep code (prepare_midtrain_data.py) and the slice statistics (stats.json).
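
To see what the difference between the two ratios means for the total, the table value can be rescaled from 0.25 tokens/char to the measured range. This is a small arithmetic sketch; the function name is illustrative.

```python
TABLE_TOKENS_PER_CHAR = 0.25  # the table's ~4 characters per token


def rescale_estimate(table_tokens: float, measured_tokens_per_char: float) -> float:
    """Convert a 4-chars-per-token estimate to a measured tokens/char ratio."""
    return table_tokens * measured_tokens_per_char / TABLE_TOKENS_PER_CHAR


low = rescale_estimate(9.0e9, 0.21)
high = rescale_estimate(9.0e9, 0.23)
print(f"{low / 1e9:.2f} B to {high / 1e9:.2f} B tokens")  # 7.56 B to 8.28 B tokens
```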

LLaMA-Factory dataset wiring

dataset: midtrain_pes2o_pre201305,midtrain_wiki_pre201305,midtrain_dclm
template: empty
cutoff_len: 2048
mix_strategy: concat
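
With `template: empty`, `cutoff_len: 2048`, and `mix_strategy: concat`, documents are presumably tokenized as raw text, joined end to end, and cut into fixed-length training sequences. The following is a minimal sketch of that kind of packing, not LLaMA-Factory's actual implementation. The EOS separator between documents, the dropped tail, and the function name are assumptions.

```python
from typing import Iterable, Iterator, List


def pack_sequences(
    docs: Iterable[List[int]],
    block_size: int = 2048,
    eos_id: int = 0,  # assumption: placeholder id, use the tokenizer's real EOS id
) -> Iterator[List[int]]:
    """Concatenate tokenized documents and yield fixed-length blocks."""
    buffer: List[int] = []
    for token_ids in docs:
        buffer.extend(token_ids)
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # Any tokens left in the buffer are dropped (assumed behaviour).


# Tiny example with block_size=4 so the result is easy to read.
blocks = list(pack_sequences([[1, 2, 3], [4, 5], [6]], block_size=4, eos_id=0))
print(blocks)  # [[1, 2, 3, 0], [4, 5, 0, 6]]
```

Packing means a training sequence can span several short documents, which is why the document count and the number of 2,048-token sequences differ.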

Per-source files (relative to the dataset root):

midtrain/pes2o_slices/pes2o_pre201305_1b.jsonl
midtrain/wiki_slices/wiki_pre201305.jsonl
midtrain/dclm_4_5b.jsonl

All three are JSONL files in which each record has a single `text` field.
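
LLaMA-Factory resolves dataset names through a `dataset_info.json` registry. The sketch below builds entries that could map the three dataset names to these files. It assumes the registry's `file_name` and `columns` keys, with the `prompt` column pointing at `text` as LLaMA-Factory's data documentation describes for pre-training corpora. The name-to-file pairing is inferred from the naming; the project's real registry is not shown in this card.

```python
import json
from typing import Dict

# Assumption: each dataset name pairs with the similarly named file.
SOURCES = {
    "midtrain_pes2o_pre201305": "midtrain/pes2o_slices/pes2o_pre201305_1b.jsonl",
    "midtrain_wiki_pre201305": "midtrain/wiki_slices/wiki_pre201305.jsonl",
    "midtrain_dclm": "midtrain/dclm_4_5b.jsonl",
}


def build_dataset_info(sources: Dict[str, str]) -> Dict[str, dict]:
    """Build registry entries for raw-text corpora with a single text field."""
    return {
        name: {"file_name": path, "columns": {"prompt": "text"}}
        for name, path in sources.items()
    }


dataset_info = build_dataset_info(SOURCES)
print(json.dumps(dataset_info, indent=2))
```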

Training hyperparameters

Hyperparameter	Value
Framework	LLaMA-Factory `stage: pt`, `finetuning_type: full`
Distributed strategy	DeepSpeed ZeRO-2
Precision	bf16
GPUs	4 × H200
Per-device batch size	64 sequences
Gradient accumulation	1
Effective batch (tokens)	4 × 64 × 1 × 2,048 = 524,288 / step
Learning rate	5.0e-5, cosine schedule, 3% warmup
Epochs	1.0
Total optimizer steps	16,034
Tokens consumed	~8.4 B (≈ 1 pass over the corpus)
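
The batch and token figures in the table follow directly from the other rows, and the learning-rate row describes a standard warmup-then-cosine shape. The sketch below reproduces the arithmetic and a linear-warmup cosine schedule. Rounding the warmup step count up and decaying to zero are assumptions modeled on common Hugging Face defaults, not values confirmed by this card.

```python
import math

NUM_GPUS = 4
PER_DEVICE_BATCH = 64
GRAD_ACCUM = 1
SEQ_LEN = 2048
TOTAL_STEPS = 16_034
PEAK_LR = 5.0e-5
WARMUP_RATIO = 0.03

tokens_per_step = NUM_GPUS * PER_DEVICE_BATCH * GRAD_ACCUM * SEQ_LEN
total_tokens = tokens_per_step * TOTAL_STEPS
# Assumption: the warmup step count is rounded up.
warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero at TOTAL_STEPS."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, TOTAL_STEPS - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(tokens_per_step)  # 524288
print(total_tokens)     # 8406433792
print(warmup_steps)     # 482
```

The product of steps and tokens per step, about 8.41 B, matches the "~8.4 B" tokens consumed in the table.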

Why mid-train?

Mid-training is meant to consolidate what the model already knows rather than to extend it. Because the training data is restricted to pre-cutoff text, the May 2013 knowledge cutoff is preserved while the model's representation of pre-cutoff knowledge is strengthened.