mdlm-owt-diff1 — summary-conditioned MDLM (DIFF_1), 100k steps

DIFF_1 from the quentin-dlm cascade: a masked-diffusion LM finetuned from kuleshov-group/mdlm-owt to generate OpenWebText documents conditioned on a coarse summary prefix.

Layout: [summary 256 | text 768] tokens, total sequence length 1024. The summary prefix is always revealed (never masked), and the masked-CE NELBO is computed on the text region only. time_conditioning=False.
169.6M vendored Duo DiT backbone, GPT-2 tokenizer, vocab 50258 ([MASK]=50257, pad=eos=50256).
Data: EER6/openwebtext-coarse (doc_idx >= 2048; first 2048 held out).
Recipe: 100k steps, global batch 384 (8x GH200 DDP), lr 3e-4 cosine (warmup 500), AdamW(0.9, 0.95), wd 0, bf16, EMA 0.99.
These are the EMA weights of checkpoint-100000 (DiT backbone state_dict, same layout as mdlm-owt: model.safetensors at repo root).
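
The layout and loss masking described above can be illustrated with a minimal sketch. It is not the project's training code: the function name `corrupt`, the independent per-token masking with a single `mask_prob`, and the random stand-in tokens are assumptions made for illustration, and the NELBO's per-timestep weighting is omitted.

```python
import random

SEQ_LEN = 1024      # total sequence length (from the card)
SUMMARY_LEN = 256   # summary prefix length (from the card)
MASK_ID = 50257     # [MASK] token id (from the card)

def corrupt(tokens, mask_prob, rng):
    """Hypothetical helper: mask text-region tokens with probability mask_prob.

    Returns (noisy_tokens, loss_mask). The summary prefix is never masked,
    and loss_mask is True only at masked text positions.
    """
    assert len(tokens) == SEQ_LEN
    noisy = list(tokens)
    loss_mask = [False] * SEQ_LEN
    for i in range(SUMMARY_LEN, SEQ_LEN):
        if rng.random() < mask_prob:
            noisy[i] = MASK_ID
            loss_mask[i] = True
    return noisy, loss_mask

rng = random.Random(0)
# Stand-in token ids in [0, 50256]; a real input would come from the GPT-2 tokenizer.
tokens = [rng.randrange(50257) for _ in range(SEQ_LEN)]
noisy, loss_mask = corrupt(tokens, mask_prob=0.5, rng=rng)
```

In a training loop, cross-entropy would be evaluated only where `loss_mask` is True, so the prefix contributes context but no loss.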

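For readers reproducing the recipe, the learning-rate schedule and EMA update can be sketched as follows. The card only states "cosine (warmup 500)" and "EMA 0.99"; the linear warmup shape, the decay floor `min_lr=0.0`, and the function names `lr_at` and `ema_update` are assumptions, not details confirmed by the project code.

```python
import math

PEAK_LR = 3e-4         # from the card
WARMUP_STEPS = 500     # from the card
TOTAL_STEPS = 100_000  # from the card
EMA_DECAY = 0.99       # from the card

def lr_at(step, min_lr=0.0):
    """Assumed schedule: linear warmup to PEAK_LR, then cosine decay to min_lr."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))

def ema_update(ema, params, decay=EMA_DECAY):
    """One EMA step over flat lists of floats standing in for weight tensors."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]
```

With a decay of 0.99 the average spans roughly the last 1 / (1 - 0.99) = 100 optimizer steps, so the released EMA weights track the final raw weights closely.
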
Results / caveats: held-out val NELBO is 2.996 (ppl 20.0) vs 3.293 (ppl 26.9) for the trash-prefix control, which indicates strong conditioning: samples reproduce ~44% of summary content words, 5.5x the shuffled baseline. NOTE: the hot 100k finetune degraded sampling fluency (gen-PPL ~207 @ 512 sampling steps vs ~59 for the base model). See RESULTS_MDLM_100K.md in the project repo for the full diagnosis; earlier checkpoints sample better, and remasking samplers are recommended.
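
The perplexities quoted above follow from the NELBO values if the NELBO is read as nats per token, which is an assumption consistent with the reported numbers. A small check (the helper name `nelbo_to_ppl` is illustrative):

```python
import math

def nelbo_to_ppl(nelbo_nats):
    """Perplexity bound implied by a per-token NELBO measured in nats."""
    return math.exp(nelbo_nats)

conditioned = nelbo_to_ppl(2.996)  # summary-conditioned model
control = nelbo_to_ppl(3.293)      # trash-prefix control
gap = 3.293 - 2.996                # conditioning gain in nats per token

print(f"{conditioned:.1f} {control:.1f} {gap:.3f}")  # prints "20.0 26.9 0.297"
```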

Load (project code): call duo_core.load_model("EER6/mdlm-owt-diff1", 1024, 50258, device), or pass --init_ckpt EER6/mdlm-owt-diff1 to train/train_big_mdlm.py.

Companion control: EER6/mdlm-owt-trash.
