Joey 🐤 — a diffusion language model, from scratch

Joey is a ~170M-parameter masked (absorbing-state) diffusion language model in the MDLM / LLaDA family, implemented from scratch in PyTorch. Instead of generating left-to-right like GPT, it generates by iterative denoising: it starts from a fully [MASK]ed sequence, unmasks it progressively over several steps, and remasks low-confidence tokens along the way so they can be predicted again.
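
The unmasking loop can be sketched in plain Python. This is a minimal, framework-free illustration and not Joey's actual sampler: `toy_denoiser` is a hypothetical stand-in for the network that returns a random token and confidence per position, `MASK = -1` is an arbitrary placeholder id, and the linear unmasking schedule is an assumption. At each step every position is re-predicted, the most confident predictions are kept, and the rest are masked again.

```python
import random

MASK = -1  # placeholder mask id for this sketch (assumption, not Joey's real id)


def toy_denoiser(tokens, rng):
    """Hypothetical stand-in for the network.

    Returns one (token_id, confidence) proposal per position. A real model
    would condition on `tokens` and the timestep; this one ignores both.
    """
    return [(rng.randrange(100), rng.random()) for _ in tokens]


def generate(length, steps, rng):
    """Start fully masked and commit more positions at every step."""
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        proposals = toy_denoiser(tokens, rng)
        # Linear schedule (assumed): after step k, k/steps of the positions are committed.
        n_keep = length * step // steps
        # Rank positions by confidence, highest first.
        ranked = sorted(range(length), key=lambda i: proposals[i][1], reverse=True)
        keep = set(ranked[:n_keep])
        # Keep the confident proposals; remask everything else for the next step.
        tokens = [proposals[i][0] if i in keep else MASK for i in range(length)]
    return tokens


sample = generate(length=16, steps=8, rng=random.Random(0))
```

Because the final step keeps every position, the returned sequence contains no mask tokens. A real sampler would replace `toy_denoiser` with a forward pass and could layer top-p filtering or a repetition penalty on top of the proposals.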

Code & full write-up: https://github.com/CLoaKY233/joey

Status: work in progress. This release is a small base model plus a conversational fine-tune. It is fluent but capacity-limited: it learns grammar and conversational register, not sustained meaning. Scaling up is the next milestone.

Files

File	Description
`joey_chat.pt`	Conversational model (base + DailyDialog SFT) — use this to chat
`joey_base.pt`	Base model after pretraining (step 174k)
`tok.json`	The 16K ByteLevel BPE tokenizer

Model details

Property	Value
Parameters	~170M
Backbone	Bidirectional Transformer (no causal mask), timestep-conditioned
`d_model` / layers / heads	1024 / 12 / 16
Context length	256
Vocabulary	16,384 (custom ByteLevel BPE)
Objective	Masked diffusion, `1/t`-weighted cross-entropy on masked positions
Training data	FineWeb-Edu (~2B tokens)
Fine-tuning	DailyDialog, response-only masking (LLaDA-style SFT)
Sampler	Remasking (MaskGIT/LLaDA) + repetition penalty + top-p
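
The objective and fine-tuning rows can be made concrete with a small plain-Python sketch. It assumes a forward process that masks each eligible token independently with probability `t`, and it normalizes by sequence length as LLaDA does; Joey's training code may differ on both points. The function names, the `maskable` argument, and the nested-list layout of `log_probs` are hypothetical, chosen only for illustration. Setting `maskable` to `False` on prompt positions gives response-only masking.

```python
import math
import random


def mask_tokens(tokens, t, mask_id, rng, maskable=None):
    """Mask each eligible token independently with probability t.

    `maskable[i] = False` protects position i (e.g. the prompt during SFT).
    """
    if maskable is None:
        maskable = [True] * len(tokens)
    return [
        mask_id if ok and rng.random() < t else tok
        for tok, ok in zip(tokens, maskable)
    ]


def masked_diffusion_loss(log_probs, tokens, noisy, t, mask_id):
    """1/t-weighted cross-entropy over masked positions only.

    log_probs[i][v] is the model's log-probability of token v at position i.
    """
    nll = sum(
        -log_probs[i][tok]
        for i, (tok, seen) in enumerate(zip(tokens, noisy))
        if seen == mask_id
    )
    return nll / (t * len(tokens))


# Worked example: a uniform model over 4 tokens, two of four positions masked.
MASK_ID = 4
tokens = [0, 1, 2, 3]
noisy = [MASK_ID, 1, MASK_ID, 3]
uniform = [[math.log(0.25)] * 4 for _ in tokens]
loss = masked_diffusion_loss(uniform, tokens, noisy, t=0.5, mask_id=MASK_ID)

# Response-only masking: the first two (prompt) positions are never masked.
sft_noisy = mask_tokens(tokens, 1.0, MASK_ID, random.Random(0), [False, False, True, True])
```

In the worked example each masked position contributes log 4 to the sum, and the `1/t` factor compensates for only about a fraction `t` of positions being masked at noise level `t`.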

Usage

Clone the code repo, place `joey_chat.pt` and `tok.json` in `artifacts/`, then:

uv run python scripts/chat.py

References

Sahoo et al., Simple and Effective Masked Diffusion Language Models (MDLM), NeurIPS 2024
Nie et al., Large Language Diffusion Models (LLaDA), 2025
Chang et al., MaskGIT: Masked Generative Image Transformers, CVPR 2022

License

MIT
