Joey: a diffusion language model, from scratch
Joey is a ~170M-parameter masked / absorbing-state diffusion language model (MDLM / LLaDA
family), implemented from scratch in PyTorch. Instead of generating left-to-right like GPT, it
generates by iterative denoising: starting from a fully [MASK]ed sequence and progressively
unmasking, re-deciding low-confidence tokens along the way.
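
As a rough illustration of that loop, here is a minimal, framework-free sketch. It is not the repository's sampler: `MASK`, `toy_denoiser`, and `sample` are hypothetical names, the denoiser returns random distributions in place of the real network, the unmasking schedule is assumed to be linear, and committed tokens are assumed to stay fixed. The repetition penalty and top-p steps of the actual sampler are left out.

```python
import random

MASK = -1  # hypothetical id standing in for the [MASK] token


def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for the real network: one random probability row per position."""
    probs = []
    for _ in tokens:
        weights = [rng.random() for _ in range(vocab_size)]
        total = sum(weights)
        probs.append([w / total for w in weights])
    return probs


def sample(denoiser, length=16, steps=8, vocab_size=32, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * length  # start from a fully masked sequence
    for step in range(1, steps + 1):
        probs = denoiser(tokens, vocab_size, rng)
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        # Greedy guess and its probability (the "confidence") per masked slot.
        guess = {i: max(range(vocab_size), key=probs[i].__getitem__) for i in masked}
        conf = {i: probs[i][guess[i]] for i in masked}
        # Assumed linear schedule: positions that stay masked after this step.
        keep_masked = length * (steps - step) // steps
        n_commit = len(masked) - keep_masked
        # Commit the most confident guesses; the rest stay masked and are
        # decided again on a later step.
        for i in sorted(masked, key=conf.__getitem__, reverse=True)[:n_commit]:
            tokens[i] = guess[i]
    return tokens


print(sample(toy_denoiser))
```

With the default arguments, two positions are committed per step, so the sequence is fully unmasked after the eighth step.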
Code & full write-up: https://github.com/CLoaKY233/joey
Status: work in progress. This checkpoint is a small base + conversational fine-tune. It is fluent but capacity-limited: it learns grammar and conversational register, not sustained meaning. Scaling up is the next milestone.
Files
| File | Description |
|---|---|
| joey_chat.pt | Conversational model (base + DailyDialog SFT); use this to chat |
| joey_base.pt | Base model after pretraining (step 174k) |
| tok.json | The 16K ByteLevel BPE tokenizer |
Model details
| Property | Value |
|---|---|
| Parameters | ~170M |
| Backbone | Bidirectional Transformer (no causal mask), timestep-conditioned |
| d_model / layers / heads | 1024 / 12 / 16 |
| Context length | 256 |
| Vocabulary | 16,384 (custom ByteLevel BPE) |
| Objective | Masked diffusion, 1/t-weighted cross-entropy on masked positions |
| Training data | FineWeb-Edu (~2B tokens) |
| Fine-tuning | DailyDialog, response-only masking (LLaDA-style SFT) |
| Sampler | Remasking (MaskGIT/LLaDA) + repetition penalty + top-p |
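
The Objective and Fine-tuning rows above can be sketched as a single loss function. This is a minimal sketch, not the repository's training code: `masked_diffusion_loss`, `model`, `mask_id`, `maskable`, and `eps` are hypothetical names, and it assumes a linear noise schedule (each token is masked with probability t) and normalisation by the number of maskable positions.

```python
import torch
import torch.nn.functional as F


def masked_diffusion_loss(model, x0, mask_id, maskable=None, eps=1e-3):
    """1/t-weighted cross-entropy on masked positions.

    model    : hypothetical callable (tokens, t) -> logits of shape (B, L, V)
    x0       : clean token ids, shape (B, L)
    maskable : optional bool tensor (B, L); False marks positions that are
               never masked (e.g. the prompt in response-only SFT)
    """
    B, L = x0.shape
    if maskable is None:
        maskable = torch.ones_like(x0, dtype=torch.bool)
    # One noise level per sequence, kept away from 0 so 1/t stays finite.
    t = eps + (1 - eps) * torch.rand(B, device=x0.device)
    # Forward process: each maskable token becomes [MASK] with probability t.
    is_masked = (torch.rand(B, L, device=x0.device) < t[:, None]) & maskable
    xt = x0.masked_fill(is_masked, mask_id)
    logits = model(xt, t)
    # Per-token cross-entropy; cross_entropy expects the class dim second.
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # (B, L)
    # Only masked positions contribute, each weighted by 1/t.
    weighted = ce * is_masked / t[:, None]
    return weighted.sum() / maskable.sum().clamp(min=1)
```

Under these assumptions, pretraining would call it with `maskable=None`, while response-only SFT would pass a mask that is True only on response tokens, so the prompt stays visible at every noise level.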
Usage
Clone the code repo, place joey_chat.pt and tok.json in artifacts/, then run:

    uv run python scripts/chat.py
References
- Sahoo et al., Simple and Effective Masked Diffusion Language Models (MDLM), NeurIPS 2024
- Nie et al., Large Language Diffusion Models (LLaDA), 2025
- Chang et al., MaskGIT: Masked Generative Image Transformers, CVPR 2022
License
MIT