Joey 🐀: a diffusion language model, from scratch

Joey is a ~170M-parameter masked / absorbing-state diffusion language model (MDLM / LLaDA family), implemented from scratch in PyTorch. Instead of generating left-to-right like GPT, it generates by iterative denoising: starting from a fully [MASK]ed sequence and progressively unmasking, re-deciding low-confidence tokens along the way.
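
The denoising loop described above can be sketched in a few lines. This is a toy illustration, not Joey's actual sampler: `MASK`, `toy_denoiser`, and the linear unmasking schedule are assumptions made for the example, and a random stand-in replaces the real network so the code runs without weights. Repetition penalty and top-p filtering are left out.

```python
import random

MASK = -1  # hypothetical mask token id for this toy example


def toy_denoiser(tokens, rng):
    """Stand-in for the network: one (token, confidence) guess per position.

    A real model would produce logits from a bidirectional Transformer that
    reads `tokens`; here the guesses are random and ignore the input.
    """
    return [(rng.randrange(10), rng.random()) for _ in tokens]


def sample(length, steps, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * length  # start from a fully masked sequence
    for step in range(1, steps + 1):
        guesses = toy_denoiser(tokens, rng)
        # Number of positions allowed to be unmasked after this step
        # (assumed linear schedule; equals `length` on the final step).
        keep = (length * step) // steps
        # Rank every position by confidence, most confident first.
        ranked = sorted(range(length), key=lambda i: guesses[i][1], reverse=True)
        kept = set(ranked[:keep])
        # Commit the confident guesses and re-mask everything else,
        # including positions unmasked earlier that now score low.
        tokens = [guesses[i][0] if i in kept else MASK for i in range(length)]
    return tokens


print(sample(length=8, steps=4))
```

Because every position is re-scored at each step, a token committed early can be masked again later, which is one simple way to get the "re-deciding" behaviour. The final step keeps all positions, so the returned sequence contains no mask tokens.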

Code & full write-up: https://github.com/CLoaKY233/joey

Status: work in progress. This checkpoint is a small base + conversational fine-tune. It is fluent but capacity-limited: it learns grammar and conversational register, not sustained meaning. Scaling up is the next milestone.

Files

| File | Description |
| --- | --- |
| joey_chat.pt | Conversational model (base + DailyDialog SFT); use this to chat |
| joey_base.pt | Base model after pretraining (step 174k) |
| tok.json | The 16K ByteLevel BPE tokenizer |

Model details

| Property | Value |
| --- | --- |
| Parameters | ~170M |
| Backbone | Bidirectional Transformer (no causal mask), timestep-conditioned |
| d_model / layers / heads | 1024 / 12 / 16 |
| Context length | 256 |
| Vocabulary | 16,384 (custom ByteLevel BPE) |
| Objective | Masked diffusion, 1/t-weighted cross-entropy on masked positions |
| Training data | FineWeb-Edu (~2B tokens) |
| Fine-tuning | DailyDialog, response-only masking (LLaDA-style SFT) |
| Sampler | Remasking (MaskGIT/LLaDA) + repetition penalty + top-p |
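
The objective and the response-only masking listed above can be illustrated with a small standard-library sketch. It is not taken from the Joey code base: `MASK`, `corrupt`, `masked_diffusion_loss`, and `prompt_len` are hypothetical names, the model's log-probabilities are passed in as plain lists, and any normalization (per token or per sequence) is omitted.

```python
import math
import random

MASK = -1  # hypothetical mask token id


def corrupt(tokens, t, rng, prompt_len=0):
    """Mask each position independently with probability t.

    Positions before `prompt_len` are never masked, which mimics
    response-only masking in fine-tuning (use prompt_len=0 for pretraining).
    """
    return [
        MASK if i >= prompt_len and rng.random() < t else tok
        for i, tok in enumerate(tokens)
    ]


def masked_diffusion_loss(log_probs, targets, noisy, t):
    """1/t-weighted cross-entropy, summed over masked positions only.

    log_probs[i][v] is the model's log-probability of token v at position i.
    """
    nll = sum(
        -log_probs[i][targets[i]]
        for i, tok in enumerate(noisy)
        if tok == MASK
    )
    return nll / t


rng = random.Random(0)
targets = [0, 1, 2, 3]
t = rng.uniform(0.1, 1.0)  # sampled masking rate (lower bound avoids 1/0)
noisy = corrupt(targets, t, rng, prompt_len=1)
uniform = [[math.log(0.25)] * 4 for _ in targets]  # a model that knows nothing
print(noisy, masked_diffusion_loss(uniform, targets, noisy, t))
```

The 1/t factor compensates for the fact that a low masking rate leaves few masked positions to learn from, so each one is weighted more heavily. Unmasked positions, including the protected prompt, contribute nothing to the loss.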

Usage

Clone the code repo, place joey_chat.pt and tok.json in artifacts/, then:

uv run python scripts/chat.py

References

  • Sahoo et al., Simple and Effective Masked Diffusion Language Models (MDLM), NeurIPS 2024
  • Nie et al., Large Language Diffusion Models (LLaDA), 2025
  • Chang et al., MaskGIT: Masked Generative Image Transformers, CVPR 2022

License

MIT
