Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models (ICLR 2025 Oral)

By Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, Volodymyr Kuleshov

We introduce BD3-LMs, a family of Block Discrete Denoising Diffusion Language Models that achieve SOTA likelihoods among diffusion models and enable generation of arbitrary-length sequences. BD3-LMs combine the strengths of autoregressive and diffusion language models by decomposing a token sequence into blocks and performing discrete diffusion within each block. By tuning the block size, we interpolate between autoregressive and diffusion models, which introduces a trade-off between quality and sample efficiency. We propose a recipe for building effective BD3-LMs that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules to minimize the variance.
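The block decomposition described above can be illustrated with a minimal sketch. The helper names `split_into_blocks` and `block_causal_mask` are hypothetical and are not taken from the BD3-LMs codebase. The mask only shows the coarse dependency structure (each block conditions on itself and on earlier blocks); the attention pattern used in actual BD3-LM training may differ.

```python
from typing import List


def split_into_blocks(tokens: List[int], block_size: int) -> List[List[int]]:
    """Split a token sequence into consecutive blocks of equal size.

    Assumes, for simplicity, that the sequence length is a multiple of
    block_size (hypothetical helper, for illustration only).
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if len(tokens) % block_size != 0:
        raise ValueError("sequence length must be a multiple of block_size")
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]


def block_causal_mask(seq_len: int, block_size: int) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may depend on position j.

    Position i sees every position in its own block and in all earlier
    blocks. With block_size == 1 this is the usual causal (lower-triangular)
    mask; with block_size == seq_len every position sees every other one.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    print(split_into_blocks([10, 11, 12, 13, 14, 15], block_size=2))
    for row in block_causal_mask(seq_len=4, block_size=2):
        print(["x" if allowed else "." for allowed in row])
```

Setting `block_size` to 1 gives the autoregressive end of the interpolation, and setting it to the full sequence length gives the diffusion end, where a single block covers the whole sequence.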

Model Description

This is a retrained SEDD baseline model. Unlike Lou et al., we train our SEDD baseline on OpenWebText without injecting BOS/EOS tokens at the beginning and end of the training context. This lets us analyze the lengths of generated samples at inference without artificial BOS/EOS injection confounding the length statistics.

How to use

See our GitHub README, where we provide sample scripts for training, likelihood evaluation, and generation.

Citation

@inproceedings{
arriola2025block,
title={Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models},
author={Marianne Arriola and Aaron Gokaslan and Justin T Chiu and Zhihan Yang and Zhixuan Qi and Jiaqi Han and Subham Sekhar Sahoo and Volodymyr Kuleshov},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://arxiv.org/abs/2503.09573}
}