Diffusion Large Language Models for Visual Speech Recognition

Checkpoints from the DLLM-VSR paper, which adapts the Dream-7B discrete-diffusion LLM to Visual Speech Recognition (VSR) on LRS3.

Paper: arxiv.org/abs/2605.28456
Code: github.com/jh-y/dllm-vsr
Authors: Jeong Hun Yeo, Chae Won Kim, Hyeongseop Rha, Yong Man Ro

| Path | Description | Size |
| --- | --- | --- |
| `usr2/dream_stage2/` | USR 2.0 + Dream-7B stage 2 (LoRA + adapter) | 117 MB |
| `usr2/len_pred/` | Length predictor for USR 2.0 features | 8.2 MB |
| `avhubert/dream_stage2/` | AV-HuBERT + Dream-7B stage 2 | 102 MB |
| `avhubert/len_pred/` | Length predictor for AV-HuBERT features | 8.0 MB |

Each `dream_stage2/` directory holds `trainable_model.safetensors` (LoRA adapters plus the visual-feature projector). Each `len_pred/` directory holds `trainable_model.pt` (a small Transformer over visual features).

Note: Visual encoder weights (USR 2.0 Huge, AV-HuBERT Large) are not redistributed here. Download them from the original repos:

AV-HuBERT: https://github.com/facebookresearch/av_hubert
USR 2.0: https://github.com/ahaliassos/usr2

Results on LRS3 test (WER, %)

All models are trained on LRS3 (433 h) only.

| Decoding | USR 2.0 | AV-HuBERT |
| --- | --- | --- |
| Direct | 20.5 | 23.1 |
| Length-guided candidate decoding (paper main) | 19.5 | 21.9 |
| Oracle-length (upper-bound reference) | 17.7 | 20.2 |
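
For readers unfamiliar with the metric, the sketch below shows one common way to compute word error rate: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, expressed as a percentage. The function names are illustrative and not taken from the code repo, and the text normalization used for the paper's numbers (casing, punctuation) is not specified here, so this simply splits on whitespace.

```python
from typing import Iterable, List, Tuple


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference words
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent for a single utterance (whitespace tokenization is an assumption)."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref, hypothesis.split()) / len(ref)


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER in percent: total edits over total reference words."""
    errors = 0
    words = 0
    for reference, hypothesis in pairs:
        ref = reference.split()
        errors += edit_distance(ref, hypothesis.split())
        words += len(ref)
    if words == 0:
        raise ValueError("references must contain at least one word")
    return 100.0 * errors / words


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 edits over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Test-set figures are usually reported at corpus level, as in `corpus_wer`, rather than as an average of per-utterance rates; whether the table above follows that convention is an assumption.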

Usage

huggingface-cli download jh-y/dllm-vsr --local-dir ckpt

Then follow the code repo's README for environment setup, preprocessing (auto-avsr pipeline), and inference scripts.
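
Before running inference, it can help to confirm that the download produced the layout listed in the checkpoint table. The following is a minimal sketch using only the standard library; it assumes the `ckpt` directory from the command above, and `missing_checkpoints` is a hypothetical helper that is not part of the code repo.

```python
from pathlib import Path
from typing import List, Union

# Expected files, taken from the checkpoint table and file descriptions above.
EXPECTED_FILES = {
    "usr2/dream_stage2": "trainable_model.safetensors",
    "usr2/len_pred": "trainable_model.pt",
    "avhubert/dream_stage2": "trainable_model.safetensors",
    "avhubert/len_pred": "trainable_model.pt",
}


def missing_checkpoints(root: Union[str, Path]) -> List[str]:
    """Return the relative paths of expected checkpoint files not found under root."""
    root = Path(root)
    return [
        f"{subdir}/{name}"
        for subdir, name in EXPECTED_FILES.items()
        if not (root / subdir / name).is_file()
    ]


if __name__ == "__main__":
    missing = missing_checkpoints("ckpt")  # local dir used in the download command
    if missing:
        print("Missing checkpoint files:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected checkpoint files are present.")
```

This check covers only the files in this repository; the visual encoder weights still have to be obtained separately from the original repos.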

Citation

@article{yeo2026dllmvsr,
  title={Diffusion Large Language Models for Visual Speech Recognition},
  author={Yeo, Jeong Hun and Kim, Chae Won and Rha, Hyeongseop and Ro, Yong Man},
  journal={arXiv preprint arXiv:2605.28456},
  year={2026}
}
