Talker-T2AV

Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

Paper (arXiv 2604.23586) · Code (GitHub) · Samples

This repository hosts the pretrained weights for the paper "Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling".

talker-t2av/
  model.safetensors            ← AR backbone (Qwen3-0.6B) + dual diffusion heads
                                  + Patch Transformer Encoder + Stop Predictor
  config.json
  chat_template.jinja
  tokenizer.json
  tokenizer_config.json

whisperx-vae/
  model.ckpt                   ← WhisperX-VAE audio autoencoder
                                  (32-d, 25 Hz; Whisper-Large-v3 encoder + DAC backbone)
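
A quick way to confirm that a download matches the layout above is to check for each listed file. The following is a minimal sketch using only the Python standard library; `EXPECTED_FILES` and `missing_files` are hypothetical helper names for illustration and are not part of the repository.

```python
from pathlib import Path

# Relative paths copied from the layout listed above.
EXPECTED_FILES = [
    "talker-t2av/model.safetensors",
    "talker-t2av/config.json",
    "talker-t2av/chat_template.jinja",
    "talker-t2av/tokenizer.json",
    "talker-t2av/tokenizer_config.json",
    "whisperx-vae/model.ckpt",
]


def missing_files(root):
    """Return the expected relative paths that are not regular files under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_files("./hf_weights")` after the download should return an empty list if every file is in place.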

Two further weight files must be fetched separately; the corresponding model code is already vendored in the GitHub repo. For the LIA-X video motion autoencoder (40-d motion, 25 Hz), the code lives under lia_x/, and only the lia-x.pt weight file needs to be downloaded from wyhsirius/LIA-X. For the WavLM-Large fine-tuned speaker encoder, the code lives under speaker_verification/, and only the wavlm_large_finetune.pth weights need to be obtained from Microsoft UniSpeech.
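
Because the audio and motion latents are both quoted at 25 Hz, a clip of a given duration maps to the same number of frames in each stream, with 32 and 40 channels respectively. The sketch below only illustrates that arithmetic. The function name `latent_shapes` is hypothetical, and truncating to a whole number of frames is an assumption; the rounding or padding actually used by the models is not stated here.

```python
LATENT_RATE_HZ = 25     # rate quoted for both autoencoders
AUDIO_LATENT_DIM = 32   # WhisperX-VAE latent width
MOTION_LATENT_DIM = 40  # LIA-X motion latent width


def latent_shapes(duration_s):
    """Return ((frames, 32), (frames, 40)) for a clip lasting duration_s seconds.

    Assumption: the frame count is duration * rate truncated to an integer.
    """
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    frames = int(duration_s * LATENT_RATE_HZ)
    return (frames, AUDIO_LATENT_DIM), (frames, MOTION_LATENT_DIM)
```

For example, `latent_shapes(4.0)` gives `((100, 32), (100, 40))`: a 4-second clip corresponds to 100 latent frames in each stream.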

Quickstart

git clone https://github.com/zhenye234/Talker-T2AV.git
cd Talker-T2AV

# put the HF-hosted weights in place
huggingface-cli download HKUSTAudio/Talker-T2AV --local-dir ./hf_weights
export CHECKPOINT_DIR="$(pwd)/hf_weights/talker-t2av"
export WHISPERVAE_CKPT="$(pwd)/hf_weights/whisperx-vae/model.ckpt"

# the two extra weight files (their code is already vendored, so there is no need to clone the repos)
export LIAX_CKPT=/path/to/lia-x.pt
export WAVLM_CKPT=/path/to/wavlm_large_finetune.pth

python infer.py
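
Before running inference, it can help to confirm that the four paths exported above actually exist. The following is a minimal preflight sketch, assuming `infer.py` picks these paths up from the environment as the quickstart suggests; `REQUIRED_PATHS` and `check_env` are hypothetical names and not part of the repository.

```python
import os
from pathlib import Path

# Variable names come from the quickstart; "dir" or "file" reflects how
# each one is set there (a checkpoint directory versus a single weight file).
REQUIRED_PATHS = {
    "CHECKPOINT_DIR": "dir",
    "WHISPERVAE_CKPT": "file",
    "LIAX_CKPT": "file",
    "WAVLM_CKPT": "file",
}


def check_env(env=None):
    """Return a list of human-readable problems; an empty list means all paths exist."""
    env = os.environ if env is None else env
    problems = []
    for name, kind in REQUIRED_PATHS.items():
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
            continue
        path = Path(value)
        exists = path.is_dir() if kind == "dir" else path.is_file()
        if not exists:
            problems.append(f"{name} does not point to an existing {kind}: {value}")
    return problems
```

Printing the result of `check_env()` before launching inference surfaces an unset variable or a placeholder such as `/path/to/lia-x.pt` that was never replaced.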

See the GitHub README for full installation and reproduction instructions.

Citation

@misc{ye2026talkert2avjointtalkingaudiovideo,
      title={Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling}, 
      author={Zhen Ye and Xu Tan and Aoxiong Yin and Hongzhan Lin and Guangyan Zhang and Peiwen Sun and Yiming Li and Chi-Min Chan and Wei Ye and Shikun Zhang and Wei Xue},
      year={2026},
      eprint={2604.23586},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.23586}, 
}