Talker-T2AV
Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling
Paper (arXiv 2604.23586) · Code (GitHub) · Samples
This repository hosts the pretrained weights for the paper "Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling".
Contents
talker-t2av/
model.safetensors ← AR backbone (Qwen3-0.6B) + dual diffusion heads
+ Patch Transformer Encoder + Stop Predictor
config.json
chat_template.jinja
tokenizer.json
tokenizer_config.json
whisperx-vae/
model.ckpt ← WhisperX-VAE audio autoencoder
(32-d, 25 Hz; Whisper-Large-v3 encoder + DAC backbone)
For the LIA-X video motion autoencoder (40-d motion, 25 Hz), the model
code is vendored under lia_x/ in the GitHub repo, so only the
lia-x.pt weight file needs to be fetched separately from
wyhsirius/LIA-X. Similarly, the code for the fine-tuned WavLM-Large
speaker encoder ships under speaker_verification/, so only the
wavlm_large_finetune.pth weights need to be obtained from
Microsoft UniSpeech.
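To see which components are bundled in model.safetensors without loading any weights, the file header can be read directly. The sketch below uses only the Python standard library and relies on the safetensors layout (an 8-byte little-endian header length followed by a JSON header). The helper names read_safetensors_header and group_by_prefix are hypothetical, and the actual parameter-name prefixes in this checkpoint are not documented here, so treat the grouping as exploratory.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict.

    The file starts with an unsigned 64-bit little-endian integer giving
    the header size in bytes, followed by that many bytes of JSON.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def group_by_prefix(header):
    """Count tensors per top-level name prefix (text before the first dot)."""
    counts = {}
    for name in header:
        prefix = name.split(".", 1)[0]
        counts[prefix] = counts.get(prefix, 0) + 1
    return counts
```

Calling group_by_prefix(read_safetensors_header("hf_weights/talker-t2av/model.safetensors")) after downloading the weights should give a rough view of how the parameters are split across modules; the path assumes the --local-dir used in the Quickstart.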
Quickstart
git clone https://github.com/zhenye234/Talker-T2AV.git
cd Talker-T2AV
# put the HF-hosted weights in place
huggingface-cli download HKUSTAudio/Talker-T2AV --local-dir ./hf_weights
export CHECKPOINT_DIR="$(pwd)/hf_weights/talker-t2av"
export WHISPERVAE_CKPT="$(pwd)/hf_weights/whisperx-vae/model.ckpt"
# the two extra weight files (code already vendored — no need to clone the repos)
export LIAX_CKPT=/path/to/lia-x.pt
export WAVLM_CKPT=/path/to/wavlm_large_finetune.pth
python infer.py
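
Before running infer.py, it can help to confirm that the four variables exported above point at real files. The following is a minimal sketch, assuming infer.py reads these variables as the Quickstart suggests; the helper missing_weights is hypothetical and not part of the repository.

```python
import os
from pathlib import Path

# Variable names come from the Quickstart; the expected kind of each path
# (directory vs. file) is inferred from the example values there.
REQUIRED_VARS = {
    "CHECKPOINT_DIR": "dir",
    "WHISPERVAE_CKPT": "file",
    "LIAX_CKPT": "file",
    "WAVLM_CKPT": "file",
}


def missing_weights(env=None):
    """Return a list of human-readable problems; an empty list means all paths exist."""
    env = os.environ if env is None else env
    problems = []
    for name, kind in REQUIRED_VARS.items():
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
            continue
        path = Path(value)
        exists = path.is_dir() if kind == "dir" else path.is_file()
        if not exists:
            problems.append(f"{name} does not point to an existing {kind}: {value}")
    return problems


if __name__ == "__main__":
    for problem in missing_weights():
        print(problem)
```

Run in the same shell session as the export commands, the script prints nothing when all four paths resolve.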
See the GitHub README for full installation and reproduction instructions.
Citation
@misc{ye2026talkert2avjointtalkingaudiovideo,
title={Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling},
author={Zhen Ye and Xu Tan and Aoxiong Yin and Hongzhan Lin and Guangyan Zhang and Peiwen Sun and Yiming Li and Chi-Min Chan and Wei Ye and Shikun Zhang and Wei Xue},
year={2026},
eprint={2604.23586},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2604.23586},
}