ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model
Paper Project Website Code
Xuangeng Chu¹, Nabarun Goswami¹, Ziteng Cui¹, Hanqin Wang¹, Tatsuya Harada¹,²
¹The University of Tokyo, ²RIKEN AIP
🔥 More results can be found on our Project Page. 🔥
Installation
Clone the project
git clone --recurse-submodules git@github.com:xg-chu/ARTalk.git
cd ARTalk
Build environment
I will prepare a new environment guide as soon as possible.
For now, please use GAGAvatar's environment.yml
and install gradio and the other required dependencies.
conda env create -f environment.yml
conda activate ARTalk
Install GAGAvatar module (if you want to use realistic avatars)
git clone --recurse-submodules git@github.com:xg-chu/diff-gaussian-rasterization.git
pip install ./diff-gaussian-rasterization
rm -rf ./diff-gaussian-rasterization
Prepare resources
Prepare resources with:
bash ./build_resources.sh
Quick Start Guide
Using Gradio Interface
We provide a simple Gradio demo to demonstrate ARTalk's capabilities:
python inference.py --run_app
Command Line Usage
ARTalk can be used via command line:
python inference.py -a your_audio_path --shape_id your_appearance --style_id your_style_motion --clip_length 750
- --shape_id can be set to mesh or to a tracked real avatar stored in tracked.pt.
- --style_id can be set to the name of a *.pt file stored in assets/style_motion.
- --clip_length sets the maximum duration of the rendered video and can be adjusted as needed; longer videos take more time to render.
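For example, a concrete invocation might look like the following (the audio path and style name here are placeholders, not files guaranteed to ship with the repo; substitute your own audio and one of the bundled style motions):

python inference.py -a ./examples/speech.wav --shape_id mesh --style_id natural_0 --clip_length 750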
Track a new real head avatar and a new style motion
The file tracked.pt is generated with GAGAvatar/inference.py. Here I've included several examples of tracked avatars for quick testing.
The style motion is tracked with the EMICA module in GAGAvatar_track. Each style motion contains 50×106-dimensional data: 50 is 2 seconds of consecutive frames, and 106 is 100 expression codes plus 6 pose codes (base + jaw). Here I've included several examples of tracked style motions.
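Below is a minimal sketch for inspecting one of the style motion files, assuming each *.pt stores a single (50, 106) tensor; the file name is a placeholder, and the bundled files may wrap the data differently, so adjust the loading accordingly:

import torch

# Placeholder file name; pick any *.pt shipped in assets/style_motion.
motion = torch.load("assets/style_motion/example_style.pt", map_location="cpu")
print(motion.shape)  # expected: (50, 106) -> 50 frames (2 s) x (100 expr + 6 pose)

# Split into the expression and pose components described above.
expression = motion[:, :100]  # 100-dim expression code per frame
pose = motion[:, 100:]        # 6-dim pose code (base + jaw) per frame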
Training
This released version modifies the VQVAE component relative to the paper version.
The training code and the paper-version code are still in preparation and are expected to be released later.
Acknowledgements
We thank Lars Traaholt Vågnes and Emmanuel Iarussi from Simli for the insightful discussions! 🤗
The ARTalk logo was designed by Caihong Ning.
Part of our work is built on FLAME. We also thank the following projects for sharing their great work:
- GAGAvatar: https://github.com/xg-chu/GAGAvatar
- GPAvatar: https://github.com/xg-chu/GPAvatar
- FLAME: https://flame.is.tue.mpg.de
- EMICA: https://github.com/radekd91/inferno
Citation
If you find our work useful in your research, please consider citing:
@misc{chu2025artalk,
title={ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model},
author={Xuangeng Chu and Nabarun Goswami and Ziteng Cui and Hanqin Wang and Tatsuya Harada},
year={2025},
eprint={2502.20323},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2502.20323},
}