ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model
Paper Project Website Code
Xuangeng Chu¹, Nabarun Goswami¹, Ziteng Cui¹, Hanqin Wang¹, Tatsuya Harada¹,²
¹The University of Tokyo, ²RIKEN AIP
🔥 More results can be found on our Project Page. 🔥
Installation
Clone the project
git clone --recurse-submodules git@github.com:xg-chu/ARTalk.git
cd ARTalk
Build environment
I will prepare a new environment guide as soon as possible.
For now, please use GAGAvatar's environment.yml
and install gradio and the other required dependencies.
conda env create -f environment.yml
conda activate ARTalk
Install GAGAvatar module (if you want to use realistic avatars)
git clone --recurse-submodules git@github.com:xg-chu/diff-gaussian-rasterization.git
pip install ./diff-gaussian-rasterization
rm -rf ./diff-gaussian-rasterization
Prepare resources
Prepare resources with:
bash ./build_resources.sh
Quick Start Guide
Using Gradio Interface
We provide a simple Gradio demo to demonstrate ARTalk's capabilities:
python inference.py --run_app
Command Line Usage
ARTalk can be used via command line:
python inference.py -a your_audio_path --shape_id your_appearance --style_id your_style_motion --clip_length 750
- --shape_id can be set to mesh or to a tracked real avatar stored in tracked.pt.
- --style_id can be set to the name of a *.pt file stored in assets/style_motion.
- --clip_length sets the maximum duration of the rendered video and can be adjusted as needed; longer videos take more time to render.
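For example, a concrete invocation might look like the following (the audio path and style name here are placeholders, not files guaranteed to ship with the repo; substitute your own audio and one of the bundled style motions):

python inference.py -a ./examples/speech.wav --shape_id mesh --style_id natural_0 --clip_length 750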
Track a new real head avatar and a new style motion
The file tracked.pt is generated with GAGAvatar/inference.py. Here I've included several examples of tracked avatars for quick testing.
The style motion is tracked with the EMICA module in GAGAvatar_track. Each style motion contains 50×106-dimensional data: 50 is 2 seconds of consecutive frames, and 106 is 100 expression codes plus 6 pose codes (base + jaw). Here I've included several examples of tracked style motions.
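Below is a minimal sketch for inspecting one of the style motion files, assuming each *.pt stores a single (50, 106) tensor; the file name is a placeholder, and the bundled files may wrap the data differently, so adjust the loading accordingly:

import torch

# Placeholder file name; pick any *.pt shipped in assets/style_motion.
motion = torch.load("assets/style_motion/example_style.pt", map_location="cpu")
print(motion.shape)  # expected: (50, 106) -> 50 frames (2 s) x (100 expr + 6 pose)

# Split into the expression and pose components described above.
expression = motion[:, :100]  # 100-dim expression code per frame
pose = motion[:, 100:]        # 6-dim pose code (base + jaw) per frame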
Training
This released version modifies the VQVAE component relative to the paper version.
The training code and the paper-version code are still in preparation and are expected to be released later.
Acknowledgements
We thank Lars Traaholt Vågnes and Emmanuel Iarussi from Simli for the insightful discussions! 🤗
The ARTalk logo was designed by Caihong Ning.
Part of our work is built on FLAME. We also thank the following projects for sharing their great work:
- GAGAvatar: https://github.com/xg-chu/GAGAvatar
- GPAvatar: https://github.com/xg-chu/GPAvatar
- FLAME: https://flame.is.tue.mpg.de
- EMICA: https://github.com/radekd91/inferno
Citation
If you find our work useful in your research, please consider citing:
@misc{chu2025artalk,
title={ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model},
author={Xuangeng Chu and Nabarun Goswami and Ziteng Cui and Hanqin Wang and Tatsuya Harada},
year={2025},
eprint={2502.20323},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2502.20323},
}