SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation
Yu Guo 1 · Ying Shan 2 · Fei Wang 1
CVPR 2023
TL;DR: A method for generating realistic and stylized talking-head videos from a single image and an audio clip.
Changelog
- 2023.03.22: Launch new feature: generating the 3D face animation from a single image. New applications based on it will be updated.
- 2023.03.22: Launch new feature: still mode, where only a small head pose will be produced via `python inference.py --still`.
- 2023.03.18: Support expression intensity; you can now change the intensity of the generated motion via `python inference.py --expression_scale 1.3` (some value > 1).
- 2023.03.18: Reorganize the data folders; you can now download the checkpoints automatically using `bash scripts/download_models.sh`.
- 2023.03.18: We have officially integrated GFPGAN for face enhancement; use `python inference.py --enhancer gfpgan` for better visualization performance.
- 2023.03.14: Specify the version of the `joblib` package to remove the errors when using `librosa`.

Previous Changelogs
- 2023.03.06 Solve some bugs in code and errors in installation
- 2023.03.03 Release the test code for audio-driven single image animation!
- 2023.02.28 SadTalker has been accepted by CVPR 2023!
Pipeline
TODO
- Generating 2D face from a single image.
- Generating 3D face from audio.
- Generating 4D free-view talking examples from audio and a single image.
- Gradio/Colab demo.
- Full body/image generation.
- Training code for each component.
- Audio-driven anime avatar.
- Integrate ChatGPT for a conversation demo.
- Integrate with stable-diffusion-webui (stay tuned!).
https://user-images.githubusercontent.com/4397546/222513483-89161f58-83d0-40e4-8e41-96c32b47bd4e.mp4
Inference Demo!
Dependency Installation
git clone https://github.com/Winfredy/SadTalker.git
cd SadTalker
conda create -n sadtalker python=3.8
source activate sadtalker
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
conda install ffmpeg
pip install dlib-bin  # dlib-bin installs much faster than building dlib from source; alternatively: conda install dlib
pip install -r requirements.txt
### install GFPGAN for the enhancer
pip install git+https://github.com/TencentARC/GFPGAN
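Optionally, as a quick sanity check (not part of the original instructions), you can confirm that the CUDA build of PyTorch and ffmpeg installed correctly:

# optional sanity check: prints the torch version, True if CUDA is available, then the ffmpeg banner
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
ffmpeg -version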
Trained Models
You can run the following script to put all the models in the right place.
bash scripts/download_models.sh
Or download our pre-trained models from Google Drive or our GitHub release page, and then put them in `./checkpoints`.
| Model | Description |
|---|---|
| checkpoints/auido2exp_00300-model.pth | Pre-trained ExpNet in SadTalker. |
| checkpoints/auido2pose_00140-model.pth | Pre-trained PoseVAE in SadTalker. |
| checkpoints/mapping_00229-model.pth.tar | Pre-trained MappingNet in SadTalker. |
| checkpoints/facevid2vid_00189-model.pth.tar | Pre-trained face-vid2vid model from an unofficial reproduction of face-vid2vid. |
| checkpoints/epoch_20.pth | Pre-trained 3DMM extractor in Deep3DFaceReconstruction. |
| checkpoints/wav2lip.pth | Highly accurate lip-sync model in Wav2Lip. |
| checkpoints/shape_predictor_68_face_landmarks.dat | Face landmark model used with dlib. |
| checkpoints/BFM | 3DMM library files. |
| checkpoints/hub | Face detection models used in face alignment. |
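After downloading, you can quickly confirm that the files landed in the expected place (a minimal sketch; the exact listing may differ slightly):

# list the downloaded checkpoints; the names should match the table above
ls -lh checkpoints/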
Generating 2D face from a single image
python inference.py --driven_audio <audio.wav> \
--source_image <video.mp4 or picture.png> \
--batch_size <default is 2, a larger batch size runs faster> \
--expression_scale <default is 1.0, a larger value makes the motion stronger> \
--result_dir <folder in which to store the results> \
--enhancer <default is None, you can choose gfpgan or RestoreFormer>
| basic | w/ still mode | w/ exp_scale 1.3 | w/ gfpgan |
|---|---|---|---|
Note: please turn the audio on manually; videos embedded on GitHub are muted by default.
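For reference, the variants compared above can be combined in a single call; a minimal sketch with placeholder paths (adjust to your own inputs):

# sketch: still mode + stronger expressions + GFPGAN enhancement (paths are placeholders)
python inference.py --driven_audio path/to/audio.wav \
--source_image path/to/portrait.png \
--still \
--expression_scale 1.3 \
--enhancer gfpgan \
--result_dir ./results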
Generating 3D face from Audio
| Input | Animated 3D face |
|---|---|
Note: please turn the audio on manually; videos embedded on GitHub are muted by default.
More details on how to generate the 3D face can be found here.
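As a rough sketch of what the call might look like (the `--face3dvis` flag is an assumption on our part; please verify it against the details linked above):

# sketch: render the audio-driven 3D face; --face3dvis is assumed, check the linked details
python inference.py --driven_audio path/to/audio.wav \
--source_image path/to/portrait.png \
--result_dir ./results \
--face3dvis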
Generating 4D free-view talking examples from audio and a single image
We use `camera_yaw`, `camera_pitch`, and `camera_roll` to control the camera pose. For example, `--camera_yaw -20 30 10` means the camera yaw degree changes from -20 to 30, and then from 30 to 10.
python inference.py --driven_audio <audio.wav> \
--source_image <video.mp4 or picture.png> \
--result_dir <folder in which to store the results> \
--camera_yaw -20 30 10
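Pitch and roll can be swept in the same way; a sketch with illustrative values, assuming `--camera_pitch` and `--camera_roll` accept the same three-value format as `--camera_yaw`:

# sketch: free-view generation sweeping yaw, pitch and roll (values are illustrative)
python inference.py --driven_audio path/to/audio.wav \
--source_image path/to/portrait.png \
--result_dir ./results \
--camera_yaw -20 30 10 \
--camera_pitch -10 10 0 \
--camera_roll -5 5 0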
Citation
If you find our work useful in your research, please consider citing:
@article{zhang2022sadtalker,
title={SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation},
author={Zhang, Wenxuan and Cun, Xiaodong and Wang, Xuan and Zhang, Yong and Shen, Xi and Guo, Yu and Shan, Ying and Wang, Fei},
journal={arXiv preprint arXiv:2211.12194},
year={2022}
}
Acknowledgements
The Facerender code borrows heavily from zhanglonghao's reproduction of face-vid2vid and from PIRender. We thank the authors for sharing their wonderful code. In the training process, we also use models from Deep3DFaceReconstruction and Wav2Lip, and we thank them for their wonderful work.
Related Works
- StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN (ECCV 2022)
- CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior (CVPR 2023)
- VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild (SIGGRAPH Asia 2022)
- DPE: Disentanglement of Pose and Expression for General Video Portrait Editing (CVPR 2023)
- 3D GAN Inversion with Facial Symmetry Prior (CVPR 2023)
- T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations (CVPR 2023)
Disclaimer
This is not an official product of Tencent. This repository can only be used for personal/research/non-commercial purposes.