Image-to-Video
shisheng7 nielsr HF staff committed on
Commit f3c5c6d
1 Parent(s): cb78612

Add link to paper, link to Github, and pipeline tag (#3)

- Add link to paper, link to Github, and pipeline tag (1d52267836c85539db84565c7fba59b11b5a901b)


Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +20 -1
README.md CHANGED
@@ -1,7 +1,26 @@
  ---
  license: mit
+ pipeline_tag: image-to-video
  ---
 
  ## Introduction
 
- We propose JoyVASA, a diffusion-based method for generating facial dynamics and head motion in audio-driven facial animation. Specifically, in the first stage, we introduce a decoupled facial representation framework that separates dynamic facial expressions from static 3D facial representations. This decoupling allows the system to generate longer videos by combining any static 3D facial representation with dynamic motion sequences. Then, in the second stage, a diffusion transformer is trained to generate motion sequences directly from audio cues, independent of character identity. Finally, a generator trained in the first stage uses the 3D facial representation and the generated motion sequences as inputs to render high-quality animations. With the decoupled facial representation and the identity-independent motion generation process, JoyVASA extends beyond human portraits to animate animal faces seamlessly. The model is trained on a hybrid dataset of private Chinese and public English data, enabling multilingual support. Experimental results validate the effectiveness of our approach. Future work will focus on improving real-time performance and refining expression control, further expanding the framework’s applications in portrait animation.
+ We propose JoyVASA, a diffusion-based method for generating facial dynamics and head motion in audio-driven facial animation. Specifically, in the first stage, we introduce a decoupled facial representation framework that separates dynamic facial expressions from static 3D facial representations. This decoupling allows the system to generate longer videos by combining any static 3D facial representation with dynamic motion sequences. Then, in the second stage, a diffusion transformer is trained to generate motion sequences directly from audio cues, independent of character identity. Finally, a generator trained in the first stage uses the 3D facial representation and the generated motion sequences as inputs to render high-quality animations. With the decoupled facial representation and the identity-independent motion generation process, JoyVASA extends beyond human portraits to animate animal faces seamlessly. The model is trained on a hybrid dataset of private Chinese and public English data, enabling multilingual support. Experimental results validate the effectiveness of our approach. Future work will focus on improving real-time performance and refining expression control, further expanding the framework’s applications in portrait animation.
+
+ ## Usage
+
+ The code can be found at https://github.com/jdh-algo/JoyVASA.
+
+ ## Citation
+
+ ```bibtex
+ @misc{cao2024joyvasaportraitanimalimage,
+ title={JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation},
+ author={Xuyang Cao and Guoxin Wang and Sheng Shi and Jun Zhao and Yang Yao and Jintao Fei and Minyu Gao},
+ year={2024},
+ eprint={2411.09209},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV},
+ url={https://arxiv.org/abs/2411.09209},
+ }
+ ```
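
The metadata change in this commit (the added `pipeline_tag: image-to-video` line in the README front matter) can be checked programmatically. The following is a minimal sketch, not part of the commit: the helper name `read_front_matter` and the `README_AFTER_COMMIT` string are hypothetical, and the parser assumes flat `key: value` pairs only, so it is not a full YAML parser.

```python
def read_front_matter(readme_text: str) -> dict[str, str]:
    """Parse flat `key: value` pairs from a leading `---` block.

    Hypothetical helper for illustration; it does not handle nested
    YAML, lists, or quoted values.
    """
    lines = readme_text.splitlines()
    # Front matter must start on the very first line.
    if not lines or lines[0].strip() != "---":
        return {}
    metadata: dict[str, str] = {}
    for line in lines[1:]:
        # The second `---` closes the front matter block.
        if line.strip() == "---":
            return metadata
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    # No closing delimiter was found, so treat the block as invalid.
    return {}


# Assumed excerpt of README.md as it looks after this commit.
README_AFTER_COMMIT = """---
license: mit
pipeline_tag: image-to-video
---

## Introduction
"""

metadata = read_front_matter(README_AFTER_COMMIT)
print(metadata)
```

Run against the excerpt above, the sketch yields the `license` and `pipeline_tag` keys; for real model cards with richer metadata, a proper YAML parser would be the safer choice.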