Summary of Summaries
- The paper introduces VASA, a framework for generating lifelike talking faces with appealing visual affective skills (VAS) from a single image and speech audio.
- Core innovations include a diffusion-based model that holistically generates facial dynamics and head movements, operating in an expressive, disentangled face latent space learned from video data.
- VASA-1 generates high-quality 512x512 videos at up to 40 FPS with low latency.
- Supports real-time generation of lifelike, emotive talking faces.
Summary of Overall Framework:
- Instead of directly generating video frames, VASA generates holistic facial dynamics and head motion in a latent space, conditioned on audio and optional control signals.
- To achieve this, the framework uses a face encoder-decoder to extract appearance and identity features and trains a Diffusion Transformer model to generate motion latent codes (see the sketch below).
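The sketch below illustrates this two-stage design under stated assumptions: the module names (`FaceEncoder`-style encoder, decoder, and a motion diffusion transformer) and tensor shapes are hypothetical placeholders, not the authors' implementation.

```python
import torch

class TalkingFacePipeline(torch.nn.Module):
    """Hypothetical wiring of the pipeline: static latents from one image,
    motion latents from a diffusion transformer, frames from a decoder."""

    def __init__(self, encoder, decoder, motion_generator):
        super().__init__()
        self.encoder = encoder                     # extracts appearance volume + identity code
        self.decoder = decoder                     # renders frames from latents
        self.motion_generator = motion_generator   # diffusion transformer over motion latents

    @torch.no_grad()
    def forward(self, source_image, audio_features, num_frames):
        # 1) Factorize the single source image into static latents.
        appearance_volume, identity_code = self.encoder(source_image)

        # 2) Generate a whole sequence of motion latents (head pose plus
        #    holistic facial dynamics) conditioned on the audio features,
        #    instead of generating video frames directly.
        motion_latents = self.motion_generator.sample(
            audio_features, num_frames=num_frames
        )  # assumed shape: (num_frames, motion_dim)

        # 3) Decode each motion latent together with the static latents
        #    into an output frame.
        frames = [
            self.decoder(appearance_volume, identity_code, m)
            for m in motion_latents
        ]
        return torch.stack(frames)  # (num_frames, 3, H, W)
```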
Technical Method Details:
Expressive and Disentangled Face Latent Space Construction:
- Based on a 3D-aided face reenactment framework
- Decomposes the face into a 3D appearance volume, identity code, head pose, and facial dynamics latents
- Uses encoders to extract these latent factors from face images.
- Applies additional losses to improve disentanglement (see the sketch after this list):
- Pairwise head pose and facial dynamics transfer loss
- Face identity similarity loss for cross-identity pose/dynamics transfer
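A minimal sketch of these two auxiliary losses, assuming an encoder that returns (appearance, identity, pose, dynamics) latents, a decoder that renders an image from them, and a frozen face-recognition embedding `id_embed`; all names are illustrative, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def pairwise_transfer_loss(encoder, decoder, frame_a, frame_b):
    """Two frames of the SAME subject: transferring frame_b's head pose and
    facial dynamics onto frame_a's appearance/identity should reproduce
    frame_b, which only works if the latent factors are well separated."""
    app_a, idn_a, _, _ = encoder(frame_a)
    _, _, pose_b, dyn_b = encoder(frame_b)
    transferred = decoder(app_a, idn_a, pose_b, dyn_b)
    return F.l1_loss(transferred, frame_b)

def cross_identity_similarity_loss(encoder, decoder, id_embed, source, driver):
    """Drive `source` (one identity) with pose/dynamics from `driver`
    (another identity): the result should keep the source's identity,
    measured with a frozen face-recognition embedding."""
    app_s, idn_s, _, _ = encoder(source)
    _, _, pose_d, dyn_d = encoder(driver)
    reenacted = decoder(app_s, idn_s, pose_d, dyn_d)
    sim = F.cosine_similarity(id_embed(reenacted), id_embed(source), dim=-1)
    return 1.0 - sim.mean()
```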
Holistic Facial Dynamics Generation with Diffusion Transformer:
- Represents all facial movements (lip, expression, gaze, etc.) as a single latent sequence
- Applies a Diffusion Transformer model to generate the facial dynamics sequence.
- The Diffusion Transformer is trained with the simplified denoising score-matching (noise-prediction) objective; a minimal sketch follows.
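The sketch below shows this simplified objective for a diffusion model over motion latent sequences: corrupt the clean sequence with Gaussian noise at a random timestep and train the transformer to predict the added noise, conditioned on audio features. The `model` interface and the DDPM-style linear beta schedule are assumptions, not necessarily the paper's exact choices.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # standard linear schedule (assumed)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def denoising_loss(model, motion_seq, audio_cond):
    """motion_seq: (batch, num_frames, motion_dim) clean latent sequence.
    `model(noisy_seq, t, audio_cond)` is assumed to predict the added noise."""
    b = motion_seq.shape[0]
    t = torch.randint(0, T, (b,), device=motion_seq.device)
    noise = torch.randn_like(motion_seq)
    a_bar = alphas_cumprod.to(motion_seq.device)[t].view(b, 1, 1)

    # Forward diffusion: corrupt the whole holistic sequence at noise level t.
    noisy_seq = a_bar.sqrt() * motion_seq + (1.0 - a_bar).sqrt() * noise

    # The transformer predicts the noise, conditioned on audio (and,
    # optionally, other control signals).
    pred_noise = model(noisy_seq, t, audio_cond)
    return F.mse_loss(pred_noise, noise)
```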