Jaward posted an update Apr 18
Let's break down the technical details of Microsoft's mind-blowing lifelike audio-driven talking-faces framework, VASA, and its model VASA-1:

Summary of Summaries
- The paper introduces VASA, a framework for generating lifelike talking faces with appealing visual affective skills (VAS) from a single image and speech audio.
- Core innovation: a diffusion-based model that holistically generates facial dynamics and head movements in an expressive, disentangled face latent space learned from video data.
- VASA-1 generates high-quality 512x512 videos at up to 40 FPS with low latency.
- Supports real-time generation of lifelike, emotive talking faces.

Summary of Overall Framework:
- Instead of directly generating video frames, VASA generates holistic facial dynamics and head motion in a latent space, conditioned on audio and optional control signals.
- To achieve this, the framework uses a face encoder-decoder to extract appearance and identity features, and trains a Diffusion Transformer model to generate motion latent codes.
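
For intuition, here is a minimal, hypothetical PyTorch sketch of that two-stage idea (module names, shapes, and the toy 64x64 resolution are my own stand-ins, not the paper's code): encode the source image once, generate a motion-latent sequence from the audio, then decode frames from (appearance, motion) pairs.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins (names and shapes are mine, not the paper's):
# a face encoder that factors appearance/identity out of one image, a motion
# generator conditioned on audio, and a decoder that renders each frame.

class FaceEncoder(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )

    def forward(self, image):            # image: (B, 3, H, W)
        return self.backbone(image)      # (B, feat_dim) appearance/identity

class MotionGenerator(nn.Module):
    """Placeholder for the Diffusion Transformer; the real model denoises
    iteratively, conditioned on audio features and optional signals."""
    def __init__(self, audio_dim=128, motion_dim=64):
        super().__init__()
        self.proj = nn.Linear(audio_dim, motion_dim)

    def forward(self, audio):            # audio: (B, T, audio_dim)
        return self.proj(audio)          # (B, T, motion_dim) motion latents

class FrameDecoder(nn.Module):
    def __init__(self, feat_dim=256, motion_dim=64):
        super().__init__()
        self.to_frame = nn.Linear(feat_dim + motion_dim, 3 * 64 * 64)

    def forward(self, appearance, motion_t):
        x = torch.cat([appearance, motion_t], dim=-1)
        return self.to_frame(x).view(-1, 3, 64, 64)

# One source image + one audio clip -> a sequence of frames.
enc, gen, dec = FaceEncoder(), MotionGenerator(), FrameDecoder()
image = torch.randn(1, 3, 64, 64)        # the single input photo
audio = torch.randn(1, 25, 128)          # 25 steps of audio features
appearance = enc(image)                  # extracted once, reused per frame
motion = gen(audio)                      # holistic motion-latent sequence
frames = [dec(appearance, motion[:, t]) for t in range(motion.size(1))]
```

The design choice worth noticing is that appearance is extracted once and held fixed, so all the audio-driven variation lives in the compact motion latents rather than in pixel space.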

Technical Method Details:
Expressive and Disentangled Face Latent Space Construction:
- Based on the 3D-aid face reenactment framework
- Decomposes the face into a 3D appearance volume, identity code, head pose, and facial dynamics latents
- Uses encoders to extract these latent factors from face images.
- Applies additional losses to improve disentanglement (sketched below):
  - Pairwise head pose and facial dynamics transfer loss
  - Face identity similarity loss for cross-identity pose/dynamics transfer
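
A hedged sketch of what those two losses could look like, assuming hypothetical `encode`/`decode` callables that split a frame into (appearance, identity, pose, dynamics) latents and reassemble it; neither is the paper's actual API:

```python
import torch.nn.functional as F

# Illustrative loss sketches; `encode` returns (appearance, identity, pose,
# dynamics) latents and `decode` reassembles a frame from them.

def pairwise_transfer_loss(encode, decode, frame_a, frame_b):
    """Frames A and B show the SAME subject. Keep A's appearance/identity,
    take B's pose and dynamics: the result should reconstruct frame B."""
    app_a, id_a, _, _ = encode(frame_a)
    _, _, pose_b, dyn_b = encode(frame_b)
    recon = decode(app_a, id_a, pose_b, dyn_b)
    return F.l1_loss(recon, frame_b)

def identity_similarity_loss(encode, decode, id_embed, frame_a, frame_b):
    """Frames A and B show DIFFERENT subjects. After transferring B's
    pose/dynamics onto A, a (frozen) face-recognition embedder `id_embed`
    should still judge the output to be the same person as A."""
    app_a, id_a, _, _ = encode(frame_a)
    _, _, pose_b, dyn_b = encode(frame_b)
    recon = decode(app_a, id_a, pose_b, dyn_b)
    return 1 - F.cosine_similarity(id_embed(recon), id_embed(frame_a)).mean()
```

The first loss forces pose/dynamics to carry all the motion information; the second forces appearance/identity to survive any motion transfer, which is what "disentangled" means in practice here.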

Holistic Facial Dynamics Generation with Diffusion Transformer:
- Represents all facial movements (lips, expression, gaze, etc.) as a single latent sequence
- Applies a Diffusion Transformer model to generate the facial dynamics sequence.
- The Diffusion Transformer is trained with a simplified denoising score-matching objective.
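
That simplified objective is, in spirit, the standard DDPM-style noise-prediction loss; here is a short sketch under that assumption (the paper's exact parameterization and conditioning may differ, and all shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

# Simplified denoising objective in the standard DDPM style: noise a clean
# motion-latent sequence at a random timestep, then train the model to
# predict the added noise with an MSE loss.

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(model, x0, cond):
    """x0: clean motion-latent sequence (B, L, D); cond: audio features."""
    B = x0.size(0)
    t = torch.randint(0, T, (B,))
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t].view(B, 1, 1)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise  # forward noising step
    pred = model(x_t, t, cond)                      # predict the noise
    return F.mse_loss(pred, noise)
```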

The magic: a training pipeline that can “extract facial dynamics and head movements from real-life talking face videos.”

Oh, 🔥

🚀 Added to the Avatars Collection 🎭: https://huggingface.co/collections/DmitryRyumin/avatars-65df37cdf81fec13d4dbac36

·

So glad to find this...
Also, your "Big Five Personality Traits" collection is nice. I've been interested in MBTI for a long time.

What's the closest code to this? Will MS release the code?

·

Closest is SadTalker: https://github.com/OpenTalker/SadTalker
Its holistic facial dynamics generation is limited to lip sync, head movement, and eye blinks.

I don't think MS will release the VASA code; they will probably commercialize it.

This is really straddling the gap between "I want it local" and "I don't want anyone to have it local".