Jaward posted an update Apr 18
Let's break down the technical details of Microsoft's mind-blowing lifelike audio-driven talking-faces framework, VASA, and its model VASA-1:

Summary of Summaries
- The paper introduces VASA, a framework for generating lifelike talking faces with appealing visual affective skills (VAS) from a single image and speech audio.
- Core innovation: a diffusion-based model that holistically generates facial dynamics and head movements in an expressive, disentangled face latent space learned from video data.
- VASA-1 generates high-quality 512x512 videos at up to 40 FPS with low latency.
- Supports real-time generation of lifelike, emotive talking faces.

Summary of Overall Framework:
- Instead of directly generating video frames, VASA generates holistic facial dynamics and head motion in a latent space, conditioned on audio and optional control signals.
- To achieve this, the framework uses a face encoder-decoder to extract appearance and identity features, and trains a Diffusion Transformer model to generate motion latent codes.
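
For intuition, here is a minimal, hypothetical PyTorch sketch of that two-stage idea (module names, shapes, and the toy 64x64 resolution are my own stand-ins, not the paper's code): encode the source image once, generate a motion-latent sequence from the audio, then decode frames from (appearance, motion) pairs.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins (names and shapes are mine, not the paper's):
# a face encoder that factors appearance/identity out of one image, a motion
# generator conditioned on audio, and a decoder that renders each frame.

class FaceEncoder(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )

    def forward(self, image):            # image: (B, 3, H, W)
        return self.backbone(image)      # (B, feat_dim) appearance/identity

class MotionGenerator(nn.Module):
    """Placeholder for the Diffusion Transformer; the real model denoises
    iteratively, conditioned on audio features and optional signals."""
    def __init__(self, audio_dim=128, motion_dim=64):
        super().__init__()
        self.proj = nn.Linear(audio_dim, motion_dim)

    def forward(self, audio):            # audio: (B, T, audio_dim)
        return self.proj(audio)          # (B, T, motion_dim) motion latents

class FrameDecoder(nn.Module):
    def __init__(self, feat_dim=256, motion_dim=64):
        super().__init__()
        self.to_frame = nn.Linear(feat_dim + motion_dim, 3 * 64 * 64)

    def forward(self, appearance, motion_t):
        x = torch.cat([appearance, motion_t], dim=-1)
        return self.to_frame(x).view(-1, 3, 64, 64)

# One source image + one audio clip -> a sequence of frames.
enc, gen, dec = FaceEncoder(), MotionGenerator(), FrameDecoder()
image = torch.randn(1, 3, 64, 64)        # the single input photo
audio = torch.randn(1, 25, 128)          # 25 steps of audio features
appearance = enc(image)                  # extracted once, reused per frame
motion = gen(audio)                      # holistic motion-latent sequence
frames = [dec(appearance, motion[:, t]) for t in range(motion.size(1))]
```

The design choice worth noticing is that appearance is extracted once and held fixed, so all the audio-driven variation lives in the compact motion latents rather than in pixel space.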

Technical Method Details:
Expressive and Disentangled Face Latent Space Construction:
- Based on the 3D-aid face reenactment framework
- Decomposes the face into a 3D appearance volume, identity code, head pose, and facial dynamics latents
- Uses encoders to extract these latent factors from face images.
- Applies additional losses to improve disentanglement (sketched below):
  - Pairwise head pose and facial dynamics transfer loss
  - Face identity similarity loss for cross-identity pose/dynamics transfer
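
A hedged sketch of what those two losses could look like, assuming hypothetical `encode`/`decode` callables that split a frame into (appearance, identity, pose, dynamics) latents and reassemble it; neither is the paper's actual API:

```python
import torch.nn.functional as F

# Illustrative loss sketches; `encode` returns (appearance, identity, pose,
# dynamics) latents and `decode` reassembles a frame from them.

def pairwise_transfer_loss(encode, decode, frame_a, frame_b):
    """Frames A and B show the SAME subject. Keep A's appearance/identity,
    take B's pose and dynamics: the result should reconstruct frame B."""
    app_a, id_a, _, _ = encode(frame_a)
    _, _, pose_b, dyn_b = encode(frame_b)
    recon = decode(app_a, id_a, pose_b, dyn_b)
    return F.l1_loss(recon, frame_b)

def identity_similarity_loss(encode, decode, id_embed, frame_a, frame_b):
    """Frames A and B show DIFFERENT subjects. After transferring B's
    pose/dynamics onto A, a (frozen) face-recognition embedder `id_embed`
    should still judge the output to be the same person as A."""
    app_a, id_a, _, _ = encode(frame_a)
    _, _, pose_b, dyn_b = encode(frame_b)
    recon = decode(app_a, id_a, pose_b, dyn_b)
    return 1 - F.cosine_similarity(id_embed(recon), id_embed(frame_a)).mean()
```

The first loss forces pose/dynamics to carry all the motion information; the second forces appearance/identity to survive any motion transfer, which is what "disentangled" means in practice here.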

Holistic Facial Dynamics Generation with Diffusion Transformer:
- Represents all facial movements (lips, expression, gaze, etc.) as a single latent sequence
- Applies a Diffusion Transformer model to generate the facial dynamics sequence.
- The Diffusion Transformer is trained with a simplified denoising score-matching objective.
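
That simplified objective is, in spirit, the standard DDPM-style noise-prediction loss; here is a short sketch under that assumption (the paper's exact parameterization and conditioning may differ, and all shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

# Simplified denoising objective in the standard DDPM style: noise a clean
# motion-latent sequence at a random timestep, then train the model to
# predict the added noise with an MSE loss.

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(model, x0, cond):
    """x0: clean motion-latent sequence (B, L, D); cond: audio features."""
    B = x0.size(0)
    t = torch.randint(0, T, (B,))
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t].view(B, 1, 1)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise  # forward noising step
    pred = model(x_t, t, cond)                      # predict the noise
    return F.mse_loss(pred, noise)
```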

The magic: a training pipeline that can “extract facial dynamics and head movements from real-life talking face videos.”

Oh, 🔥

🚀 Added to the Avatars Collection 🎭: https://huggingface.co/collections/DmitryRyumin/avatars-65df37cdf81fec13d4dbac36

·

So glad to find this...
Also, your "Big Five Personality Traits" collection is nice. I've been interested in MBTI for a long time.

What's the closest code to this? Will MS release the code?

·

Closest is SadTalker: https://github.com/OpenTalker/SadTalker
Its holistic facial dynamics generation is limited to lip sync, head movement, and eye blinks.

I don't think MS will release the VASA code; they will probably commercialize it.

This is really straddling the gap between "I want it local" and "I don't want anyone to have it local".