ChatAnyone: Stylized Real-time Portrait Video Generation with Hierarchical Motion Diffusion Model
Abstract
Real-time interactive video-chat portraits are increasingly recognized as a future trend, driven by the remarkable progress in text and voice chat technologies. However, existing methods focus primarily on real-time generation of head movements and struggle to produce body motions synchronized with these head actions. Moreover, fine-grained control over speaking style and the nuances of facial expressions remains a challenge. To address these limitations, we introduce a novel framework for stylized real-time portrait video generation, enabling expressive and flexible video chat that extends from talking heads to upper-body interaction. Our approach consists of two stages. The first stage uses efficient hierarchical motion diffusion models that, conditioned on audio input, account for both explicit and implicit motion representations, generating a diverse range of facial expressions with stylistic control and synchronization between head and body movements. The second stage generates portrait video featuring upper-body movements, including hand gestures. We inject explicit hand control signals into the generator to produce more detailed hand movements, and further apply face refinement to enhance the overall realism and expressiveness of the portrait video. Our approach supports efficient, continuous generation of upper-body portrait video at up to 512×768 resolution and 30 fps on a 4090 GPU, enabling real-time interactive video chat. Experimental results demonstrate that our approach produces portrait videos with rich expressiveness and natural upper-body movements.
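To make the two-stage pipeline concrete, here is a minimal, hypothetical PyTorch sketch: stage 1 denoises audio- and style-conditioned motion latents, and stage 2 renders a frame from those latents plus an explicit hand-control signal. No official code has been released, so every class name, dimension, and the simplistic denoising loop below are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn


class HierarchicalMotionDiffusion(nn.Module):
    """Stage 1 sketch: audio features -> motion latents via iterative denoising."""

    def __init__(self, audio_dim=128, motion_dim=64, style_dim=16):
        super().__init__()
        self.motion_dim = motion_dim
        # Toy denoiser MLP; the paper's hierarchical architecture is not public.
        self.denoiser = nn.Sequential(
            nn.Linear(audio_dim + motion_dim + style_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, motion_dim),
        )

    def forward(self, audio_feat, style, steps=10):
        # Start from Gaussian noise; denoise conditioned on audio and a style
        # vector (the style vector stands in for the paper's stylistic control).
        b = audio_feat.shape[0]
        motion = torch.randn(b, self.motion_dim)
        for t in range(steps, 0, -1):
            t_embed = torch.full((b, 1), t / steps)
            cond = torch.cat([audio_feat, motion, style, t_embed], dim=-1)
            motion = motion - self.denoiser(cond) / steps
        return motion  # combined explicit + implicit motion latents (one frame)


class UpperBodyGenerator(nn.Module):
    """Stage 2 sketch: motion latents + explicit hand signal -> RGB frame."""

    def __init__(self, motion_dim=64, hand_dim=32, size=32):
        super().__init__()
        self.size = size
        # Toy "renderer": the real generator synthesizes from a reference
        # portrait image; a linear map to a tiny frame keeps this runnable.
        self.to_frame = nn.Linear(motion_dim + hand_dim, 3 * size * size)

    def forward(self, motion, hand_signal):
        x = torch.cat([motion, hand_signal], dim=-1)
        frame = self.to_frame(x).view(-1, 3, self.size, self.size)
        return torch.tanh(frame)  # stand-in for a 512x768 portrait frame


if __name__ == "__main__":
    stage1 = HierarchicalMotionDiffusion()
    stage2 = UpperBodyGenerator()
    audio = torch.randn(1, 128)  # per-frame audio features (placeholder)
    style = torch.randn(1, 16)   # style embedding controlling expressiveness
    hands = torch.randn(1, 32)   # explicit hand-control signal (placeholder)
    frame = stage2(stage1(audio, style), hands)
    print(frame.shape)           # torch.Size([1, 3, 32, 32])
```

In the actual system, stage 2 would also include the paper's dedicated face-refinement step and run per video frame; the sketch omits both purely to stay self-contained.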
Community
I really like the concept and the idea, but there's no source code to even test this out.
I'd be interested to see how they handle occlusions and different nationalities.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation (2025)
- Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait (2025)
- AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers (2025)
- HunyuanPortrait: Implicit Condition Control for Enhanced Portrait Animation (2025)
- MVPortrait: Text-Guided Motion and Emotion Control for Multi-view Vivid Portrait Animation (2025)
- FLAP: Fully-controllable Audio-driven Portrait Video Generation through 3D head conditioned diffusion model (2025)
- PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation (2025)