RAG-Gesture: Retrieval-Augmented Co-Speech Gesture Generation

Pretrained weights for RAG-Gesture, a retrieval-augmented latent diffusion framework for generating semantically-aware 3D co-speech gestures on the BEAT2 (BEATX) dataset.

Paper: Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis (CVPR 2025). Code: https://github.com/m-hamza-mughal/RAG-Gesture

What's in this repo

Path	Description
`vae/`	Four part-specific Transformer VAEs (upper / hands / face / lower+translation) used as the motion encoder/decoder. Each VAE encodes a chunk of 15 frames (one second at 15 fps) into a single 512-d latent.
`diffusion/`	Trained RAG-Gesture diffusion transformer checkpoint (`base_beatx_len150fps15_finalweights/`).
`assets_deps/`	SMPL-X model files required by the dataloader and decoder for converting axis-angle poses to joints.
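
The chunking arithmetic in the `vae/` row can be checked with a short sketch. It assumes that `len150` in the checkpoint directory name means 150-frame clips; that reading, the part labels, and the helper name `latent_layout` are illustrative and not taken from the repository.

```python
FPS = 15               # frame rate stated in the table above
FRAMES_PER_CHUNK = 15  # frames folded into one latent, as stated above
LATENT_DIM = 512       # latent width stated above
PARTS = ("upper", "hands", "face", "lower_trans")  # labels are illustrative


def latent_layout(num_frames: int) -> dict:
    """Return the (num_latents, latent_dim) shape each part VAE would produce."""
    if num_frames % FRAMES_PER_CHUNK != 0:
        raise ValueError("clip length must be a multiple of the chunk size")
    num_latents = num_frames // FRAMES_PER_CHUNK
    return {part: (num_latents, LATENT_DIM) for part in PARTS}


# Assumption: a 150-frame clip, inferred from the "len150fps15" directory name.
layout = latent_layout(150)
seconds_per_latent = FRAMES_PER_CHUNK / FPS
print(layout["hands"], seconds_per_latent)
```

Under these assumptions a 10-second clip becomes 10 latents per body part, each covering one second of motion.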

How the model works

RAG-Gesture denoises a body-part-factored latent sequence conditioned on text (BERT), audio (wav2vec2), and a speaker embedding. At inference, two retrieval strategies (discourse-relation and gesture-type) pull semantically matched motion clips from a database of training samples. The retrieved latents are injected into the diffusion process through DDIM inversion and per-step insertion guidance, which bias the denoiser toward the retrieved semantics without sacrificing the base model's fluency.
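
To make the DDIM inversion step concrete, here is a minimal scalar sketch of the standard deterministic DDIM update (eta = 0) run from clean to noisy, keeping the latent at every noise level so that a later guidance step has a target per timestep. It is not the repository's implementation: the function names, the constant noise predictor `toy_eps`, and the five-level schedule are hypothetical stand-ins, and real latents are tensors rather than floats.

```python
import math


def ddim_step(x, eps, a_from, a_to):
    """Deterministic DDIM update (eta = 0) between two cumulative-alpha levels."""
    x0_pred = (x - math.sqrt(1.0 - a_from) * eps) / math.sqrt(a_from)
    return math.sqrt(a_to) * x0_pred + math.sqrt(1.0 - a_to) * eps


def invert(x0, eps_model, alphas):
    """Walk a clean latent toward noise, keeping the latent at every level."""
    trajectory = [x0]
    for i in range(len(alphas) - 1):
        x = trajectory[-1]
        trajectory.append(ddim_step(x, eps_model(x, i), alphas[i], alphas[i + 1]))
    return trajectory


def reconstruct(x_noisy, eps_model, alphas):
    """Run the same update in the opposite direction (ordinary DDIM sampling)."""
    x = x_noisy
    for i in range(len(alphas) - 1, 0, -1):
        x = ddim_step(x, eps_model(x, i), alphas[i], alphas[i - 1])
    return x


# Hypothetical stand-ins: a constant noise predictor and a 5-level schedule.
toy_eps = lambda x, i: 0.3
alphas = [0.999, 0.9, 0.7, 0.4, 0.1]  # cumulative alphas, clean -> noisy

trajectory = invert(1.5, toy_eps, alphas)
restored = reconstruct(trajectory[-1], toy_eps, alphas)
print(len(trajectory), abs(restored - 1.5) < 1e-9)
```

With a constant predictor the round trip is exact up to floating-point error. With a learned denoiser the prediction changes with the input, so inversion is only approximately reversible.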

Key design choices:

4-VAE body-part tokenization (upper / hands / face / lower+translation) with a [sep] token between parts, all concatenated along the time axis.
Multi-conditional classifier-free guidance (CFG) that mixes text-only and unconditional predictions with a coarse-to-fine scaling schedule.
Insertion guidance that runs a small inner gradient loop on the current x_t against the per-timestep inverted retrieval latent.
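
The last two design choices can be illustrated with a toy sketch on plain Python lists. Everything specific here is an assumption made for illustration: the way the three noise predictions are composed, the squared-error objective, the binary `mask` marking where the retrieved clip applies, and the `iters` and `lr` values. The repository's actual loss, weights, and schedule may differ.

```python
def mix_cfg(eps_full, eps_text, eps_uncond, w_text, w_full):
    """Compose three noise predictions with two guidance weights.

    Assumed form: start from the unconditional prediction, move toward the
    text-only one, then toward the fully conditioned one.
    """
    return [u + w_text * (t - u) + w_full * (f - t)
            for f, t, u in zip(eps_full, eps_text, eps_uncond)]


def insertion_guidance(x_t, retrieved_t, mask, iters=5, lr=0.1):
    """Inner gradient loop pulling x_t toward the inverted retrieval latent.

    Minimizes sum(mask * (x - r)^2) by plain gradient descent; positions with
    mask 0 are left untouched.
    """
    x = list(x_t)
    for _ in range(iters):
        grad = [2.0 * m * (xi - ri) for xi, ri, m in zip(x, retrieved_t, mask)]
        x = [xi - lr * g for xi, g in zip(x, grad)]
    return x


# Both weights at 1.0 recover the fully conditioned prediction.
mixed = mix_cfg([1.0], [0.5], [0.0], w_text=1.0, w_full=1.0)

# Only the first and last positions are covered by the retrieved clip.
guided = insertion_guidance([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [1.0, 0.0, 1.0])
print(mixed, guided)
```

In this toy setup each inner step closes 20% of the remaining gap at masked positions, so a handful of iterations nudges the sample toward the retrieved latent without overwriting it.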

How to use

# 1. Clone the code
git clone https://github.com/m-hamza-mughal/RAG-Gesture
cd RAG-Gesture
pip install -r requirements.txt

# 2. Pull these weights + SMPL-X deps
python tools/download_weights.py

# 3. Pull the BEAT2 dataset + RAG-Gesture annotations
python tools/download_annotations.py

# 4. Run RAG-guided inference (discourse retrieval)
PYTHONPATH=".":$PYTHONPATH python tools/visualize.py \
    experiments/diffusion/base_beatx_len150fps15_finalweights/basegesture_len150_beat.py \
    experiments/diffusion/base_beatx_len150fps15_finalweights/epoch_64.pth \
    --retrieval_method discourse_guidance_test \
    --use_retrieval --use_inversion --use_insertion_guidance \
    --guidance_iters decreasing_till_25

See the GitHub README for training, evaluation, long-form synthesis, and ablation commands.

Citation

@InProceedings{mughal2024raggesture,
    title     = {Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis},
    author    = {M. Hamza Mughal and Rishabh Dabral and Merel C. J. Scholman and Vera Demberg and Christian Theobalt},
    booktitle = {Computer Vision and Pattern Recognition (CVPR)},
    year      = {2025}
}

Acknowledgements

Built on EMAGE / PantoMatrix and ReMoDiffuse.
