RAG-Gesture: Retrieval-Augmented Co-Speech Gesture Generation
Pretrained weights for RAG-Gesture, a retrieval-augmented latent diffusion framework for generating semantically-aware 3D co-speech gestures on the BEAT2 (BEATX) dataset.
Paper: Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis (CVPR 2025). Code: https://github.com/m-hamza-mughal/RAG-Gesture
What's in this repo
| Path | Description |
|---|---|
| `vae/` | Four part-specific Transformer VAEs (upper / hands / face / lower+translation) used as the motion encoder/decoder. Each chunks 15 frames at 15 fps into one 512-d latent. |
| `diffusion/` | Trained RAG-Gesture diffusion transformer checkpoint (`base_beatx_len150fps15_finalweights/`). |
| `assets_deps/` | SMPL-X model files required by the dataloader and decoder for converting axis-angle poses to joints. |
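
As a rough sanity check on the chunking described in the table, the snippet below counts how many latents one body part contributes for a clip. The 150-frame clip length is an assumption inferred from the checkpoint directory name (`len150fps15`), and `latents_per_part` is a hypothetical helper, not a function from the repository.

```python
FRAMES_PER_LATENT = 15  # from the table: 15 frames map to one latent
FPS = 15                # from the table
LATENT_DIM = 512        # from the table


def latents_per_part(num_frames: int) -> int:
    """Count whole 15-frame chunks; leftover frames are ignored (an assumption)."""
    return num_frames // FRAMES_PER_LATENT


clip_frames = 150  # assumed from "len150" in the checkpoint directory name
num_latents = latents_per_part(clip_frames)
clip_seconds = clip_frames / FPS

print(f"{clip_seconds:.0f} s clip -> {num_latents} latents per part, each {LATENT_DIM}-d")
```

At 15 frames per latent and 15 fps, each latent covers one second of motion.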
How the model works
RAG-Gesture denoises a body-part-factored latent sequence conditioned on text (BERT), audio (wav2vec2), and a speaker embedding. At inference, retrieval strategies (discourse-relation and gesture-type) pull semantically-matched motion clips from a database of training samples. The retrieved latents are injected into the diffusion process through DDIM inversion + per-step insertion guidance, biasing the denoiser toward the retrieved semantics without sacrificing the base model's fluency.
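
To make the inversion step concrete, here is a minimal scalar sketch of deterministic DDIM sampling and its inversion. It is not the repository's implementation: `toy_eps`, the linear `ALPHA_BAR` schedule, and the step count are illustrative assumptions, and the real model predicts noise with the conditioned diffusion transformer over full latent sequences.

```python
import math
from typing import List

NUM_STEPS = 50  # illustrative, not the checkpoint's setting
# Assumed cumulative signal schedule: index 0 is clean, index NUM_STEPS is mostly noise.
ALPHA_BAR = [1.0 - 0.98 * t / NUM_STEPS for t in range(NUM_STEPS + 1)]


def toy_eps(x: float, t: int) -> float:
    """Stand-in noise predictor. It is constant, so the round trip below is exact."""
    return 0.3


def ddim_move(x: float, t_from: int, t_to: int) -> float:
    """Deterministic DDIM update (eta = 0) from timestep t_from to t_to.

    t_to < t_from is a denoising step; t_to > t_from is an inversion step.
    """
    eps = toy_eps(x, t_from)
    a_from, a_to = ALPHA_BAR[t_from], ALPHA_BAR[t_to]
    x0_pred = (x - math.sqrt(1.0 - a_from) * eps) / math.sqrt(a_from)
    return math.sqrt(a_to) * x0_pred + math.sqrt(1.0 - a_to) * eps


def invert(x0: float) -> List[float]:
    """Run DDIM inversion, keeping the latent at every timestep."""
    traj = [x0]
    for t in range(NUM_STEPS):
        traj.append(ddim_move(traj[-1], t, t + 1))
    return traj  # traj[t] is the inverted latent at timestep t


def sample(x_T: float) -> float:
    """Run deterministic DDIM sampling from timestep NUM_STEPS down to 0."""
    x = x_T
    for t in range(NUM_STEPS, 0, -1):
        x = ddim_move(x, t, t - 1)
    return x


retrieved_latent = 0.7  # stands in for one value of a retrieved motion latent
trajectory = invert(retrieved_latent)
recovered = sample(trajectory[-1])
print(abs(recovered - retrieved_latent) < 1e-9)
```

With a learned noise predictor the round trip is only approximate, because inversion evaluates the network at the current timestep rather than the next one. The stored per-timestep trajectory is what makes a per-step comparison against the retrieved clip possible.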
Key design choices:
- 4-VAE body-part tokenization (upper / hands / face / lower+translation) with a `[sep]` token between parts, all concatenated along the time axis.
- Multi-conditional CFG mixing text-only and unconditional predictions with a coarse-to-fine scaling schedule.
- Insertion guidance that runs a small inner gradient loop on the current `x_t` against the per-timestep inverted retrieval latent.
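
The insertion-guidance bullet can be illustrated with a toy version of the inner loop. This sketch assumes a masked squared-error objective between the current latent and the inverted retrieval latent at the same timestep. The actual loss, step size, iteration count, and mask used by RAG-Gesture are not stated here, and `insertion_guidance` is a hypothetical name.

```python
from typing import List


def insertion_guidance(
    x_t: List[float],
    target_t: List[float],
    mask: List[float],
    num_iters: int = 3,
    step_size: float = 0.25,
) -> List[float]:
    """Nudge x_t toward target_t inside the masked window.

    Performs gradient descent on the assumed loss
    0.5 * sum(mask_i * (x_i - target_i) ** 2), whose gradient with respect
    to x_i is mask_i * (x_i - target_i).
    """
    x = list(x_t)
    for _ in range(num_iters):
        grad = [m * (xi - ti) for xi, ti, m in zip(x, target_t, mask)]
        x = [xi - step_size * g for xi, g in zip(x, grad)]
    return x


current = [0.0, 0.0, 0.0, 0.0]    # toy x_t at some denoising step
retrieved = [1.0, 1.0, 1.0, 1.0]  # toy inverted retrieval latent at the same step
window = [0.0, 1.0, 1.0, 0.0]     # only the middle positions are guided

print(insertion_guidance(current, retrieved, window))
# [0.0, 0.578125, 0.578125, 0.0]
```

Positions outside the window are left untouched, and a few small steps move the guided positions only part of the way toward the retrieved latent, so the denoiser keeps control of the final motion.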
How to use
```bash
# 1. Clone the code
git clone https://github.com/m-hamza-mughal/RAG-Gesture
cd RAG-Gesture
pip install -r requirements.txt

# 2. Pull these weights + SMPL-X deps
python tools/download_weights.py

# 3. Pull the BEAT2 dataset + RAG-Gesture annotations
python tools/download_annotations.py

# 4. Run RAG-guided inference (discourse retrieval)
PYTHONPATH=".":$PYTHONPATH python tools/visualize.py \
    experiments/diffusion/base_beatx_len150fps15_finalweights/basegesture_len150_beat.py \
    experiments/diffusion/base_beatx_len150fps15_finalweights/epoch_64.pth \
    --retrieval_method discourse_guidance_test \
    --use_retrieval --use_inversion --use_insertion_guidance \
    --guidance_iters decreasing_till_25
```
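
Before launching step 4, it can help to confirm that the files the command refers to are actually present. The check below is a small sketch, not part of the repository: `missing_files` is a hypothetical helper, and it assumes the paths are relative to the root of the cloned repository, as the command implies.

```python
from pathlib import Path
from typing import List

EXP_DIR = Path("experiments/diffusion/base_beatx_len150fps15_finalweights")
REQUIRED = [
    Path("tools/visualize.py"),
    EXP_DIR / "basegesture_len150_beat.py",  # first positional argument in step 4
    EXP_DIR / "epoch_64.pth",                # second positional argument in step 4
]


def missing_files(repo_root: str = ".") -> List[str]:
    """Return the required paths (relative to repo_root) that do not exist."""
    root = Path(repo_root)
    return [p.as_posix() for p in REQUIRED if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing before inference:", ", ".join(missing))
    else:
        print("All inputs for step 4 are in place.")
```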
See the GitHub README for training, evaluation, long-form synthesis, and ablation commands.
Citation
```bibtex
@InProceedings{mughal2024raggesture,
  title     = {Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis},
  author    = {M. Hamza Mughal and Rishabh Dabral and Merel C. J. Scholman and Vera Demberg and Christian Theobalt},
  booktitle = {Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025}
}
```
Acknowledgements
Built on EMAGE / PantoMatrix and ReMoDiffuse.