Dance2Music: Music2Pose Transformer

Generate dancing skeleton animations from any audio file using a causal transformer trained on AIST++ dance videos.

Author: Emma Genthon · genthonemma@gmail.com
Best checkpoint: epoch 180 (pytorch_model.pt)

What it does

Given a music track, the model autoregressively predicts 33 MediaPipe-style body landmarks at 25 fps, then renders them as a skeleton video synced to the audio.

Audio (32 kHz) → mel + onset + beat features → Music2PoseTransformer → poses (33×3) → skeleton video
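
The frame-rate arithmetic behind this pipeline can be sketched as follows. This is a minimal illustration, assuming audio features are extracted at the same 25 fps as the poses, so that one feature frame spans 32000 / 25 = 1280 samples. The constant and function names are hypothetical and are not taken from audio_features.py.

```python
# Hypothetical constants mirroring the figures quoted in this README.
AUDIO_SR = 32_000   # audio sample rate in Hz
POSE_FPS = 25       # pose frames per second

# Assumption: one audio feature frame per pose frame.
SAMPLES_PER_FRAME = AUDIO_SR // POSE_FPS  # 1280


def num_pose_frames(num_samples: int) -> int:
    """Number of whole pose frames covered by `num_samples` audio samples."""
    return num_samples // SAMPLES_PER_FRAME


def frame_to_seconds(frame_index: int) -> float:
    """Timestamp (in seconds) at which a pose frame starts."""
    return frame_index / POSE_FPS


ten_seconds = 10 * AUDIO_SR
print(num_pose_frames(ten_seconds))   # 250
print(frame_to_seconds(60))           # 2.4
```

Under that assumption, a 10-second clip yields 250 pose frames, and frame 60 starts 2.4 s into the track.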

Quick start

git clone https://huggingface.co/Emma5099/dance2music
cd dance2music
pip install -r requirements.txt

# Requires ffmpeg on PATH
python generate.py --audio your_song.wav --out dance.mp4
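
Because generate.py needs ffmpeg on PATH, a quick pre-flight check can save a failed run. The sketch below uses only the Python standard library. The helper name is hypothetical and is not part of this repository.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


if not ffmpeg_available():
    print("ffmpeg not found on PATH; install it before running generate.py")
```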

Python API

import soundfile as sf
import librosa
import torch

from audio_features import AUDIO_SR, audio_to_features
from inference import load_checkpoint, generate_poses

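# Load the track as float32; soundfile returns (frames, channels) for multi-channel audio
waveform, sr = sf.read("song.wav", dtype="float32")
if waveform.ndim > 1:
    waveform = waveform.mean(axis=1)  # downmix to mono
if sr != AUDIO_SR:
    # Resample to the rate the feature extractor expects
    waveform = librosa.resample(waveform, orig_sr=sr, target_sr=AUDIO_SR)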

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, ckpt = load_checkpoint("pytorch_model.pt", device)
poses = generate_poses(
    model, audio_to_features(waveform),
    ckpt["x_mean"], ckpt["x_std"],
    ckpt["y_mean"], ckpt["y_std"],
    device,
)
# poses.shape → (T, 33, 3)
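
The checkpoint stores normalisation statistics alongside the weights. How generate_poses applies them is not shown here. A plausible reading, sketched below as an assumption, is that audio features are z-scored with x_mean / x_std before entering the model and that model outputs are mapped back to pose coordinates with y_mean / y_std. The function names and the epsilon guard are illustrative only.

```python
EPS = 1e-8  # guards against division by zero for constant features


def normalize(values, mean, std):
    """Z-score each value with its per-dimension mean and std."""
    return [(v - m) / (s + EPS) for v, m, s in zip(values, mean, std)]


def denormalize(values, mean, std):
    """Invert `normalize`: map z-scores back to the original scale."""
    return [v * (s + EPS) + m for v, m, s in zip(values, mean, std)]


# Toy three-dimensional feature vector with made-up statistics.
features = [0.5, 2.0, -1.0]
x_mean = [0.0, 1.0, -1.0]
x_std = [1.0, 2.0, 0.5]

z = normalize(features, x_mean, x_std)
restored = denormalize(z, x_mean, x_std)  # recovers `features` up to float error
```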

Model

| Property | Value |
|---|---|
| Architecture | Causal transformer decoder with audio cross-attention |
| Hidden size | 256 |
| Layers / heads | 6 / 8 |
| Audio input | 80 mel bands + onset strength + beat flag (82 dims/frame) |
| Pose output | 33 landmarks × 9 channels (xyz + velocity + acceleration); inference uses xyz only |
| Context window | 60 frames (2.4 s rolling buffer) |
| Training epochs | 180 (selected as best overall checkpoint) |
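
The 60-frame context window means generation only ever conditions on the most recent 2.4 s of poses (60 frames / 25 fps). A rolling buffer of that kind can be sketched with collections.deque. This is an illustration under stated assumptions, not the code in inference.py: predict_next is a hypothetical stand-in for a model forward pass, and the toy poses are single floats rather than 33×3 arrays.

```python
from collections import deque

CONTEXT_FRAMES = 60  # context window from the table above
POSE_FPS = 25


def predict_next(context, audio_frame):
    """Hypothetical stand-in for the model: blends the last pose with the audio value."""
    return 0.9 * context[-1] + 0.1 * audio_frame


def generate(audio_frames, initial_pose=0.0):
    """Autoregressively produce one pose per audio frame with a rolling context."""
    # maxlen makes the deque drop its oldest frame once the window is full.
    context = deque([initial_pose], maxlen=CONTEXT_FRAMES)
    poses = []
    for audio_frame in audio_frames:
        pose = predict_next(context, audio_frame)
        poses.append(pose)
        context.append(pose)  # feed the prediction back in
    return poses, context


poses, context = generate([1.0] * 100)
print(len(poses), len(context))   # 100 60
print(CONTEXT_FRAMES / POSE_FPS)  # 2.4
```

After 100 generated frames the buffer holds only the last 60, so memory and per-step cost stay constant regardless of track length.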

Validation metrics (sample sBT_c01_v7)

| Metric | Value |
|---|---|
| MSE (xyz) | 0.624 |
| Motion speed | 18.4% of ground truth |
| Balanced score | 1.306 |

The model produces human-shaped skeletons with visible motion. Beat alignment is still limited and improves with more training data and epochs.
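
How the motion speed figure is computed is not specified here. One plausible definition, sketched below as an assumption, is the ratio of mean per-frame landmark displacement in the prediction to that in the ground truth. Under that reading, a ratio of 0.184 would mean the generated skeleton moves at roughly a fifth of the reference speed. All names are hypothetical, and the toy sequences hold a single landmark.

```python
import math


def mean_speed(frames):
    """Mean Euclidean displacement per frame, averaged over landmarks.

    `frames` is a list of frames; each frame is a list of (x, y, z) landmarks.
    """
    total, count = 0.0, 0
    for prev, curr in zip(frames, frames[1:]):
        for p, c in zip(prev, curr):
            total += math.dist(p, c)
            count += 1
    return total / count if count else 0.0


def motion_speed_ratio(predicted, ground_truth):
    """Predicted speed as a fraction of ground-truth speed."""
    gt_speed = mean_speed(ground_truth)
    return mean_speed(predicted) / gt_speed if gt_speed else float("nan")


# One landmark moving along x: 1.0 per frame (truth) vs 0.25 per frame (prediction).
truth = [[(float(t), 0.0, 0.0)] for t in range(5)]
pred = [[(0.25 * t, 0.0, 0.0)] for t in range(5)]
print(motion_speed_ratio(pred, truth))  # 0.25
```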

Files

| File | Description |
|---|---|
| pytorch_model.pt | Best checkpoint (epoch 180) with weights + normalisation stats |
| model.py | Music2PoseTransformer architecture |
| audio_features.py | Mel/onset/beat feature extraction |
| inference.py | Checkpoint loading and autoregressive generation |
| generate.py | End-to-end CLI: audio → skeleton MP4 |
| config.json | Model hyperparameters and metadata |

Training (optional)

Training uses AIST++ pose sequences paired with audio chunks, weighted losses on position/velocity/acceleration, and scheduled sampling with a rolling 60-frame context. See the full project source for train.py, evaluate.py, and dataset tooling.
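
The weighted position/velocity/acceleration loss can be illustrated with finite differences. The sketch below is an assumption about the general shape of such a loss, not the contents of train.py: velocity and acceleration are taken as first and second differences of the position sequence, and the weights are placeholder values.

```python
def diff(seq):
    """First-order finite difference of a 1-D sequence."""
    return [b - a for a, b in zip(seq, seq[1:])]


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def motion_loss(pred, target, w_pos=1.0, w_vel=0.5, w_acc=0.25):
    """Weighted sum of position, velocity, and acceleration errors.

    The weights are illustrative placeholders, not the trained configuration.
    """
    pred_vel, target_vel = diff(pred), diff(target)
    pred_acc, target_acc = diff(pred_vel), diff(target_vel)
    return (
        w_pos * mse(pred, target)
        + w_vel * mse(pred_vel, target_vel)
        + w_acc * mse(pred_acc, target_acc)
    )


# One coordinate of one landmark over five frames.
target = [0.0, 1.0, 2.0, 3.0, 4.0]  # steady motion
frozen = [2.0, 2.0, 2.0, 2.0, 2.0]  # right on average, but not moving
print(motion_loss(target, target))  # 0.0
print(motion_loss(frozen, target))  # 2.5
```

The velocity term is what penalises a frozen skeleton beyond its position error: here it adds 0.5 to the position MSE of 2.0.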

Requirements

  • Python 3.10+
  • PyTorch 2.x
  • ffmpeg (for video rendering and audio muxing)
  • ~26 MB disk for model weights

License

MIT

Citation

If you use this model, please cite AIST++ and this repository:

@misc{genthon2026dance2music,
  author = {Emma Genthon},
  title = {Dance2Music: Music-Conditioned Pose Generation},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Emma5099/dance2music}}
}