HandX — Diffusion Text-to-Motion Checkpoints

Diffusion checkpoints for HandX: Scaling Bimanual Motion and Interaction Generation (CVPR 2026). The models generate two-hand motion from text, with separate text branches for the left hand, the right hand, and their interaction. They use an MDM-style diffusion model with a frozen T5-base text encoder.
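The block below is a minimal summary of the MDM formulation that "MDM-style" refers to. It assumes HandX keeps MDM's standard DDPM forward process and its sample-prediction (x_0) objective. This card does not specify the noise schedule, any auxiliary losses, or how the three text embeddings are combined. The notation $c = (c_{\text{L}}, c_{\text{R}}, c_{\text{I}})$ is hypothetical: it stands for the T5 embeddings of the left-hand, right-hand, and interaction texts. $G$ is the denoiser.

$$
\begin{aligned}
q(x_t \mid x_0) &= \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\right),
\qquad \bar\alpha_t = \prod_{s=1}^{t} (1-\beta_s),\\
\mathcal{L}_{\text{simple}} &= \mathbb{E}_{x_0,\,t,\,\epsilon}\left[\big\lVert x_0 - G(x_t, t, c)\big\rVert_2^2\right],
\qquad x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ \ \epsilon \sim \mathcal{N}(0, I).
\end{aligned}
$$

Predicting $x_0$ directly, rather than the noise $\epsilon$, is the choice MDM makes for motion.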

📄 Paper: https://arxiv.org/abs/2603.28766
📦 Dataset: https://huggingface.co/datasets/alexzhang598/HandX

Checkpoints

| Folder | Decoder layers | `latent_dim` | Notes |
|---|---|---|---|
| `layers4` | 4 | 256 | |
| `layers8` | 8 | 512 | |
| `layers12` | 12 | 512 | Best model in the paper |

Each folder contains `model.pt` (weights) and `config.yaml` (model configuration).
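The helper below records the variants and resolves their file paths in one place, so a typo in the variant name fails early rather than as a download error. It is a small sketch: the layer counts and latent sizes are copied from the table, and the `<variant>/config.yaml` / `<variant>/model.pt` layout follows this card. The names `Variant`, `VARIANTS` and `checkpoint_files` are hypothetical and are not part of the HandX codebase.

```python
from dataclasses import dataclass

REPO_ID = "alexzhang598/HandX-diffusion"


@dataclass(frozen=True)
class Variant:
    """Hypothetical record of one checkpoint folder (values from the table)."""
    name: str
    decoder_layers: int
    latent_dim: int


# Folder name -> architecture summary, as listed in the checkpoint table.
VARIANTS = {
    v.name: v
    for v in (
        Variant("layers4", 4, 256),
        Variant("layers8", 8, 512),
        Variant("layers12", 12, 512),
    )
}


def checkpoint_files(variant: str) -> tuple[str, str]:
    """Return the (config, weights) paths inside the repo for a variant."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; choose from {sorted(VARIANTS)}")
    return f"{variant}/config.yaml", f"{variant}/model.pt"
```

Each returned path can be passed straight to `hf_hub_download(REPO_ID, path)`.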

Loading

```python
import torch
from huggingface_hub import hf_hub_download
from omegaconf import OmegaConf

# Run from the `diffusion/` directory of the HandX repo so that `src` is importable.
from src.diffusion.utils.model_utils import create_model_and_diffusion

REPO_ID = "alexzhang598/HandX-diffusion"
variant = "layers12"  # one of: layers4, layers8, layers12

# Build the model and diffusion process from the variant's config.
cfg = OmegaConf.load(hf_hub_download(REPO_ID, f"{variant}/config.yaml"))
model, diffusion = create_model_and_diffusion(cfg.model)

# Load the trained weights, stored under the "state_dict" key of the checkpoint.
ckpt_path = hf_hub_download(REPO_ID, f"{variant}/model.pt")
sd = torch.load(ckpt_path, map_location="cpu")["state_dict"]
model.load_state_dict(sd, strict=False)  # missing keys are the frozen T5 encoder (loaded from t5-base)
model.eval()  # inference mode (disables dropout)
```

The checkpoints load with a standard `load_state_dict(..., strict=False)`. The only missing keys are the frozen T5 encoder weights, which are restored from the pretrained `t5-base` model when the model is constructed.
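Because `strict=False` silently skips mismatched keys, it is worth confirming that the skipped keys really are only the T5 weights. PyTorch's `Module.load_state_dict` returns a named tuple with `missing_keys` and `unexpected_keys`, and the sketch below checks those lists. The prefix `TEXT_ENCODER_PREFIX = "text_encoder."` is a hypothetical placeholder: print the missing keys once to find the prefix the HandX model actually uses. Treating any unexpected key as an error is a choice made by this sketch, not something the card states.

```python
from typing import Iterable

TEXT_ENCODER_PREFIX = "text_encoder."  # hypothetical; replace with the real T5 submodule prefix


def check_nonstrict_load(
    missing_keys: Iterable[str],
    unexpected_keys: Iterable[str],
    allowed_prefix: str = TEXT_ENCODER_PREFIX,
) -> None:
    """Raise if a non-strict load skipped anything besides the frozen text encoder."""
    unexpected = list(unexpected_keys)
    if unexpected:
        # Keys in the checkpoint that the model does not have: likely a config mismatch.
        raise RuntimeError(f"unexpected keys in checkpoint: {unexpected[:5]}")
    stray = [k for k in missing_keys if not k.startswith(allowed_prefix)]
    if stray:
        # Trainable weights absent from the checkpoint would stay randomly initialized.
        raise RuntimeError(f"missing non-encoder weights: {stray[:5]}")
```

Usage: `result = model.load_state_dict(sd, strict=False)`, then `check_nonstrict_load(result.missing_keys, result.unexpected_keys)`.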

Paper: HandX: Scaling Bimanual Motion and Interaction Generation (arXiv 2603.28766, published Mar 30).