MoGenTS - Motion Generation Based on Spatial-Temporal Joint Modeling
Text-to-motion baseline integrated into the hftrainer Model Zoo. The runtime is
self-contained under hftrainer.models.motion.mogents.network and does not
import the original repository at inference time.
| Field | Value |
|---|---|
| Task | Text-to-Motion (T2M) |
| Bundle / Pipeline | MoGenTSBundle / MoGenTSPipeline |
| Processed HF artifact | ZeyuLing/hftrainer-mogents-humanml3d |
| Motion representation | HumanML3D-263 (263-dim, 20 fps, 22 joints) |
| Tokenizer | dual-stream RVQ-VAE: 1D auxiliary tokens + 2D spatial-temporal tokens |
| Generator | 1D/2D MaskTransformers + 1D/2D ResidualTransformers |
| Text encoder | CLIP ViT-B/32 (frozen) |
| Paper | MoGenTS: Motion Generation based on Spatial-Temporal Joint Modeling, Yuan et al., NeurIPS 2024 - arXiv:2409.17686 |
| Original code | https://github.com/weihaosky/mogents |
Weights
Self-contained hftrainer artifact:
| Artifact | Location | Contents | Status |
|---|---|---|---|
| MoGenTS HumanML3D | ZeyuLing/hftrainer-mogents-humanml3d | vq.safetensors + mask_aux.safetensors + mask_ts.safetensors + res_aux.safetensors + res_ts.safetensors + length_est.safetensors + clip.safetensors + mogents_config.json + Mean.npy / Std.npy | public Hub artifact |
| local mirror | checkpoints/mogents/humanml3d | same layout | optional local cache |
Convert the official checkpoints into a self-contained hftrainer artifact:
python3 scripts/eval/convert_mogents_checkpoint.py \
    --weights_root logs \
    --length_root checkpoints \
    --out_dir checkpoints/mogents/humanml3d \
    --verify
Expected artifact layout:
checkpoints/mogents/humanml3d/
    mogents_config.json
    model_index.json
    vq.safetensors
    mask_aux.safetensors
    mask_ts.safetensors
    res_aux.safetensors
    res_ts.safetensors
    length_est.safetensors
    clip.safetensors
    Mean.npy
    Std.npy
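Before loading a local mirror, it can help to confirm that the directory matches the layout above. The following is a minimal sketch and not part of hftrainer: the helper name `missing_artifact_files` is hypothetical, and the file list is copied from the expected layout. It only checks that files exist, not that their contents are valid.

```python
from pathlib import Path

# File names copied from the expected artifact layout above.
EXPECTED_FILES = [
    "mogents_config.json",
    "model_index.json",
    "vq.safetensors",
    "mask_aux.safetensors",
    "mask_ts.safetensors",
    "res_aux.safetensors",
    "res_ts.safetensors",
    "length_est.safetensors",
    "clip.safetensors",
    "Mean.npy",
    "Std.npy",
]


def missing_artifact_files(artifact_dir):
    """Return the expected file names that are absent from artifact_dir.

    Hypothetical helper for illustration only; existence is checked,
    file contents are not.
    """
    root = Path(artifact_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_artifact_files("checkpoints/mogents/humanml3d")
    print("missing files:", missing or "none")
```

An empty result means every file in the layout is present; any returned names point at a conversion or download that did not complete.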
Use
from hftrainer.pipelines.mogents import MoGenTSPipeline

pipe = MoGenTSPipeline.from_pretrained(
    "ZeyuLing/hftrainer-mogents-humanml3d",
    device="cuda",
)
motions = pipe.infer_t2m(
    ["a person walks forward then turns around"],
    [120],
)  # list of (T, 263)
For a local mirror:
pipe = MoGenTSPipeline.from_pretrained("checkpoints/mogents/humanml3d", device="cuda")
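The second argument to `infer_t2m` in the example above appears to be a per-prompt length in frames; this is an assumption based on the `(T, 263)` output shape, not a documented contract. Under that assumption, and using the 20 fps rate of HumanML3D-263, durations in seconds can be converted to frame counts as sketched below. The helper names are hypothetical.

```python
HUMANML3D_FPS = 20  # frame rate of the HumanML3D-263 representation


def seconds_to_frames(seconds, fps=HUMANML3D_FPS):
    """Convert a duration in seconds to a whole number of frames (at least 1)."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    return max(1, round(seconds * fps))


def frames_to_seconds(num_frames, fps=HUMANML3D_FPS):
    """Convert a frame count back to a duration in seconds."""
    return num_frames / fps


# Hypothetical request: one 6-second clip and one 3-second clip.
durations_s = [6.0, 3.0]
lengths = [seconds_to_frames(s) for s in durations_s]
print(lengths)  # [120, 60]
```

The resulting `lengths` list would be passed alongside a prompt list of the same size, so the 120 used in the example above corresponds to a 6-second motion.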
Motion Representation
MoGenTS natively generates HumanML3D-263 at 20 fps. For cross-model comparison with SMPL or MotionStreamer-272 methods, first generate the native 263-dim outputs and then use the validated bridge:
HumanML3D-263 -> SMPL motion_135 via IK refine-80 -> MotionStreamer-272
The bridge is a representation-conversion diagnostic. It should not be treated as the native MoGenTS paper metric space.
Evaluation
Generate under the official HumanML3D test protocol and score with the HumanML3D-263 evaluator:
python3 scripts/eval/mogents_t2m_h3d263.py \
    --model_path checkpoints/mogents/humanml3d \
    --out_dir outputs/evaluation/t2m/humanml3d_official_test/hml263/mogents_ts10_cfg4_rescfg5_seed0

python3 scripts/eval/verify_evaluators.py --which hml263 \
    --hml263-pred outputs/evaluation/t2m/humanml3d_official_test/hml263/mogents_ts10_cfg4_rescfg5_seed0 \
    --out-dir outputs/evaluation/t2m/humanml3d_official_test/hml263/mogents_ts10_cfg4_rescfg5_seed0/metrics
HumanML3D-263 evaluator (native space, n=3970)
Metric JSON:
outputs/evaluation/t2m/humanml3d_official_test/hml263/mogents_ts10_cfg4_rescfg5_seed0/metrics/verify_hml263.json.
| Metric | hftrainer MoGenTS |
|---|---|
| FID down | 0.0806 |
| R-Precision Top-1 / 2 / 3 up | 0.5219 / 0.7128 / 0.8056 |
| Diversity -> | 9.4063 |
| MM-Dist down | 2.9290 |
| GT(real) R-Precision Top-1 / 2 / 3 | 0.5135 / 0.7108 / 0.8069 |
| GT(real) Diversity / MM-Dist | 9.4527 / 2.9323 |
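Because Diversity is conventionally read as "closer to the real-data value is better", the generated numbers are easiest to interpret as signed gaps to the GT(real) rows. The sketch below does that arithmetic on values copied from the table above; the dictionary keys and the helper name `gap_to_real` are illustrative, not evaluator output fields.

```python
# Values copied from the HumanML3D-263 table above.
generated = {
    "R@1": 0.5219,
    "R@2": 0.7128,
    "R@3": 0.8056,
    "Diversity": 9.4063,
    "MM-Dist": 2.9290,
}
real = {
    "R@1": 0.5135,
    "R@2": 0.7108,
    "R@3": 0.8069,
    "Diversity": 9.4527,
    "MM-Dist": 2.9323,
}


def gap_to_real(gen, ref):
    """Signed difference (generated minus real) per metric, rounded to 4 decimals."""
    return {name: round(gen[name] - ref[name], 4) for name in gen}


gaps = gap_to_real(generated, real)
for name, gap in gaps.items():
    print(f"{name:>9}: {gap:+.4f}")
```

On these numbers every gap is small in magnitude (for example, Top-1 R-Precision differs by +0.0084 and Diversity by -0.0464), which is the sense in which the native-space results track the real data.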
SMPL motion_135 + MotionStreamer-272 evaluator
Convert the same HumanML3D test predictions to SMPL motion_135 and then to
MotionStreamer-272:
NUM_GPUS=8 NUM_SHARDS=8 N_REPEATS=20 \
bash scripts/eval/run_mogents_hml263_to_ms272_chain.sh
The restartable script runs the following stages with --skip-existing:
python3 scripts/eval/hml263_to_smpl_ik.py \
    --in-dir outputs/evaluation/t2m/humanml3d_official_test/hml263/mogents_ts10_cfg4_rescfg5_seed0 \
    --out-dir outputs/evaluation/t2m/humanml3d_official_test/motion135/mogents_ts10_cfg4_rescfg5_seed0_ik80 \
    --model-dir ref_repo/MDM/body_models \
    --source-fps 20 --target-fps 30 \
    --floor-align --refine-iters 80 --refine-lr 0.02 \
    --device cuda --skip-existing

python3 scripts/data/convert_motion135_to_h3d272.py \
    --in-dir outputs/evaluation/t2m/humanml3d_official_test/motion135/mogents_ts10_cfg4_rescfg5_seed0_ik80 \
    --out-dir outputs/evaluation/t2m/humanml3d_official_test/ms272/mogents_ts10_cfg4_rescfg5_seed0_ik80 \
    --workers 8 --skip-existing

python3 scripts/eval/verify_evaluators.py --which ms272 \
    --ms272-pred outputs/evaluation/t2m/humanml3d_official_test/ms272/mogents_ts10_cfg4_rescfg5_seed0_ik80 \
    --n-repeats 20 \
    --out-dir outputs/evaluation/t2m/humanml3d_official_test/ms272/mogents_ts10_cfg4_rescfg5_seed0_ik80/metrics
Metric JSON:
outputs/evaluation/t2m/humanml3d_official_test/ms272/mogents_ts10_cfg4_rescfg5_seed0_ik80/metrics/verify_ms272.json.
Run summary:
outputs/evaluation/t2m/humanml3d_official_test/ms272/mogents_ts10_cfg4_rescfg5_seed0_ik80/metrics/run_summary.json.
| Metric | MoGenTS HML263 -> SMPL135 -> MS272 | MS272 GT(real) |
|---|---|---|
| FID down | 113.0856 | 0.0 |
| R-Precision Top-1 / 2 / 3 up | 0.4764 / 0.6321 / 0.7099 | 0.7059 / 0.8569 / 0.9106 |
| Diversity -> | 25.3033 | 27.3692 |
| MM-Dist down | 19.4679 | 15.0066 |
| Samples used | 7392 | 7392 |
| Missing predictions skipped | 0 | - |
Bridge outputs contain 4012 HML263 predictions, 4012 SMPL motion_135 files,
and 4012 MotionStreamer-272 files. The SMPL IK shard summaries report zero
conversion failures and mean joint-fit MPJPE around 15.2-15.6 mm.
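
For readers unfamiliar with the joint-fit figure quoted above, MPJPE (mean per-joint position error) is the Euclidean distance between corresponding joints, averaged over all joints and frames. The sketch below shows the standard definition with NumPy. It is not the code used by the IK script, and it assumes joint positions in metres with shape (T, J, 3); the 22-joint toy input mirrors the HumanML3D skeleton.

```python
import numpy as np


def mpjpe_mm(pred_joints_m, target_joints_m):
    """Mean per-joint position error in millimetres.

    Both inputs must share the shape (T, J, 3) and are assumed to be in metres.
    """
    pred = np.asarray(pred_joints_m, dtype=np.float64)
    target = np.asarray(target_joints_m, dtype=np.float64)
    if pred.shape != target.shape or pred.shape[-1] != 3:
        raise ValueError("expected two arrays of identical shape (T, J, 3)")
    per_joint = np.linalg.norm(pred - target, axis=-1)  # (T, J) distances in metres
    return float(per_joint.mean() * 1000.0)


# Toy check: shift every joint by 15 mm along x across 4 frames of 22 joints.
target = np.zeros((4, 22, 3))
pred = target.copy()
pred[..., 0] += 0.015
print(round(mpjpe_mm(pred, target), 3))  # 15.0
```

A reported value of roughly 15 mm therefore means the refined SMPL joints sit, on average, about 1.5 cm from the joints recovered from the HumanML3D-263 features.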
Implementation Notes
- Architecture: MoGenTS generates a 1D auxiliary token stream and a 2D spatial-temporal token grid, then decodes both streams together with the dual RVQ-VAE.
- Runtime package: hftrainer/models/motion/mogents/network/ contains only the inference-time model components from the MIT-licensed upstream code.
- Artifact loading: MoGenTSBundle.from_pretrained consumes local/HF-style artifacts and stores CLIP once as clip.safetensors; raw upstream .tar checkpoints are supported only through explicit converter/debug paths.
- Native representation: generated outputs are HumanML3D-263. Any MotionStreamer-272 or SMPL motion_135 comparison should be produced by the existing representation-conversion pipeline after generation.