MoGenTS - Motion Generation Based on Spatial-Temporal Joint Modeling

Text-to-motion baseline integrated into the hftrainer Model Zoo. The runtime is self-contained under hftrainer.models.motion.mogents.network and does not import the original repository at inference time.

Task | Text-to-Motion (T2M)
Bundle / Pipeline | MoGenTSBundle / MoGenTSPipeline
Processed HF artifact | ZeyuLing/hftrainer-mogents-humanml3d
Motion representation | HumanML3D-263 (263-dim, 20 fps, 22 joints)
Tokenizer | dual-stream RVQ-VAE: 1D auxiliary tokens + 2D spatial-temporal tokens
Generator | 1D/2D MaskTransformers + 1D/2D ResidualTransformers
Text encoder | CLIP ViT-B/32 (frozen)
Paper | MoGenTS: Motion Generation based on Spatial-Temporal Joint Modeling, Yuan et al., NeurIPS 2024 - arXiv:2409.17686
Original code | https://github.com/weihaosky/mogents

Weights

Self-contained hftrainer artifact:

Artifact | Location | Contents | Status
MoGenTS HumanML3D | ZeyuLing/hftrainer-mogents-humanml3d | vq.safetensors + mask_aux.safetensors + mask_ts.safetensors + res_aux.safetensors + res_ts.safetensors + length_est.safetensors + clip.safetensors + mogents_config.json + Mean.npy / Std.npy | public Hub artifact
local mirror | checkpoints/mogents/humanml3d | same layout | optional local cache

Convert the official checkpoints into a self-contained hftrainer artifact:

python3 scripts/eval/convert_mogents_checkpoint.py \
    --weights_root logs \
    --length_root checkpoints \
    --out_dir checkpoints/mogents/humanml3d \
    --verify

Expected artifact layout:

checkpoints/mogents/humanml3d/
  mogents_config.json
  model_index.json
  vq.safetensors
  mask_aux.safetensors
  mask_ts.safetensors
  res_aux.safetensors
  res_ts.safetensors
  length_est.safetensors
  clip.safetensors
  Mean.npy
  Std.npy
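
Before loading a converted or mirrored artifact, it can help to confirm that every file in the layout above is present. The following is a minimal sketch, not part of hftrainer: `missing_artifact_files` is a hypothetical helper, and the file names are copied from the expected layout.

```python
from pathlib import Path

# File names copied from the expected artifact layout above.
EXPECTED_FILES = (
    "mogents_config.json",
    "model_index.json",
    "vq.safetensors",
    "mask_aux.safetensors",
    "mask_ts.safetensors",
    "res_aux.safetensors",
    "res_ts.safetensors",
    "length_est.safetensors",
    "clip.safetensors",
    "Mean.npy",
    "Std.npy",
)


def missing_artifact_files(root):
    """Return the expected file names that are absent from `root` (hypothetical helper)."""
    root = Path(root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_artifact_files("checkpoints/mogents/humanml3d")
    print("artifact complete" if not missing else f"missing: {missing}")
```

An empty list means the directory matches the layout; it does not validate the contents of the files.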

Use

from hftrainer.pipelines.mogents import MoGenTSPipeline

pipe = MoGenTSPipeline.from_pretrained(
    "ZeyuLing/hftrainer-mogents-humanml3d",
    device="cuda",
)
motions = pipe.infer_t2m(
    ["a person walks forward then turns around"],
    [120],
)  # returns a list with one (T, 263) motion per prompt

For a local mirror:

pipe = MoGenTSPipeline.from_pretrained("checkpoints/mogents/humanml3d", device="cuda")
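
The second argument to `infer_t2m` in the example above looks like a per-prompt motion length. Assuming it is a frame count at the native 20 fps (an assumption, this card does not state the unit), a target duration in seconds can be converted as sketched below. `seconds_to_frames` is a hypothetical helper, not an hftrainer function.

```python
HUMANML3D_FPS = 20  # frame rate this card states for HumanML3D-263


def seconds_to_frames(seconds, fps=HUMANML3D_FPS):
    """Convert a duration in seconds to a whole number of frames (hypothetical helper)."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    return max(1, round(seconds * fps))


prompts = ["a person walks forward then turns around"]
lengths = [seconds_to_frames(6.0)]  # 6 s at 20 fps gives 120 frames

# Assumes the lengths argument is a list of frame counts (not confirmed by this card):
# motions = pipe.infer_t2m(prompts, lengths)
```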

Motion Representation

MoGenTS natively generates HumanML3D-263 at 20 fps. For cross-model comparison with SMPL or MotionStreamer-272 methods, first generate the native 263-dim outputs and then use the validated bridge:

HumanML3D-263 -> SMPL motion_135 via IK refine-80 -> MotionStreamer-272

The bridge is a diagnostic for representation conversion. Metrics computed after the bridge are not in the native HumanML3D-263 metric space of the MoGenTS paper and should not be compared with the paper's numbers.


Evaluation

Generate under the official HumanML3D test protocol and score with the HumanML3D-263 evaluator:

python3 scripts/eval/mogents_t2m_h3d263.py \
    --model_path checkpoints/mogents/humanml3d \
    --out_dir outputs/evaluation/t2m/humanml3d_official_test/hml263/mogents_ts10_cfg4_rescfg5_seed0

python3 scripts/eval/verify_evaluators.py --which hml263 \
    --hml263-pred outputs/evaluation/t2m/humanml3d_official_test/hml263/mogents_ts10_cfg4_rescfg5_seed0 \
    --out-dir outputs/evaluation/t2m/humanml3d_official_test/hml263/mogents_ts10_cfg4_rescfg5_seed0/metrics

HumanML3D-263 evaluator (native space, n=3970)

Metric JSON: outputs/evaluation/t2m/humanml3d_official_test/hml263/mogents_ts10_cfg4_rescfg5_seed0/metrics/verify_hml263.json.

Metric | hftrainer MoGenTS
FID ↓ | 0.0806
R-Precision Top-1 / 2 / 3 ↑ | 0.5219 / 0.7128 / 0.8056
Diversity → | 9.4063
MM-Dist ↓ | 2.9290
GT(real) R-Precision Top-1 / 2 / 3 | 0.5135 / 0.7108 / 0.8069
GT(real) Diversity / MM-Dist | 9.4527 / 2.9323
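
For readers unfamiliar with the R-Precision rows, the sketch below shows the retrieval protocol commonly used by HumanML3D-style evaluators: within each batch of 32 aligned (text, motion) embedding pairs, the motions are ranked by Euclidean distance to each text, and a hit is counted when the matching motion lands in the top k. This illustrates the metric definition under those assumptions; it is not the code in `verify_evaluators.py`, and the batch size of 32 is assumed rather than taken from this card.

```python
import numpy as np


def r_precision(text_emb, motion_emb, top_k=3, batch_size=32):
    """Return top-1..top_k retrieval accuracy for aligned (text, motion) rows."""
    text_emb = np.asarray(text_emb, dtype=np.float64)
    motion_emb = np.asarray(motion_emb, dtype=np.float64)
    n = (len(text_emb) // batch_size) * batch_size  # drop the incomplete last batch
    if n == 0:
        raise ValueError("need at least one full batch")
    hits = np.zeros(top_k)
    for start in range(0, n, batch_size):
        t = text_emb[start:start + batch_size]
        m = motion_emb[start:start + batch_size]
        # Pairwise Euclidean distances between texts (rows) and motions (columns).
        dist = np.linalg.norm(t[:, None, :] - m[None, :, :], axis=-1)
        order = np.argsort(dist, axis=1)
        # Position of the matching motion (same row index) in each sorted list.
        rank = np.argmax(order == np.arange(batch_size)[:, None], axis=1)
        for k in range(top_k):
            hits[k] += np.sum(rank <= k)
    return hits / n
```

The returned array holds Top-1, Top-2 and Top-3 accuracy in that order, which is why the values in each table row are non-decreasing.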

SMPL motion_135 + MotionStreamer-272 evaluator

Convert the same HumanML3D test predictions to SMPL motion_135 and then to MotionStreamer-272:

NUM_GPUS=8 NUM_SHARDS=8 N_REPEATS=20 \
    bash scripts/eval/run_mogents_hml263_to_ms272_chain.sh

The script is restartable: the two conversion stages pass --skip-existing, so a rerun skips outputs that already exist. It runs the following stages:

python3 scripts/eval/hml263_to_smpl_ik.py \
    --in-dir outputs/evaluation/t2m/humanml3d_official_test/hml263/mogents_ts10_cfg4_rescfg5_seed0 \
    --out-dir outputs/evaluation/t2m/humanml3d_official_test/motion135/mogents_ts10_cfg4_rescfg5_seed0_ik80 \
    --model-dir ref_repo/MDM/body_models \
    --source-fps 20 --target-fps 30 \
    --floor-align --refine-iters 80 --refine-lr 0.02 \
    --device cuda --skip-existing

python3 scripts/data/convert_motion135_to_h3d272.py \
    --in-dir outputs/evaluation/t2m/humanml3d_official_test/motion135/mogents_ts10_cfg4_rescfg5_seed0_ik80 \
    --out-dir outputs/evaluation/t2m/humanml3d_official_test/ms272/mogents_ts10_cfg4_rescfg5_seed0_ik80 \
    --workers 8 --skip-existing

python3 scripts/eval/verify_evaluators.py --which ms272 \
    --ms272-pred outputs/evaluation/t2m/humanml3d_official_test/ms272/mogents_ts10_cfg4_rescfg5_seed0_ik80 \
    --n-repeats 20 \
    --out-dir outputs/evaluation/t2m/humanml3d_official_test/ms272/mogents_ts10_cfg4_rescfg5_seed0_ik80/metrics

Metric JSON: outputs/evaluation/t2m/humanml3d_official_test/ms272/mogents_ts10_cfg4_rescfg5_seed0_ik80/metrics/verify_ms272.json.

Run summary: outputs/evaluation/t2m/humanml3d_official_test/ms272/mogents_ts10_cfg4_rescfg5_seed0_ik80/metrics/run_summary.json.

Metric | MoGenTS HML263 -> SMPL135 -> MS272 | MS272 GT(real)
FID ↓ | 113.0856 | 0.0
R-Precision Top-1 / 2 / 3 ↑ | 0.4764 / 0.6321 / 0.7099 | 0.7059 / 0.8569 / 0.9106
Diversity → | 25.3033 | 27.3692
MM-Dist ↓ | 19.4679 | 15.0066
Samples used | 7392 | 7392
Missing predictions skipped | 0 | -
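
The Diversity and MM-Dist rows are read against the GT(real) column: Diversity is better when it is close to the real-data value, while MM-Dist is better when lower. The sketch below shows the usual definitions of both metrics on precomputed embeddings. It is an illustration under stated assumptions, not the evaluator's code: the function names are hypothetical, and the default of 300 random pairs is a commonly used value that this card does not confirm.

```python
import numpy as np


def mm_dist(text_emb, motion_emb):
    """Mean Euclidean distance between each text and its own generated motion."""
    text_emb = np.asarray(text_emb, dtype=np.float64)
    motion_emb = np.asarray(motion_emb, dtype=np.float64)
    return float(np.linalg.norm(text_emb - motion_emb, axis=-1).mean())


def diversity(motion_emb, num_pairs=300, seed=0):
    """Mean Euclidean distance between two random subsets of motion embeddings."""
    motion_emb = np.asarray(motion_emb, dtype=np.float64)
    n = len(motion_emb)
    size = min(num_pairs, n)  # cannot draw more unique indices than samples
    rng = np.random.default_rng(seed)
    first = rng.choice(n, size=size, replace=False)
    second = rng.choice(n, size=size, replace=False)
    return float(np.linalg.norm(motion_emb[first] - motion_emb[second], axis=-1).mean())
```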

Bridge outputs contain 4012 HML263 predictions, 4012 SMPL motion_135 files, and 4012 MotionStreamer-272 files. The SMPL IK shard summaries report zero conversion failures and mean joint-fit MPJPE around 15.2-15.6 mm.
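
For reference, mean per-joint position error (MPJPE) is the average Euclidean distance between fitted and target joint positions over all frames and joints. The sketch below assumes `(T, J, 3)` arrays in metres and reports millimetres; the IK script's internal units and joint set are not stated in this card, and `mpjpe_mm` is a hypothetical helper.

```python
import numpy as np


def mpjpe_mm(pred_joints, target_joints):
    """Mean per-joint position error in millimetres for (T, J, 3) arrays in metres."""
    pred = np.asarray(pred_joints, dtype=np.float64)
    target = np.asarray(target_joints, dtype=np.float64)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")
    per_joint = np.linalg.norm(pred - target, axis=-1)  # shape (T, J)
    return float(per_joint.mean() * 1000.0)
```

With this definition, a uniform 15 mm offset on every joint in every frame gives an MPJPE of 15 mm.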

Implementation Notes

  • Architecture: MoGenTS generates a 1D auxiliary token stream and a 2D spatial-temporal token grid, then decodes both streams together with the dual RVQ-VAE.
  • Runtime package: hftrainer/models/motion/mogents/network/ contains only the inference-time model components from the MIT-licensed upstream code.
  • Artifact loading: MoGenTSBundle.from_pretrained consumes local/HF-style artifacts and stores CLIP once as clip.safetensors; raw upstream .tar checkpoints are supported only through explicit converter/debug paths.
  • Native representation: generated outputs are HumanML3D-263. Any MotionStreamer-272 or SMPL motion_135 comparison should be produced by the existing representation-conversion pipeline after generation.
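
To make the RVQ-VAE terminology above concrete, here is a generic residual vector quantization sketch in NumPy: each layer quantizes what the previous layers left over, and decoding sums the selected codewords. It illustrates the general technique only. It is not the MoGenTS tokenizer, and the function names, codebook shapes and toy values are chosen purely for illustration.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """Quantize vectors x (N, D) with a list of (K, D) codebooks; return (N, L) codes."""
    residual = np.asarray(x, dtype=np.float64).copy()
    codes = []
    for cb in codebooks:
        # Nearest codeword to the current residual, per input vector.
        dist = np.linalg.norm(residual[:, None, :] - cb[None, :, :], axis=-1)
        idx = np.argmin(dist, axis=1)
        codes.append(idx)
        residual -= cb[idx]  # the next layer quantizes what is left over
    return np.stack(codes, axis=1)


def rvq_decode(codes, codebooks):
    """Reconstruct vectors by summing the selected codeword from every layer."""
    return sum(cb[codes[:, i]] for i, cb in enumerate(codebooks))


# Toy example: a coarse codebook followed by a fine one.
coarse = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
fine = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
codes = rvq_encode(np.array([[10.0, 1.0]]), [coarse, fine])
recon = rvq_decode(codes, [coarse, fine])
```

In the toy example the first layer picks the coarse codeword `[10, 0]` and the second layer picks `[0, 1]` for the remaining residual, so the input is reconstructed exactly.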