MoGenTS - Motion Generation Based on Spatial-Temporal Joint Modeling

Text-to-motion baseline integrated into the hftrainer Model Zoo. The runtime is self-contained under hftrainer.models.motion.mogents.network and does not import the original repository at inference time.


Task	Text-to-Motion (T2M)
Bundle / Pipeline	`MoGenTSBundle` / `MoGenTSPipeline`
Processed HF artifact	`ZeyuLing/hftrainer-mogents-humanml3d`
Motion representation	HumanML3D-263 (263-dim, 20 fps, 22 joints)
Tokenizer	dual-stream RVQ-VAE: 1D auxiliary tokens + 2D spatial-temporal tokens
Generator	1D/2D MaskTransformers + 1D/2D ResidualTransformers
Text encoder	CLIP ViT-B/32 (frozen)
Paper	MoGenTS: Motion Generation based on Spatial-Temporal Joint Modeling, Yuan et al., NeurIPS 2024 - arXiv:2409.17686
Original code	https://github.com/weihaosky/mogents

Weights

Self-contained hftrainer artifact:

Artifact	Location	Contents	Status
MoGenTS HumanML3D	`ZeyuLing/hftrainer-mogents-humanml3d`	`vq.safetensors` + `mask_aux.safetensors` + `mask_ts.safetensors` + `res_aux.safetensors` + `res_ts.safetensors` + `length_est.safetensors` + `clip.safetensors` + `mogents_config.json` + `Mean.npy` / `Std.npy`	public Hub artifact
local mirror	`checkpoints/mogents/humanml3d`	same layout	optional local cache

Convert the official checkpoints into a self-contained hftrainer artifact:

python3 scripts/eval/convert_mogents_checkpoint.py \
    --weights_root logs \
    --length_root checkpoints \
    --out_dir checkpoints/mogents/humanml3d \
    --verify

Expected artifact layout:

checkpoints/mogents/humanml3d/
  mogents_config.json
  model_index.json
  vq.safetensors
  mask_aux.safetensors
  mask_ts.safetensors
  res_aux.safetensors
  res_ts.safetensors
  length_est.safetensors
  clip.safetensors
  Mean.npy
  Std.npy
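
Before loading a local mirror, it can help to confirm that the directory matches the layout above. The following is a minimal sketch using only the standard library; the helper name `missing_artifact_files` is hypothetical and is not part of hftrainer, and the file names are copied from the listing above.

```python
from pathlib import Path

# File names copied from the expected artifact layout listed in this card.
EXPECTED_FILES = (
    "mogents_config.json",
    "model_index.json",
    "vq.safetensors",
    "mask_aux.safetensors",
    "mask_ts.safetensors",
    "res_aux.safetensors",
    "res_ts.safetensors",
    "length_est.safetensors",
    "clip.safetensors",
    "Mean.npy",
    "Std.npy",
)


def missing_artifact_files(artifact_dir):
    """Return the expected file names that are absent from artifact_dir.

    Hypothetical helper for illustration; it is not part of hftrainer.
    """
    root = Path(artifact_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_artifact_files("checkpoints/mogents/humanml3d")
    print("missing files:", missing or "none")
```

An empty result means every expected file is present. The check covers file names only and says nothing about whether the weights themselves are valid.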

Use

from hftrainer.pipelines.mogents import MoGenTSPipeline

pipe = MoGenTSPipeline.from_pretrained(
    "ZeyuLing/hftrainer-mogents-humanml3d",
    device="cuda",
)
motions = pipe.infer_t2m(
    ["a person walks forward then turns around"],
    [120],
)  # list of (T, 263)

For a local mirror:

pipe = MoGenTSPipeline.from_pretrained("checkpoints/mogents/humanml3d", device="cuda")
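
The second argument in the example above looks like a per-prompt motion length. Assuming it is a frame count at the native 20 fps (the card does not say so explicitly), durations in seconds could be converted as sketched below. The helper names and the second prompt are illustrative only.

```python
FPS = 20  # native HumanML3D-263 frame rate stated in this card


def seconds_to_frames(seconds, fps=FPS):
    """Convert a duration in seconds to a whole number of frames."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    return max(1, round(seconds * fps))


def frames_to_seconds(num_frames, fps=FPS):
    """Convert a frame count back to a duration in seconds."""
    return num_frames / fps


prompts = ["a person walks forward then turns around", "a person jumps twice"]
durations_s = [6.0, 3.5]
lengths = [seconds_to_frames(s) for s in durations_s]
print(lengths)  # [120, 70]

# Assumption: lengths are frame counts, one per prompt.
# motions = pipe.infer_t2m(prompts, lengths)
```

Under this assumption, the `[120]` in the example above would correspond to a 6-second motion.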

Motion Representation

MoGenTS natively generates HumanML3D-263 at 20 fps. For cross-model comparison with SMPL or MotionStreamer-272 methods, first generate the native 263-dim outputs and then use the validated bridge:

HumanML3D-263 -> SMPL motion_135 via IK refine-80 -> MotionStreamer-272

The bridge is a representation-conversion diagnostic. Metrics computed after the bridge are not in the native HumanML3D-263 space and should not be compared with the numbers reported in the MoGenTS paper.
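
For readers unfamiliar with the 263-dim vector, the sketch below splits one frame into the feature groups of the usual HumanML3D layout for 22 joints (4 root values, 21x3 local positions, 21x6 rotations, 22x3 velocities, 4 foot contacts). The group names are descriptive labels chosen here, and the layout comes from the HumanML3D dataset convention rather than from the hftrainer code, so treat it as an assumption to verify against your data.

```python
NUM_JOINTS = 22

# (name, width) pairs in the order used by the HumanML3D feature vector.
# Names are illustrative labels, not identifiers from hftrainer.
HML263_LAYOUT = (
    ("root_rot_velocity", 1),
    ("root_linear_velocity_xz", 2),
    ("root_height", 1),
    ("local_joint_positions", (NUM_JOINTS - 1) * 3),
    ("local_joint_rotations_6d", (NUM_JOINTS - 1) * 6),
    ("local_joint_velocities", NUM_JOINTS * 3),
    ("foot_contacts", 4),
)


def split_hml263(frame):
    """Split one 263-dim frame (a flat sequence of floats) into named groups."""
    total = sum(width for _, width in HML263_LAYOUT)
    if len(frame) != total:
        raise ValueError(f"expected {total} values, got {len(frame)}")
    groups, start = {}, 0
    for name, width in HML263_LAYOUT:
        groups[name] = list(frame[start:start + width])
        start += width
    return groups


frame = [0.0] * 263
parts = split_hml263(frame)
print({name: len(values) for name, values in parts.items()})
```

The widths sum to 4 + 63 + 126 + 66 + 4 = 263. Because the vector stores root velocities and root-relative features rather than SMPL pose parameters, comparison in SMPL space requires the IK-based bridge.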

Evaluation

Generate under the official HumanML3D test protocol and score with the HumanML3D-263 evaluator:

python3 scripts/eval/mogents_t2m_h3d263.py \
    --model_path checkpoints/mogents/humanml3d \
    --out_dir outputs/evaluation/t2m/humanml3d_official_test/hml263/mogents_ts10_cfg4_rescfg5_seed0

python3 scripts/eval/verify_evaluators.py --which hml263 \
    --hml263-pred outputs/evaluation/t2m/humanml3d_official_test/hml263/mogents_ts10_cfg4_rescfg5_seed0 \
    --out-dir outputs/evaluation/t2m/humanml3d_official_test/hml263/mogents_ts10_cfg4_rescfg5_seed0/metrics

HumanML3D-263 evaluator (native space, n=3970)

Metric JSON: outputs/evaluation/t2m/humanml3d_official_test/hml263/mogents_ts10_cfg4_rescfg5_seed0/metrics/verify_hml263.json.

Metric	hftrainer MoGenTS	GT (real)
FID ↓	0.0806	-
R-Precision Top-1 / 2 / 3 ↑	0.5219 / 0.7128 / 0.8056	0.5135 / 0.7108 / 0.8069
Diversity →	9.4063	9.4527
MM-Dist ↓	2.9290	2.9323
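
R-Precision in the table above is a retrieval accuracy: each text is matched against a pool of candidate motions in a shared embedding space, and a hit is counted when the true motion ranks within the top k. The sketch below shows that definition on a precomputed distance matrix. It is not the hftrainer evaluator; the function name and toy numbers are illustrative, and the true match is assumed to lie on the diagonal.

```python
def r_precision(dist, max_k=3):
    """Top-1..max_k retrieval accuracy for a square distance matrix.

    dist[i][j] is the distance between text i and motion j; the matching
    motion for text i is assumed to sit on the diagonal (j == i).
    """
    n = len(dist)
    hits = [0] * max_k
    for i, row in enumerate(dist):
        # Rank of the true match = number of motions strictly closer than it.
        rank = sum(1 for d in row if d < row[i])
        # A match at rank r counts as a hit for every k > r.
        for k in range(rank, max_k):
            hits[k] += 1
    return [h / n for h in hits]


toy = [
    [0.1, 0.9, 0.8],  # text 0: true motion is the closest
    [0.2, 0.5, 0.9],  # text 1: true motion is second closest
    [0.3, 0.4, 0.7],  # text 2: true motion is third closest
]
print(r_precision(toy))
```

In the toy matrix the three texts rank their true motion first, second, and third, so the result is 1/3, 2/3, and 1 for Top-1, Top-2, and Top-3. The values are cumulative, which is why Top-1 <= Top-2 <= Top-3 in the reported numbers.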

SMPL motion_135 + MotionStreamer-272 evaluator

Convert the same HumanML3D test predictions to SMPL motion_135 and then to MotionStreamer-272:

NUM_GPUS=8 NUM_SHARDS=8 N_REPEATS=20 \
    bash scripts/eval/run_mogents_hml263_to_ms272_chain.sh

The script is restartable: it runs the following stages with --skip-existing, so a rerun only processes outputs that are still missing:

python3 scripts/eval/hml263_to_smpl_ik.py \
    --in-dir outputs/evaluation/t2m/humanml3d_official_test/hml263/mogents_ts10_cfg4_rescfg5_seed0 \
    --out-dir outputs/evaluation/t2m/humanml3d_official_test/motion135/mogents_ts10_cfg4_rescfg5_seed0_ik80 \
    --model-dir ref_repo/MDM/body_models \
    --source-fps 20 --target-fps 30 \
    --floor-align --refine-iters 80 --refine-lr 0.02 \
    --device cuda --skip-existing

python3 scripts/data/convert_motion135_to_h3d272.py \
    --in-dir outputs/evaluation/t2m/humanml3d_official_test/motion135/mogents_ts10_cfg4_rescfg5_seed0_ik80 \
    --out-dir outputs/evaluation/t2m/humanml3d_official_test/ms272/mogents_ts10_cfg4_rescfg5_seed0_ik80 \
    --workers 8 --skip-existing

python3 scripts/eval/verify_evaluators.py --which ms272 \
    --ms272-pred outputs/evaluation/t2m/humanml3d_official_test/ms272/mogents_ts10_cfg4_rescfg5_seed0_ik80 \
    --n-repeats 20 \
    --out-dir outputs/evaluation/t2m/humanml3d_official_test/ms272/mogents_ts10_cfg4_rescfg5_seed0_ik80/metrics
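
The restartable, sharded behavior of the chain can be pictured with the sketch below: each shard takes a strided slice of the input files and skips any file whose output already exists. This illustrates the pattern only and is not the code in the scripts above; the function names, the `.npy` extension, and the `convert` callback are assumptions made for the example.

```python
from pathlib import Path


def shard(items, shard_index, num_shards):
    """Return the items assigned to one shard by round-robin striding."""
    return items[shard_index::num_shards]


def convert_pending(in_dir, out_dir, convert, shard_index=0, num_shards=1):
    """Convert .npy files that have no output yet; return the names written.

    convert is a callable mapping input bytes to output bytes (hypothetical).
    """
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    sources = sorted(in_dir.glob("*.npy"))
    for src in shard(sources, shard_index, num_shards):
        dst = out_dir / src.name
        if dst.exists():
            continue  # skip-existing: a rerun resumes where it stopped
        dst.write_bytes(convert(src.read_bytes()))
        written.append(src.name)
    return written
```

Calling `convert_pending` twice with the same arguments writes nothing the second time, which is what makes an interrupted run safe to restart.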

Metric JSON: outputs/evaluation/t2m/humanml3d_official_test/ms272/mogents_ts10_cfg4_rescfg5_seed0_ik80/metrics/verify_ms272.json.

Run summary: outputs/evaluation/t2m/humanml3d_official_test/ms272/mogents_ts10_cfg4_rescfg5_seed0_ik80/metrics/run_summary.json.

Metric	MoGenTS HML263 -> SMPL135 -> MS272	MS272 GT(real)
FID ↓	113.0856	0.0
R-Precision Top-1 / 2 / 3 ↑	0.4764 / 0.6321 / 0.7099	0.7059 / 0.8569 / 0.9106
Diversity →	25.3033	27.3692
MM-Dist ↓	19.4679	15.0066
Samples used	7392	7392
Missing predictions skipped	0	-

Bridge outputs contain 4012 files at each stage: HML263 predictions, SMPL motion_135 files, and MotionStreamer-272 files. The SMPL IK shard summaries report zero conversion failures and a mean joint-fit MPJPE of roughly 15.2-15.6 mm.
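
MPJPE (mean per-joint position error) is the average Euclidean distance between fitted and target joint positions. A minimal sketch of the metric follows, assuming joint positions are given in metres as nested sequences; the function name is illustrative and this is not the code that produced the shard summaries.

```python
import math


def mpjpe_mm(pred, target):
    """Mean per-joint position error in millimetres.

    pred and target are sequences of frames; each frame is a sequence of
    (x, y, z) joint positions, assumed here to be in metres.
    """
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += math.dist(p, t)  # Euclidean distance for one joint
            count += 1
    if count == 0:
        raise ValueError("no joints to compare")
    return 1000.0 * total / count


# One frame, two joints: errors of 5 cm and 0 cm average to 25 mm.
pred = [[(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]]
target = [[(0.03, 0.04, 0.0), (1.0, 1.0, 1.0)]]
print(round(mpjpe_mm(pred, target), 3))
```

On this scale, a value around 15 mm means the IK-fitted SMPL joints sit about 1.5 cm from the joints recovered from the HumanML3D-263 prediction, on average.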

Implementation Notes

Architecture: MoGenTS generates a 1D auxiliary token stream and a 2D spatial-temporal token grid, then decodes both streams together with the dual RVQ-VAE.
Runtime package: hftrainer/models/motion/mogents/network/ contains only the inference-time model components from the MIT-licensed upstream code.
Artifact loading: MoGenTSBundle.from_pretrained consumes local/HF-style artifacts and stores CLIP once as clip.safetensors; raw upstream .tar checkpoints are supported only through explicit converter/debug paths.
Native representation: generated outputs are HumanML3D-263. Any MotionStreamer-272 or SMPL motion_135 comparison should be produced by the existing representation-conversion pipeline after generation.

Paper: MoGenTS: Motion Generation based on Spatial-Temporal Joint Modeling (2409.17686, published Dec 18, 2024)