QtMeshEditor Text-to-Motion (experimental, #411)

A small, experimental from-scratch text-to-motion model for QtMeshEditor. Given a text prompt (an action keyword), it generates a 40-frame, 22-joint canonical skeletal clip that QtMeshEditor retargets onto an arbitrary humanoid rig.

Status: experimental

Render-verified quality is action-dependent: locomotion such as walk is coherent, while other actions (e.g. wave, run) can drift in the last few frames. The shipped default in QtMeshEditor is the deterministic template-clip retarget; this model is opt-in (--model / GUI checkbox / MCP model:true) and falls back to the template automatically when the model is unavailable or the prompt is out of vocabulary.
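The opt-in and fallback behaviour described above can be sketched as a small decision function. This is only an illustration, not QtMeshEditor's actual code: `pick_backend`, its parameters, and the whitespace keyword match are all hypothetical, and `VOCAB` simply copies the vocabulary listed in this card.

```python
from typing import Sequence

# Action vocabulary as listed in this card.
VOCAB = ["walk", "run", "jog", "jump", "dance", "march", "climb", "kick",
         "punch", "sit", "stretch", "throw", "wave", "boxing", "turn", "forward"]


def pick_backend(prompt: str, vocab: Sequence[str],
                 use_model: bool, model_loaded: bool) -> str:
    """Return "model" or "template" for a prompt (hypothetical helper)."""
    if not use_model:
        return "template"   # the template retarget is the shipped default
    if not model_loaded:
        return "template"   # model unavailable: fall back
    # ASSUMPTION: a prompt is in vocabulary if any whitespace-separated
    # word matches a keyword; the real matching rule is not documented here.
    words = prompt.lower().split()
    if not any(word in vocab for word in words):
        return "template"   # out of vocabulary: fall back
    return "model"
```

With this sketch, `pick_backend("wave", VOCAB, True, True)` selects the model, while an unknown action such as `"swim"` or a missing model file selects the template.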

Training data — permissive only

Trained from scratch on clean, dynamic, single-action windows mined from:

CMU MoCap (commercial-OK) — the bulk.
Quaternius Universal Animation Library (CC0) — supplementary.

AMASS / HumanML3D / KIT-ML were excluded (non-commercial). Idle/near-static and multi-action-labelled windows were filtered out (they dominated the raw set and caused pose collapse).
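A window filter of the kind described above might look like the following sketch. The actual mining code is not part of this card: `keep_window`, the joint-position input, and the `min_mean_speed` threshold are assumptions chosen for illustration.

```python
import numpy as np


def keep_window(positions: np.ndarray, labels: list[str],
                min_mean_speed: float = 0.01) -> bool:
    """Decide whether a training window is kept (hypothetical helper).

    positions: [T, J, 3] joint positions for one window.
    labels:    action labels attached to the window.
    """
    # Drop windows that carry no label or more than one distinct action.
    if len(set(labels)) != 1:
        return False
    # Frame-to-frame displacement of every joint, shape [T-1, J].
    step = np.linalg.norm(np.diff(positions, axis=0), axis=-1)
    # ASSUMPTION: "near-static" means the mean per-frame joint displacement
    # is below a fixed threshold; the real criterion and units are unknown.
    return float(step.mean()) >= min_mean_speed
```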

Architecture

6D-rotation representation, cross-attention decoder with a delta-integration (cumsum) head for temporal continuity, CVAE latent, balanced sampling. ~7.6M params. Exports to ONNX (one forward pass).
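Two of the building blocks named above can be illustrated in NumPy. This is a sketch under assumptions, not the model's code: `rot6d_to_matrix` uses the common Gram-Schmidt construction with the basis vectors as matrix columns (the convention used by this model is not stated), and `integrate_deltas` shows the general idea of a cumulative-sum head, where the network predicts per-frame deltas instead of absolute poses.

```python
import numpy as np


def rot6d_to_matrix(x: np.ndarray) -> np.ndarray:
    """Map 6D rotations [..., 6] to rotation matrices [..., 3, 3]."""
    a1, a2 = x[..., :3], x[..., 3:]
    # First basis vector: normalised a1.
    b1 = a1 / np.linalg.norm(a1, axis=-1, keepdims=True)
    # Second basis vector: a2 with its b1 component removed, normalised.
    a2_perp = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1
    b2 = a2_perp / np.linalg.norm(a2_perp, axis=-1, keepdims=True)
    # Third basis vector completes a right-handed frame.
    b3 = np.cross(b1, b2)
    # ASSUMPTION: b1, b2, b3 are stored as columns.
    return np.stack([b1, b2, b3], axis=-1)


def integrate_deltas(first_pose: np.ndarray, deltas: np.ndarray) -> np.ndarray:
    """Turn a start pose [C] and per-frame deltas [T-1, C] into poses [T, C]."""
    offsets = np.cumsum(deltas, axis=0)  # running sum over time
    return np.concatenate([first_pose[None], first_pose[None] + offsets], axis=0)
```

Because each frame is the previous frame plus a delta, consecutive frames differ by exactly the predicted delta, which is what ties the output together over time.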

I/O contract

input  "tokens" float32 [1, V]     one-hot over the fixed action vocab (see t2m-vocab.json)
input  "seed"   float32 [1, Z]     latent (zeros = mean clip)
output "motion" float32 [1, T, C]  C = 22*10; per joint: [tx,ty,tz, qx,qy,qz,qw, sx,sy,sz]

t2m-vocab.json ships the {vocab, Z, T, C, J, joints} the host needs to build the input and interpret the output.
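A host could build the inputs and unpack the output roughly as follows. This is a sketch with several assumptions: that `vocab` in t2m-vocab.json is a list of keywords in one-hot order, that the `C` axis is joint-major (all 10 values of joint 0 first), and that the ONNX file path is supplied by the caller. The helper names are hypothetical; the `onnxruntime` calls are the library's standard `InferenceSession` API.

```python
import numpy as np


def build_inputs(meta: dict, action: str, seed=None) -> dict:
    """Build the "tokens" and "seed" inputs from the t2m-vocab.json contents."""
    vocab = meta["vocab"]  # ASSUMPTION: list of keywords in one-hot order
    tokens = np.zeros((1, len(vocab)), dtype=np.float32)
    tokens[0, vocab.index(action)] = 1.0  # ValueError if out of vocabulary
    if seed is None:
        seed = np.zeros((1, meta["Z"]), dtype=np.float32)  # zeros = mean clip
    return {"tokens": tokens, "seed": seed}


def split_motion(motion: np.ndarray, meta: dict):
    """Split "motion" [1, T, C] into translation, quaternion and scale."""
    # ASSUMPTION: C is laid out joint-major, 10 values per joint.
    per_joint = motion.reshape(meta["T"], meta["J"], 10)
    translation = per_joint[..., 0:3]   # tx, ty, tz
    quaternion = per_joint[..., 3:7]    # qx, qy, qz, qw
    scale = per_joint[..., 7:10]        # sx, sy, sz
    return translation, quaternion, scale


def run_model(onnx_path: str, meta: dict, action: str):
    """Run one forward pass with onnxruntime and unpack the clip."""
    import onnxruntime as ort  # imported lazily so the helpers above work without it
    session = ort.InferenceSession(onnx_path)
    (motion,) = session.run(["motion"], build_inputs(meta, action))
    return split_motion(motion, meta)
```

Typical use would load the metadata with `json.load` on t2m-vocab.json and call `run_model(path, meta, "walk")`; passing a non-zero `seed` to `build_inputs` would sample a variation instead of the mean clip.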

Vocabulary: walk, run, jog, jump, dance, march, climb, kick, punch, sit, stretch, throw, wave, boxing, turn, forward.
