Constraint model: documentation
What this is, in one sentence
We extended the standard T2M-GPT (text-to-motion) model so that, instead of
just generating the motion the caption implies, it also follows a
user-specified XZ path: a list of waypoints (frame, x, z) that the
generated character has to pass through at the requested frames.
You'll get something like:
Caption: "A neutral paced forward walk from start to finish"
Waypoints: (0, 0.0, 0.0), (100, 1.5, 4.0), (199, 0.0, 8.0)
→ 200 frames of motion where the character walks along that S-curve.
What you already know vs. what's new
You already know T2M-GPT:
1. CLIP encodes the caption → 512-d feature.
2. An autoregressive transformer (block_size 51) emits VQ token ids conditioned on the caption.
3. A frozen VQ decoder turns those tokens into a sequence of 310-d body features (~204 frames max).
4. Decode the body features into joints/positions and render.
The body's root translation in step (3) is whatever the model wants; you cannot ask it to "walk along this curve". That's the gap this work closes.
What we add (the constraint capability)
We bolt on two new modules and keep the T2M GPT itself almost unchanged:
- A trajectory controller: a small recurrent network that takes (caption + waypoints) and rolls out a clean root path P that hits the waypoints.
- A recompose step: purely geometric (no parameters). It strips the body's own root, drops the body onto the controller's path, time-aligns it so the feet don't slide, and rotates it to face the controller's heading.
We also finetune the T2M body in one specific way (sliding-window training) so we can generate motions longer than the model's ~200-frame context window without drift. (This is not fixed yet.)
That's the whole story. The rest of this doc explains why each piece exists, what it's doing, and how to use it.
Why decouple? (why a separate controller instead of finetuning T2M)
The obvious thing would be: feed the waypoints into T2M-GPT as extra tokens and finetune. We tried that. It does not work well:
- T2M's training data has no path-conditioned supervision. Adding a constraint stream just looks like noise to it; gait quality drops sharply and waypoint-following is mediocre.
- The body GPT is huge (CLIP + transformer + VQ) and slow to fine-tune for each new conditioning idea.
- Single-pass length is hard-capped at ~200 frames by block_size; even a good fine-tune can't break that.
Decoupling sidesteps all three problems:
- The path-following problem is easy on its own (a small GRU + a synthesizer for waypoints). It's the natural place to add constraints.
- The body GPT keeps doing what it's good at: mapping caption → gait. We just discard its root translation and feed our own.
- The recompose step is deterministic geometry. No learning, no failure mode you can't reason about.
In short: path and gait are two separable problems. Each model solves the one it's good at.
The pipeline (high-level)
caption + waypoints (frame, x, z)              caption
   |                                              |
   v                                              v
+--------------------------------+   +--------------------------+
| TRAJECTORY CONTROLLER          |   | T2M body GPT             |
| GRU, closed-loop rollout       |   | (the one you know,       |
| conditioned on caption +       |   |  block_size 51,          |
| waypoint memory                |   |  finetuned on            |
|                                |   |  walk+jog +              |
| Outputs (per frame t):         |   |  sliding-window          |
|  P[t]     = root xz on path    |   |  training)               |
|  head[t]  = integrated heading |   +------------+-------------+
|  delta[t] = predicted yaw off. |                | tokens
+--+-----------------------------+                v
   |                                         VQ decoder
   |                                              | body[T, 310]
   |                                              | (gait, foot contacts,
   |                                              |  joint poses, but
   |                                              |  ALSO a root path the
   |                                              |  model invented; we
   |                                              |  throw that root away)
   |                                              |
   |                       A-WARP (time-warp body to match controller speed
   |                       so the gait phase stays consistent with travel;
   |                       feet don't slide along the new path)
   |                                              |
   v                                              v
                      RECOMPOSE
       * body root   <- P[t] (controller's path)
       * body facing <- head[t] + face_sign·delta[t]
         (the "head-facing rule", see below)
                          |
                          v
         final motion clip (kimodo-369 layout)
That kimodo-369 is the same body format T2M-GPT uses internally:
[0:3] root pos · [5:95] joint local · [95:275] joint rot6d · [275:365] joint angular vel · [365:369] foot contacts.
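As a minimal sketch, the layout above can be written down as named slices so downstream code does not hard-code offsets. The offsets are copied from the layout line; the dictionary name `KIMODO_369`, the key names, and `split_feats` are made up for this example and are not part of the repo. Indices 3 and 4 are not described in the layout line, so the sketch leaves them unnamed.

```python
import numpy as np

# Offsets copied from the layout description; key names are illustrative only.
KIMODO_369 = {
    "root_pos": slice(0, 3),
    "joint_local": slice(5, 95),
    "joint_rot6d": slice(95, 275),
    "joint_ang_vel": slice(275, 365),
    "foot_contacts": slice(365, 369),
}

def split_feats(f369: np.ndarray) -> dict:
    """Split a [T, 369] feature array into named views (no copies)."""
    if f369.ndim != 2 or f369.shape[1] != 369:
        raise ValueError(f"expected [T, 369], got {f369.shape}")
    return {name: f369[:, sl] for name, sl in KIMODO_369.items()}

parts = split_feats(np.zeros((200, 369), dtype=np.float32))
root_xz = parts["root_pos"][:, [0, 2]]  # x and z columns of the root position
```

The slice widths are consistent with 30 joints (90 = 30 × 3 for local positions and angular velocities, 180 = 30 × 6 for rot6d).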
What each block is doing
1. Trajectory controller (the new piece)
Goal: given a caption and a list of (frame, x, z) waypoints, produce a smooth root path P[T, 2] that hits each waypoint at the requested frame.
Architecture: a small closed-loop GRU. Each step it sees:
- The current root xz,
- The CLIP caption feature (so it can pick the right speed/style: a walk caption ≠ a jog caption),
- A "constraint memory": the not-yet-visited waypoints + their target frames, encoded via a tiny attention block.
It outputs the next root step and two extra signals that the recompose needs:
- head[t] = the controller's integrated heading: literally atan2(Δz, Δx) of its own path step, accumulated. Clean and noise-free at any speed (this matters for low-speed jitter, see §4).
- delta[t] = a predicted yaw offset from the travel direction (so the character can face slightly inward on curves, lean into turns, etc.).
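To make the two signals concrete, here is a small sketch of how a travel heading can be derived from a root path and combined with a yaw offset. It is not the controller's code: `heading_from_path` and `facing_yaw` are hypothetical helpers, "accumulated" is interpreted here as unwrapping the angle so it stays continuous, and the path is assumed to have at least two frames. The combination `head + face_sign * delta` and the default `face_sign = -1.0` are taken from the pipeline diagram and the inference snippet.

```python
import numpy as np

def heading_from_path(P: np.ndarray) -> np.ndarray:
    """Per-frame travel heading in radians from a root path P of shape [T, 2] (x, z)."""
    step = np.diff(P, axis=0)                  # [T-1, 2] per-frame (dx, dz)
    head = np.arctan2(step[:, 1], step[:, 0])  # atan2(dz, dx)
    head = np.unwrap(head)                     # remove 2*pi jumps between frames
    return np.concatenate([head[:1], head])    # repeat the first value to get T entries

def facing_yaw(head: np.ndarray, delta: np.ndarray, face_sign: float = -1.0) -> np.ndarray:
    """Body facing as travel heading plus a signed yaw offset."""
    return head + face_sign * delta

# Straight walk along +z at 4 cm per frame.
P = np.stack([np.zeros(5), 0.04 * np.arange(5)], axis=1)
head = heading_from_path(P)
facing = facing_yaw(head, delta=np.full(5, 0.1))
```

Computing the heading from a real body's root at low speed would divide noise by a tiny step, which is the jitter the text refers to; the controller's own path steps do not have that problem.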
Training data: real motion clips' GT paths, with synthesized waypoints drawn from those paths (single keyframe, multi-keyframe, sparse trajectory, dense trajectory, full trajectory). The caption used is the real clip's caption. So the controller learns "what kind of path is plausible for this caption while passing through these waypoints".
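The waypoint synthesis described above can be sketched as sampling frames from a ground-truth path. This is an illustration only: `sample_waypoints`, the mode names, the number of keyframes in `multi`, and the strides used for sparse and dense trajectories are all assumptions, not the values used in training. The sketch assumes a path of at least 5 frames.

```python
import numpy as np

def sample_waypoints(P: np.ndarray, mode: str, rng: np.random.Generator):
    """Pick (frame, x, z) waypoints from a ground-truth root path P of shape [T, 2]."""
    T = len(P)
    if mode == "single":
        frames = [int(rng.integers(1, T))]
    elif mode == "multi":
        k = int(rng.integers(2, 5))  # 2 to 4 keyframes (assumed range)
        frames = sorted(rng.choice(np.arange(1, T), size=k, replace=False).tolist())
    elif mode == "sparse_traj":
        frames = list(range(0, T, 20))  # assumed stride
    elif mode == "dense_traj":
        frames = list(range(0, T, 5))   # assumed stride
    elif mode == "full_traj":
        frames = list(range(T))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [(int(f), float(P[f, 0]), float(P[f, 1])) for f in frames]

rng = np.random.default_rng(0)
P = np.stack([np.zeros(200), np.linspace(0.0, 8.0, 200)], axis=1)
sparse = sample_waypoints(P, "sparse_traj", rng)
multi = sample_waypoints(P, "multi", rng)
```

Because every waypoint is read off the real path, the pair (caption, waypoints) is always achievable by construction.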
Why it's a separate model: see "Why decouple" above. Also: the controller is small (~2 MB), trains in hours, and we can iterate on it (v9 → v18) without touching the body GPT.
2. T2M body GPT
This is the model you already know:
- Text2Motion_Transformer, 4 layers / 512 d / 8 heads, block_size 51, num_vq 512.
- Input: CLIP caption feature (only). No path tokens, no constraints. In the proven config, the constraint-conditioning inputs were zero-init and unused; the path-following lives entirely in the controller + recompose.
- Output: VQ token ids, decoded into 310-d body features.
What about a bigger block_size?
We tried it: GPT_t2m_window_walkjog_big_v1, block_size 128, 6 layers, 768-d, trained from scratch for 100k iters. It did not help. Sustained length was still ~787 frames, basically identical to the small model. The bottleneck is not the context window; it's the EOS-emission bias from a training set where the median clip is only 141 frames. So we stick with the 51-tok windowed model.
3. Recompose (the new geometry step)
The body GPT generated a 310-d body sequence. Its root translation is whatever the model invented, so we discard it. We keep the body's:
- Joint poses (limbs swinging, posture).
- Foot contacts.
- Body facing relative to its own travel direction.
…and put them onto the controller's path. Two technical details:
A-warp (time-warp the body to controller speed)
If the controller's path is 10 m long over 200 frames but the body's own travel is 6 m over 200 frames, naively dropping joints onto the path makes the feet slide (the legs cycle at "6 m / 200 frames" speed but the root moves "10 m / 200 frames"). The A-warp resamples the body in time so its arc-length distribution matches the controller's. After A-warp the gait phase aligns with travel, so there is no foot skating.
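One way to read "arc-length distribution matches" is sketched below: normalize the cumulative distance travelled by the body's own root and by the controller path, then pick, for every controller frame, the (fractional) body frame at the same fraction of progress. This is an assumption about the A-warp, not the repo's implementation; `cumulative_arclength` and `a_warp` are hypothetical names, and plain linear interpolation of the features is a simplification (rot6d and binary foot contacts would need more care in practice).

```python
import numpy as np

def cumulative_arclength(path: np.ndarray) -> np.ndarray:
    """Cumulative distance travelled along a [T, 2] path, starting at 0."""
    seg = np.linalg.norm(np.diff(path, axis=0), axis=1)
    return np.concatenate([[0.0], np.cumsum(seg)])

def a_warp(body: np.ndarray, body_root_xz: np.ndarray, ctrl_path: np.ndarray,
           eps: float = 1e-8) -> np.ndarray:
    """Resample body [Tb, D] in time so its progress along its own root path
    follows the progress of ctrl_path [T, 2]. Returns an array of shape [T, D]."""
    s_body = cumulative_arclength(body_root_xz)
    s_ctrl = cumulative_arclength(ctrl_path)
    if s_body[-1] < eps or s_ctrl[-1] < eps:
        # No travel on one side: fall back to a uniform time resample.
        u = np.linspace(0.0, len(body) - 1, len(ctrl_path))
    else:
        q_body = s_body / s_body[-1]   # body progress in [0, 1]
        q_ctrl = s_ctrl / s_ctrl[-1]   # controller progress in [0, 1]
        # Invert body progress -> fractional body frame for each controller frame.
        u = np.interp(q_ctrl, q_body, np.arange(len(body), dtype=np.float64))
    lo = np.floor(u).astype(int)
    hi = np.minimum(lo + 1, len(body) - 1)
    w = (u - lo)[:, None]
    return (1.0 - w) * body[lo] + w * body[hi]

# Body walks uniformly for 5 frames; controller covers 1/4 of its path in the
# first step and 3/4 in the second.
body = np.arange(5, dtype=np.float64)[:, None]            # feature = frame index
body_root = np.stack([np.zeros(5), np.arange(5.0)], axis=1)
ctrl = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 4.0]])
warped = a_warp(body, body_root, ctrl)
```

In the toy example the warped sequence picks body frames 0, 1 and 4, so the gait advances slowly while the root moves slowly and quickly while it moves quickly. Matching absolute distance instead of normalized progress would drop the two divisions, at the cost of running out of body frames when the controller path is longer than the body's own travel.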
The stack to use right now
| Role | Component | File | md5 |
|---|---|---|---|
| Trajectory controller | ctrl_head_v18 | ctrl_head_v18/ctrl_best.pth | 7c57187f330d0f382464d825422e6287 |
| Body GPT (T2M, walk+jog, sliding-window finetune) | GPT_t2m_window_walkjog_v1 | GPT_t2m_window_walkjog_v1/net_last.pth | 43c16f5dbc8ddd0475032460a80136df |
| VQ-VAE (HFbig) | vqvae.pth | - | 85a2adf008fc6c6b49af011cfc8316bc |
| Dataset stats | Mean.npy, Std.npy, ActiveDims.npy, ConstFill.npy | - | - |
| Engine (glue code) | demo/decoupled_engine.py::DecoupledT2MEngine | repo, branch refactor | - |
All ckpts are on HuggingFace at https://huggingface.co/mpilligua/car-t2m-decoupled-v18 (see "Checkpoints" below for direct URLs and a one-line huggingface-cli download).
Inference (minimal Python)
import torch, numpy as np
import clip
from demo.decoupled_engine import DecoupledT2MEngine
from models.constraints.representation import ConstraintList, ROOT_XZ, JOINT_ROOT_XZ
dev = torch.device("cuda")
DR = "<path>/bones_stats" # holds Mean.npy / Std.npy / ActiveDims.npy / ConstFill.npy
eng = DecoupledT2MEngine(
    body_ckpt = "<path>/GPT_t2m_window_walkjog_v1/net_last.pth",
    vq_ckpt   = "<path>/vqvae.pth",
    ctrl_ckpt = "<path>/ctrl_head_v18/ctrl_best.pth",
    device    = dev,
    data_root = DR,
    face_sign = -1.0,  # convention; do NOT change
)
# CLIP for the caption feature (used by the controller; the body uses its own CLIP inside the bundle).
cm, _ = clip.load("ViT-B/32", device=dev, jit=False)
clip.model.convert_weights(cm); cm.eval()
# 1. Build a constraint list: (frame, x_meters, z_meters) waypoints.
# +z is forward in the dataset convention. Frames are integer indices in [0, T-1].
cl = ConstraintList.empty()
for fr, x, z in [(0, 0.0, 0.0), (100, 1.5, 4.0), (199, 0.0, 8.0)]:
    cl = cl.append(int(fr), JOINT_ROOT_XZ, ROOT_XZ, np.array([x, z], np.float32))
# 2. Encode the caption.
caption = "a neutral paced forward walk from start to finish"
feat = cm.encode_text(clip.tokenize([caption], truncate=True).to(dev)).float() # [1, 512]
# 3. Decode.
f369 = eng.feats369(feat, cl, T=200, caption=caption) # np.float32 [T, 369]
# f369 is kimodo-369; root xz = f369[:, [0, 2]]. For a quick top-down render:
# from viz.render_topdown_mp4 import _feats_369_to_xy_skeleton
# joints, parents = _feats_369_to_xy_skeleton(f369) # joints [T, 30, 3] (x, y_up, z)
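A quick sanity check after decoding is to measure how far the generated root is from each waypoint at its requested frame. The helper below is a sketch, not part of the repo: `waypoint_errors` is a hypothetical name, and it assumes the root columns of the output are in the same units (meters) and world frame as the waypoints. It is exercised here on a synthetic array so it runs without the checkpoints.

```python
import numpy as np

def waypoint_errors(f369: np.ndarray, waypoints) -> list:
    """Distance between the generated root xz and each (frame, x, z) waypoint."""
    root_xz = f369[:, [0, 2]]  # root x and z, as noted in the inference snippet
    errors = []
    for fr, x, z in waypoints:
        target = np.array([x, z], dtype=np.float64)
        errors.append(float(np.linalg.norm(root_xz[fr] - target)))
    return errors

# Synthetic stand-in for an engine output: a straight walk to z = 8 m.
fake = np.zeros((200, 369), dtype=np.float32)
fake[:, 2] = np.linspace(0.0, 8.0, 200)
errs = waypoint_errors(fake, [(0, 0.0, 0.0), (199, 0.0, 8.0), (100, 1.5, float(fake[100, 2]))])
```

With a real output, replace `fake` with the `f369` array returned by the engine and pass the same waypoint tuples used to build the constraint list.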
Caption-only mode (no controller, no constraints)
If you just want the T2M body without any path conditioning:
f369 = eng.free369(caption="A person jogs forward at a steady pace.")
# Returns the model's OWN motion + OWN root path (natural EOS).
Captions that work well
The body GPT is caption-distribution sensitive: it produces clean motion mainly for captions whose phrasing it saw a lot during training. Validated picks:
- "a neutral paced forward walk from start to finish" (282 training clips, the gold standard)
- "A person is walking forward at a steady pace."
- "A person jogs forward at a steady pace."
- "person jogs forward without speeding up to a run"
Terse prompts ("walk", "jog") and free-form descriptive prompts often EOS early. The viser GUI ships a dropdown of validated captions.
Viser demo
The interactive 3D demo is demo/viser_decoupled_app.py. It wraps DecoupledT2MEngine with a viser server.
Running it
# On a GPU node (matches the T2M-GPT conda env this project ships with):
python demo/viser_decoupled_app.py
# Listens on 0.0.0.0:8081
# From your laptop, port-forward and open http://localhost:8081
ssh -L 8081:<gpu-node>:8081 <login-host>
The demo expects local paths to the checkpoints. Edit lines ~32 and 37-40 of demo/viser_decoupled_app.py:
DR = "<your path to bones_stats>"
eng = DecoupledT2MEngine(
    "<your path>/GPT_t2m_window_walkjog_v1/net_last.pth",
    "<your path>/vqvae.pth",
    "<your path>/ctrl_head_v18/ctrl_best.pth",
    dev, data_root=DR)
What the GUI lets you do
- Preset constraints (5 modes) derived from a validated real clip (neutral_walk_180_R_001__A073): single, multi, small_traj, big_traj, full_traj. Pick these first; they're in-distribution and produce the cleanest motion.
- Manual waypoints: click anywhere on the XZ ground plane to fill (x, z), pick a frame index, then press "+ add waypoint". Each waypoint has its own frame, so you can build multi-keyframe constraints. Be mindful of which frame you assign to each constraint; if the timing doesn't make sense, the model will not perform well.
- Trajectory builder: pick a shape (straight, curve_left, curve_right, zigzag, s_curve, arc90), then choose #waypoints, total forward length (m), and lateral amplitude (m). The demo spaces the waypoints equally along the path's arc length and assigns frames evenly across the clip. This produces roughly constant speed, which keeps the controller's input in-distribution.
- Training-caption dropdown of real walk/jog captions, plus a free-text caption box.
- NO constraints (model path) checkbox: run the T2M body alone, no controller, no constraints. Useful for seeing what the body wants to do unaided.
- Max frames (20-600): caps the controller horizon. Actual rendered length = min(body sliding-window length, controller path length).
- Autoplay + manual frame slider; constraints rendered as spheres (red = preset, green = manual / trajectory builder).
- Camera: drag = orbit, right-drag = pan, scroll = zoom (Y-up, XZ ground).
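The trajectory builder's spacing rule (equal arc length between waypoints, frames spread evenly over the clip) can be sketched as follows. This is not the demo's code: `build_waypoints` and the two shape functions are made up for illustration, the sine used for the S-curve is an assumed shape, and the sketch expects at least two waypoints.

```python
import numpy as np

def build_waypoints(shape_fn, n_waypoints: int, length_m: float, amp_m: float,
                    T: int, dense: int = 2000):
    """Return (frame, x, z) waypoints equally spaced in arc length along a curve.

    shape_fn maps normalized forward progress in [0, 1] to a lateral offset in [-1, 1].
    """
    z = np.linspace(0.0, length_m, dense)
    x = amp_m * shape_fn(z / length_m)
    seg = np.hypot(np.diff(x), np.diff(z))
    s = np.concatenate([[0.0], np.cumsum(seg)])          # cumulative arc length
    targets = np.linspace(0.0, s[-1], n_waypoints)       # equal arc-length spacing
    wx = np.interp(targets, s, x)
    wz = np.interp(targets, s, z)
    frames = np.round(np.linspace(0, T - 1, n_waypoints)).astype(int)  # even in time
    return [(int(f), float(a), float(b)) for f, a, b in zip(frames, wx, wz)]

def straight(u):
    return np.zeros_like(u)

def s_curve(u):
    return np.sin(2.0 * np.pi * u)  # assumed S shape: one full sine period

line = build_waypoints(straight, 5, length_m=8.0, amp_m=0.0, T=200)
curve = build_waypoints(s_curve, 9, length_m=8.0, amp_m=1.0, T=200)
```

Equal arc length per equal time step is what gives the roughly constant speed mentioned above; spacing waypoints evenly in z instead would make the character speed up on the curved parts.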
The rendering joint extractor is viz/render_topdown_mp4._feats_369_to_xy_skeleton β use that, not Pau's feats_369_to_motion (the latter does an axis swap that's wrong for our recompose output).
Checkpoints
Everything is on the HF model repo: https://huggingface.co/mpilligua/car-t2m-decoupled-v18.
One-line download (recommended):
pip install huggingface_hub
huggingface-cli download mpilligua/car-t2m-decoupled-v18 \
--local-dir ./decoupled_assets --local-dir-use-symlinks=False
# Then point the demo paths at:
# ./decoupled_assets/ctrl_head_v18/ctrl_best.pth
# ./decoupled_assets/GPT_t2m_window_walkjog_v1/net_last.pth
# ./decoupled_assets/vqvae.pth
# DR = ./decoupled_assets/bones_stats
Individual file URLs:
| File | URL | md5 |
|---|---|---|
| Controller v18 | https://huggingface.co/mpilligua/car-t2m-decoupled-v18/resolve/main/ctrl_head_v18/ctrl_best.pth | 7c57187f330d0f382464d825422e6287 |
| Body GPT (windowed) | https://huggingface.co/mpilligua/car-t2m-decoupled-v18/resolve/main/GPT_t2m_window_walkjog_v1/net_last.pth | 43c16f5dbc8ddd0475032460a80136df |
| VQ-VAE | https://huggingface.co/mpilligua/car-t2m-decoupled-v18/resolve/main/vqvae.pth | 85a2adf008fc6c6b49af011cfc8316bc |
| Mean.npy | https://huggingface.co/mpilligua/car-t2m-decoupled-v18/resolve/main/bones_stats/Mean.npy | - |
| Std.npy | https://huggingface.co/mpilligua/car-t2m-decoupled-v18/resolve/main/bones_stats/Std.npy | - |
| ActiveDims.npy | https://huggingface.co/mpilligua/car-t2m-decoupled-v18/resolve/main/bones_stats/ActiveDims.npy | - |
| ConstFill.npy | https://huggingface.co/mpilligua/car-t2m-decoupled-v18/resolve/main/bones_stats/ConstFill.npy | - |
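After downloading, the md5 values in the table can be checked with a few lines of standard-library Python. The checksums and relative paths below are copied from the table; `md5_of`, `verify`, and the `./decoupled_assets` directory (taken from the download command above) are just this sketch's choices.

```python
import hashlib
from pathlib import Path

# Relative paths and checksums copied from the checkpoint table.
EXPECTED_MD5 = {
    "ctrl_head_v18/ctrl_best.pth": "7c57187f330d0f382464d825422e6287",
    "GPT_t2m_window_walkjog_v1/net_last.pth": "43c16f5dbc8ddd0475032460a80136df",
    "vqvae.pth": "85a2adf008fc6c6b49af011cfc8316bc",
}

def md5_of(path, chunk: int = 1 << 20) -> str:
    """Hex md5 of a file, read in chunks so large checkpoints fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def verify(root) -> dict:
    """Map each expected file to 'ok', 'mismatch', or 'missing' under root."""
    report = {}
    for rel, expected in EXPECTED_MD5.items():
        path = Path(root) / rel
        if not path.is_file():
            report[rel] = "missing"
        else:
            report[rel] = "ok" if md5_of(path) == expected else "mismatch"
    return report

print(verify("./decoupled_assets"))
```

The stats files have no published checksum in the table, so they are not covered here.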
Limitations & known facts
- Single-pass length ≈ 200 frames (block_size 51 × 4 frames/token). Longer motion comes from DecoupledT2MEngine's sliding-window continuation at inference; that's why the windowed body finetune is important (it was trained for this).
- Caption distribution sensitive. Stick to validated captions; terse/odd phrasings get short or poor motion.
- Jog clips in the training data are shorter than walk clips, so even with the windowed finetune the jog gait emits EOS earlier in the natural-sampling path. Sliding-window inference compensates.
- Don't re-introduce atan2(Δpath) in the recompose: that brings back the low-speed heading jitter that v17 + the head-facing rule fixed.
- Bigger arch was a wash. A from-scratch block_size 128 / 768-d / 6-layer model gave the same sustained length as the 51-tok windowed finetune. The training-data EOS bias is the real bottleneck.
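The single-pass figure is just block_size times frames per token, which the following sketch spells out. The two constants come from this document; `min_passes` is a hypothetical helper that gives only a lower bound, since it assumes windows do not overlap and the actual overlap used by the sliding-window continuation is not stated here.

```python
import math

BLOCK_SIZE = 51        # tokens per pass (body GPT config)
FRAMES_PER_TOKEN = 4   # frames decoded per VQ token

def single_pass_frames(block_size: int = BLOCK_SIZE,
                       frames_per_token: int = FRAMES_PER_TOKEN) -> int:
    """Maximum number of frames one pass of the body GPT can cover."""
    return block_size * frames_per_token

def min_passes(total_frames: int) -> int:
    """Lower bound on passes for a clip; overlapping windows need more."""
    return math.ceil(total_frames / single_pass_frames())

print(single_pass_frames())  # 204
print(min_passes(600))       # 3
```

So the 600-frame maximum offered by the demo needs at least three passes of the body GPT.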
Where to look in the repo
| What | Where |
|---|---|
| Engine (glue: controller + body + recompose) | demo/decoupled_engine.py (class DecoupledT2MEngine) |
| Interactive viser demo | demo/viser_decoupled_app.py |
| Trusted joint extractor for render | viz/render_topdown_mp4.py (_feats_369_to_xy_skeleton) |
| Reference standalone merge eval | viz/eval_merge_decoupled.py (controller-only render, no demo loop) |
| Constraint list dataclass | models/constraints/representation.py |
| Controller code | training/train_trajectory_controller.py (RCP-owned) |
| Body GPT training (T2M) | training/train_gpt.py |
| Sliding-window training switch | env CART2M_WINDOW=1 (in data/dataset_TM_train.py); uncapped token cache via CART2M_LONG_CACHE_LEN=100000 (in data/dataset_tokenize.py). Defaults preserve original behavior. |
Branch: refactor. Demo currently runs from this branch's HEAD.