Constraint model — documentation

What this is, in one sentence

We extended the standard T2M-GPT (text-to-motion) model so that, instead of just generating the motion the caption implies, it also follows a user-specified XZ path — a list of waypoints (frame, x, z) that the generated character has to pass through at the requested frames.

You'll get something like:

Caption: "A neutral paced forward walk from start to finish"
Waypoints: (0, 0.0, 0.0), (100, 1.5, 4.0), (199, 0.0, 8.0)
→ 200 frames of motion where the character walks along the curved path through those waypoints.

What you already know vs. what's new

You already know T2M-GPT:

1. CLIP encodes the caption → 512-d feature.
2. An autoregressive transformer (block_size 51) emits VQ token ids conditioned on the caption.
3. A frozen VQ decoder turns those tokens into a sequence of 310-d body features (~204 frames max).
4. Decode the body features into joints/positions and render.

The body's root translation in step (3) is whatever the model wants — you cannot ask it to "walk along this curve". That's the gap this work closes.

What we add (the constraint capability)

We bolt on two new modules and keep the T2M GPT itself almost unchanged:

A trajectory controller — a small recurrent network that takes (caption + waypoints) and rolls out a clean root path P that hits the waypoints.
A recompose step — purely geometric (no parameters): it strips the body's own root, drops the body onto the controller's path, time-aligns it so the feet don't slide, and rotates it to face the controller's heading.

We also finetune the T2M body in one specific way (sliding-window training) so we can generate motions longer than the model's ~200-frame context window without drift. (This is not fixed yet.)

That's the whole story. The rest of this doc explains why each piece exists, what it's doing, and how to use it.

Why decouple? (why a separate controller instead of finetuning T2M)

The obvious thing would be: feed the waypoints into T2M-GPT as extra tokens and finetune. We tried that. It does not work well:

T2M's training data has no path-conditioned supervision. Adding a constraint stream just looks like noise to it; gait quality drops sharply and waypoint-following is mediocre.
The body GPT is huge (CLIP + transformer + VQ) and slow to fine-tune for each new conditioning idea.
Single-pass length is hard-capped at ~200 frames by block_size — even a good fine-tune can't break that.

Decoupling sidesteps all three problems:

The path-following problem is easy on its own (a small GRU + a synthesizer for waypoints). It's the natural place to add constraints.
The body GPT keeps doing what it's good at — mapping caption → gait. We just discard its root translation and feed our own.
The recompose step is deterministic geometry. No learning, no failure mode you can't reason about.

In short: path and gait are two separable problems. Each model solves the one it's good at.

The pipeline (high-level)

                       caption
                          │
        waypoints         │
        (frame, x, z)     │
              │           │
              ▼           ▼
   ┌────────────────────────────────┐         ┌────────────────────────┐
   │ TRAJECTORY CONTROLLER          │         │ T2M body GPT           │
   │   GRU, closed-loop rollout     │         │   (the one you know,   │
   │   conditioned on caption +     │         │    block_size 51,      │
   │   waypoint memory              │         │    finetuned on        │
   │                                │         │    walk+jog +          │
   │ Outputs (per frame t):         │         │    sliding-window      │
   │   P[t]   = root xz on path     │         │    training)           │
   │   head[t]= integrated heading  │         └──────────┬─────────────┘
   │   delta[t]= predicted yaw off. │                    │ tokens
   └─┬──────────────────────────────┘                    ▼
     │                                              VQ decoder
     │                                                   │  body[T, 310]
     │                                                   │  (gait, foot contacts,
     │                                                   │   joint poses — but
     │                                                   │   ALSO a root path the
     │                                                   │   model invented; we
     │                                                   │   throw that root away)
     │                                                   │
     │                            A-WARP (time-warp body to match controller speed
     │                            so the gait phase stays consistent with travel — feet
     │                            don't slide along the new path)
     │                                                   │
     ▼                                                   ▼
                                  RECOMPOSE
                       • body root  ← P[t] (controller's path)
                       • body facing ← head[t] + face_sign·delta[t]
                                         (the "head-facing rule" — see below)
                                          │
                                          ▼
                          final motion clip (kimodo-369 layout)

The kimodo-369 layout is the same body format T2M-GPT uses internally, given here as half-open index ranges: [0:3] root pos · [5:95] joint local · [95:275] joint rot6d · [275:365] joint angular vel · [365:369] foot contacts.
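To make the layout concrete, here is a minimal sketch that splits a [T, 369] array into named views using the index ranges from the layout line above. The dictionary keys and the helper name `split_kimodo369` are illustrative, not names from the repo. Dims 3 and 4 are not listed in the layout line, so the sketch leaves them out rather than guessing what they hold.

```python
import numpy as np

# Index ranges copied from the layout line. Dims [3:5] are not documented
# there, so they are intentionally absent from this table.
KIMODO_369 = {
    "root_pos":      slice(0, 3),      # root position (x, y_up, z)
    "joint_local":   slice(5, 95),     # 90 dims
    "joint_rot6d":   slice(95, 275),   # 180 dims
    "joint_ang_vel": slice(275, 365),  # 90 dims
    "foot_contacts": slice(365, 369),  # 4 dims
}

def split_kimodo369(f369: np.ndarray) -> dict:
    """Return one view per listed block of a [T, 369] feature array."""
    if f369.ndim != 2 or f369.shape[1] != 369:
        raise ValueError(f"expected shape [T, 369], got {f369.shape}")
    return {name: f369[:, sl] for name, sl in KIMODO_369.items()}

# Stand-in data; a real array would come from the engine's output.
parts = split_kimodo369(np.zeros((200, 369), np.float32))
root_xz = parts["root_pos"][:, [0, 2]]   # ground-plane root path, [T, 2]
```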

What each block is doing

1. Trajectory controller (the new piece)

Goal: given a caption and a list of (frame, x, z) waypoints, produce a smooth root path P[T, 2] that hits each waypoint at the requested frame.

Architecture: a small closed-loop GRU. Each step it sees:

The current root xz,
The CLIP caption feature (so it can pick the right speed/style — a walk caption ≠ a jog caption),
A "constraint memory": the not-yet-visited waypoints + their target frames, encoded via a tiny attention block.

It outputs the next root step and two extra signals that the recompose needs:

head[t] = the controller's integrated heading: literally atan2(Δz, Δx) of its own path step, accumulated. Clean and noise-free at any speed (this matters for low-speed jitter, see "Limitations & known facts").
delta[t] = a predicted yaw offset from the travel direction (so the character can face slightly inward on curves, lean into turns, etc.).
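The two signals can be illustrated with a small NumPy sketch. It assumes that "accumulated" means the per-step atan2(Δz, Δx) angle is unwrapped into a continuous heading, which is one plausible reading and not confirmed by the controller code. The combination `head + face_sign * delta` is taken from the pipeline diagram; the function names are hypothetical. Note that this operates on the controller's own clean path, not on the recomposed body path.

```python
import numpy as np

def heading_from_path(P: np.ndarray) -> np.ndarray:
    """Continuous travel heading (radians) for a root path P[T, 2] = (x, z)."""
    step = np.diff(P, axis=0)                   # per-frame path step, [T-1, 2]
    ang = np.arctan2(step[:, 1], step[:, 0])    # atan2(dz, dx)
    ang = np.concatenate([ang, ang[-1:]])       # repeat the last value to get T entries
    return np.unwrap(ang)                       # remove 2*pi jumps (assumed meaning of "accumulated")

def body_facing(head: np.ndarray, delta: np.ndarray, face_sign: float = -1.0) -> np.ndarray:
    """Facing rule from the diagram: head[t] + face_sign * delta[t]."""
    return head + face_sign * delta

# Straight walk along +z at 0.05 m/frame: heading is pi/2 at every frame.
P = np.stack([np.zeros(50), 0.05 * np.arange(50)], axis=1)
head = heading_from_path(P)
face = body_facing(head, delta=np.full(50, 0.1))
```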

Training data: real motion clips' GT paths, with synthesized waypoints drawn from those paths (single keyframe, multi-keyframe, sparse trajectory, dense trajectory, full trajectory). The caption used is the real clip's caption. So the controller learns "what kind of path is plausible for this caption while passing through these waypoints".
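As a rough illustration of the waypoint synthesis, the sketch below draws (frame, x, z) waypoints from a ground-truth root path in each of the five modes named above. The mode strings, the waypoint counts (3 keyframes, 8 sparse points) and the dense stride of 4 are assumptions made for the example; the actual training code may sample differently.

```python
import numpy as np

def synth_waypoints(path_xz: np.ndarray, mode: str, rng: np.random.Generator):
    """Sample (frame, x, z) waypoints that lie exactly on a GT path [T, 2]."""
    T = len(path_xz)
    if mode == "single":          # one keyframe somewhere after the start
        frames = rng.integers(1, T, size=1)
    elif mode == "multi":         # a few scattered keyframes (count is an assumption)
        frames = rng.choice(np.arange(1, T), size=min(3, T - 1), replace=False)
    elif mode == "sparse_traj":   # evenly spaced, few points (count is an assumption)
        frames = np.rint(np.linspace(0, T - 1, num=min(8, T))).astype(int)
    elif mode == "dense_traj":    # every 4th frame (stride is an assumption)
        frames = np.arange(0, T, 4)
    elif mode == "full_traj":     # every frame
        frames = np.arange(T)
    else:
        raise ValueError(f"unknown mode: {mode}")
    frames = np.unique(frames)    # sorted, no duplicates
    return [(int(f), float(path_xz[f, 0]), float(path_xz[f, 1])) for f in frames]

rng = np.random.default_rng(0)
gt_path = np.stack([np.zeros(120), np.linspace(0.0, 6.0, 120)], axis=1)
wps_single = synth_waypoints(gt_path, "single", rng)
wps_full = synth_waypoints(gt_path, "full_traj", rng)
```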

Why it's a separate model: see "Why decouple" above. Also: the controller is small (~2 MB), trains in hours, and we can iterate on it (v9 → v18) without touching the body GPT.

2. T2M body GPT

This is the model you already know:

Text2Motion_Transformer, 4 layers / 512 d / 8 heads, block_size 51, num_vq 512.
Input: CLIP caption feature (only). No path tokens, no constraints (in the proven config, the constraint-conditioning inputs were zero-init and unused — the path-following lives entirely in the controller + recompose).
Output: VQ token ids, decoded into 310-d body features.

What about a bigger block_size?

We tried it: GPT_t2m_window_walkjog_big_v1, block_size 128, 6 layers, 768-d, trained from scratch for 100 k iters. It did not help. Sustained length was still ~787 frames — basically identical to the small model. The bottleneck is not the context window; it's the EOS-emission bias from a training set where the median clip is only 141 frames. So we stick with the 51-tok windowed model.

3. Recompose (the new geometry step)

The body GPT generated a 310-d body sequence. Its root translation is whatever the model invented — we discard it. We keep the body's:

Joint poses (limbs swinging, posture).
Foot contacts.
Body facing relative to its own travel direction.

…and put them onto the controller's path. The key technical detail is the time alignment:

A-warp (time-warp the body to controller speed)

If the controller's path is 10 m long over 200 frames but the body's own travel is 6 m over 200 frames, naively dropping joints onto the path makes the feet slide (the legs cycle at "6 m / 200 frames" speed but the root moves "10 m / 200 frames"). The A-warp resamples the body in time so its arc-length distribution matches the controller's. After A-warp the gait phase aligns with travel → no foot skating.
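One way to picture the A-warp is as a lookup by travelled distance: for each output frame, find the (fractional) body frame at which the body had covered the same distance as the controller has by then. The sketch below implements that reading with NumPy. It is an interpretation of the description above, not the engine's code: the function names are hypothetical, it assumes the body clip travels at least as far as the controller path (it clamps otherwise), and it blends neighbouring frames linearly, which is crude for rot6d and foot-contact channels.

```python
import numpy as np

def arc_length(xz: np.ndarray) -> np.ndarray:
    """Cumulative travelled distance along a root path [T, 2], starting at 0."""
    seg = np.linalg.norm(np.diff(xz, axis=0), axis=1)
    return np.concatenate([[0.0], np.cumsum(seg)])

def a_warp(body: np.ndarray, body_root_xz: np.ndarray, ctrl_xz: np.ndarray):
    """Resample body[Tb, D] in time so its travelled distance matches ctrl_xz[Tc, 2]."""
    frames = np.arange(len(body), dtype=np.float64)
    # Tiny ramp keeps the lookup table strictly increasing when the body stands still.
    s_body = arc_length(body_root_xz) + 1e-9 * frames
    s_ctrl = arc_length(ctrl_xz)
    tau = np.interp(s_ctrl, s_body, frames)     # fractional body frame per output frame
    lo = np.floor(tau).astype(int)
    hi = np.minimum(lo + 1, len(body) - 1)
    w = (tau - lo)[:, None]
    return (1.0 - w) * body[lo] + w * body[hi], tau

# Body walks 0.03 m/frame for 400 frames; controller covers 10 m in 200 frames.
body_root = np.stack([np.zeros(400), 0.03 * np.arange(400)], axis=1)
body = np.arange(400, dtype=np.float64)[:, None]    # toy feature: the frame index itself
ctrl = np.stack([np.zeros(200), np.linspace(0.0, 10.0, 200)], axis=1)
warped, tau = a_warp(body, body_root, ctrl)
```

In this toy case the controller moves faster than the body's own gait, so the warp plays the body back faster than real time: the last output frame reads from roughly body frame 333.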

The stack to use right now

Role	Component	File	md5
Trajectory controller	ctrl_head_v18	`ctrl_head_v18/ctrl_best.pth`	`7c57187f330d0f382464d825422e6287`
Body GPT (T2M, walk+jog, sliding-window finetune)	GPT_t2m_window_walkjog_v1	`GPT_t2m_window_walkjog_v1/net_last.pth`	`43c16f5dbc8ddd0475032460a80136df`
VQ-VAE	HFbig	`vqvae.pth`	`85a2adf008fc6c6b49af011cfc8316bc`
Dataset stats	—	`Mean.npy`, `Std.npy`, `ActiveDims.npy`, `ConstFill.npy`	—
Engine (glue code)	`DecoupledT2MEngine`	`demo/decoupled_engine.py` (repo, branch `refactor`)	—

All ckpts are on HuggingFace at https://huggingface.co/mpilligua/car-t2m-decoupled-v18 (see "Checkpoints" below for direct URLs and a one-line huggingface-cli download).

Inference (minimal Python)

import torch, numpy as np
import clip
from demo.decoupled_engine import DecoupledT2MEngine
from models.constraints.representation import ConstraintList, ROOT_XZ, JOINT_ROOT_XZ

dev = torch.device("cuda")
DR  = "<path>/bones_stats"                # holds Mean.npy / Std.npy / ActiveDims.npy / ConstFill.npy
eng = DecoupledT2MEngine(
    body_ckpt = "<path>/GPT_t2m_window_walkjog_v1/net_last.pth",
    vq_ckpt   = "<path>/vqvae.pth",
    ctrl_ckpt = "<path>/ctrl_head_v18/ctrl_best.pth",
    device    = dev,
    data_root = DR,
    face_sign = -1.0,                    # convention; do NOT change
)

# CLIP for the caption feature (used by the controller; the body uses its own CLIP inside the bundle).
cm, _ = clip.load("ViT-B/32", device=dev, jit=False)
clip.model.convert_weights(cm); cm.eval()

# 1. Build a constraint list: (frame, x_meters, z_meters) waypoints.
#    +z is forward in the dataset convention. Frames are integer indices in [0, T-1].
cl = ConstraintList.empty()
for fr, x, z in [(0, 0.0, 0.0), (100, 1.5, 4.0), (199, 0.0, 8.0)]:
    cl = cl.append(int(fr), JOINT_ROOT_XZ, ROOT_XZ, np.array([x, z], np.float32))

# 2. Encode the caption (inference only, so skip gradient tracking).
caption = "a neutral paced forward walk from start to finish"
with torch.no_grad():
    feat = cm.encode_text(clip.tokenize([caption], truncate=True).to(dev)).float()  # [1, 512]

# 3. Decode.
f369 = eng.feats369(feat, cl, T=200, caption=caption)   # np.float32 [T, 369]

# f369 is kimodo-369; root xz = f369[:, [0, 2]]. For a quick top-down render:
# from viz.render_topdown_mp4 import _feats_369_to_xy_skeleton
# joints, parents = _feats_369_to_xy_skeleton(f369)        # joints [T, 30, 3]  (x, y_up, z)
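A quick sanity check after decoding is to measure how far the generated root is from each requested waypoint at its frame. The helper below is a sketch with a hypothetical name (`waypoint_errors`); it relies only on the fact stated above that root xz lives in columns 0 and 2 of the kimodo-369 array. The example uses a synthetic straight-line array in place of real engine output.

```python
import numpy as np

def waypoint_errors(f369: np.ndarray, waypoints) -> list:
    """Ground-plane distance (m) between the root and each (frame, x, z) target."""
    root_xz = f369[:, [0, 2]]
    errors = []
    for frame, x, z in waypoints:
        if not 0 <= frame < len(f369):
            raise IndexError(f"waypoint frame {frame} is outside [0, {len(f369) - 1}]")
        errors.append(float(np.linalg.norm(root_xz[frame] - np.array([x, z]))))
    return errors

# Stand-in for engine output: a straight walk along +z that ignores the side step.
T = 200
fake_f369 = np.zeros((T, 369), np.float32)
fake_f369[:, 2] = np.linspace(0.0, 8.0, T)
targets = [(0, 0.0, 0.0), (100, 1.5, 4.0), (199, 0.0, 8.0)]
errs = waypoint_errors(fake_f369, targets)
```

Here the first and last waypoints are hit and the middle one is missed by about 1.5 m, which is what a path that skipped the lateral excursion would look like.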

Caption-only mode (no controller, no constraints)

If you just want the T2M body without any path conditioning:

f369 = eng.free369(caption="A person jogs forward at a steady pace.")
# Returns the model's OWN motion + OWN root path (natural EOS).

Captions that work well

The body GPT is caption-distribution sensitive — it produces clean motion mainly for captions whose phrasing it saw a lot during training. Validated picks:

"a neutral paced forward walk from start to finish" (282 training clips, the gold standard)
"A person is walking forward at a steady pace."
"A person jogs forward at a steady pace."
"person jogs forward without speeding up to a run"

Terse prompts ("walk", "jog") and free-form descriptive prompts often EOS early. The viser GUI ships a dropdown of validated captions.

Viser demo

The interactive 3D demo is demo/viser_decoupled_app.py. It wraps DecoupledT2MEngine with a viser server.

Running it

# On a GPU node (matches the T2M-GPT conda env this project ships with):
python demo/viser_decoupled_app.py
# Listens on 0.0.0.0:8081

# From your laptop, port-forward and open http://localhost:8081
ssh -L 8081:<gpu-node>:8081 <login-host>

The demo expects local paths to the checkpoints. Edit lines ~32 and 37–40 of demo/viser_decoupled_app.py:

DR  = "<your path to bones_stats>"
eng = DecoupledT2MEngine(
    "<your path>/GPT_t2m_window_walkjog_v1/net_last.pth",
    "<your path>/vqvae.pth",
    "<your path>/ctrl_head_v18/ctrl_best.pth",
    dev, data_root=DR)

What the GUI lets you do

Preset constraints (5 modes) derived from a validated real clip (neutral_walk_180_R_001__A073): single, multi, small_traj, big_traj, full_traj. Pick these first; they're in-distribution and produce the cleanest motion.
Manual waypoints: click anywhere on the XZ ground plane → it fills (x, z); pick a frame index; press + add waypoint. Each waypoint has its own frame, so you can build multi-keyframe constraints. Be mindful of which frame you assign to each waypoint: if the timing doesn't make sense (for example, a distant point only a few frames after the previous one), the model will not perform well.
Trajectory builder: pick a shape (straight, curve_left, curve_right, zigzag, s_curve, arc90), choose #waypoints, total forward length (m), lateral amplitude (m). The demo lays the waypoints equally along the path's arc length and assigns frames evenly across the clip — produces ~constant speed, which keeps the controller's input in-distribution.
Training-caption dropdown of real walk/jog captions, plus a free-text caption box.
NO constraints (model path) checkbox: run the T2M body alone, no controller, no constraints. Useful for seeing what the body wants to do unaided.
Max frames (20–600): caps the controller horizon. Actual rendered length = min(body sliding-window length, controller path length).
Autoplay + manual frame slider; constraints rendered as spheres (red = preset, green = manual / trajectory builder).
Camera: drag = orbit, right-drag = pan, scroll = zoom (Y-up, XZ ground).
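The trajectory builder's spacing rule (waypoints equally spaced along arc length, frames spread evenly across the clip) can be sketched as follows. Only two of the listed shapes are included, and the sine formula used for `s_curve` is an assumption about what that shape looks like; the function name `build_trajectory` is hypothetical and not the demo's API.

```python
import numpy as np

def build_trajectory(shape: str, n_waypoints: int, length_m: float,
                     amplitude_m: float, T: int):
    """Return (frame, x, z) waypoints at equal arc-length steps and even frame steps."""
    if n_waypoints < 2:
        raise ValueError("need at least 2 waypoints")
    u = np.linspace(0.0, 1.0, 2001)                  # dense parameter along the shape
    z = length_m * u                                 # +z is forward
    if shape == "straight":
        x = np.zeros_like(u)
    elif shape == "s_curve":                         # assumed: one full sine period
        x = amplitude_m * np.sin(2.0 * np.pi * u)
    else:
        raise ValueError(f"shape not covered by this sketch: {shape}")
    s = np.concatenate([[0.0], np.cumsum(np.hypot(np.diff(x), np.diff(z)))])
    targets = np.linspace(0.0, s[-1], n_waypoints)   # equal arc-length spacing
    xw, zw = np.interp(targets, s, x), np.interp(targets, s, z)
    frames = np.rint(np.linspace(0, T - 1, n_waypoints)).astype(int)
    return [(int(f), float(a), float(b)) for f, a, b in zip(frames, xw, zw)]

wps = build_trajectory("s_curve", n_waypoints=5, length_m=8.0, amplitude_m=1.0, T=200)
line = build_trajectory("straight", n_waypoints=5, length_m=8.0, amplitude_m=0.0, T=200)
```

Equal arc-length steps paired with equal frame steps imply roughly constant speed between waypoints, which is the property the demo relies on to stay in-distribution.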

The rendering joint extractor is viz/render_topdown_mp4._feats_369_to_xy_skeleton — use that, not Pau's feats_369_to_motion (the latter does an axis swap that's wrong for our recompose output).

Checkpoints

Everything is on the HF model repo: https://huggingface.co/mpilligua/car-t2m-decoupled-v18.

One-line download (recommended):

pip install huggingface_hub
huggingface-cli download mpilligua/car-t2m-decoupled-v18 \
    --local-dir ./decoupled_assets --local-dir-use-symlinks=False
# Then point the demo paths at:
#   ./decoupled_assets/ctrl_head_v18/ctrl_best.pth
#   ./decoupled_assets/GPT_t2m_window_walkjog_v1/net_last.pth
#   ./decoupled_assets/vqvae.pth
#   DR = ./decoupled_assets/bones_stats

Individual file URLs:

File	URL	md5
Controller v18	https://huggingface.co/mpilligua/car-t2m-decoupled-v18/resolve/main/ctrl_head_v18/ctrl_best.pth	`7c57187f330d0f382464d825422e6287`
Body GPT (windowed)	https://huggingface.co/mpilligua/car-t2m-decoupled-v18/resolve/main/GPT_t2m_window_walkjog_v1/net_last.pth	`43c16f5dbc8ddd0475032460a80136df`
VQ-VAE	https://huggingface.co/mpilligua/car-t2m-decoupled-v18/resolve/main/vqvae.pth	`85a2adf008fc6c6b49af011cfc8316bc`
Mean.npy	https://huggingface.co/mpilligua/car-t2m-decoupled-v18/resolve/main/bones_stats/Mean.npy	—
Std.npy	https://huggingface.co/mpilligua/car-t2m-decoupled-v18/resolve/main/bones_stats/Std.npy	—
ActiveDims.npy	https://huggingface.co/mpilligua/car-t2m-decoupled-v18/resolve/main/bones_stats/ActiveDims.npy	—
ConstFill.npy	https://huggingface.co/mpilligua/car-t2m-decoupled-v18/resolve/main/bones_stats/ConstFill.npy	—
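After downloading, the checkpoints can be checked against the md5 values in the table. The sketch below uses only the Python standard library; the relative paths assume the `./decoupled_assets` layout from the one-line download above, and the helper names are illustrative.

```python
import hashlib
from pathlib import Path

# md5 values copied from the table above.
EXPECTED_MD5 = {
    "ctrl_head_v18/ctrl_best.pth": "7c57187f330d0f382464d825422e6287",
    "GPT_t2m_window_walkjog_v1/net_last.pth": "43c16f5dbc8ddd0475032460a80136df",
    "vqvae.pth": "85a2adf008fc6c6b49af011cfc8316bc",
}

def md5sum(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_assets(root) -> dict:
    """Map each expected file to 'ok', 'mismatch' or 'missing'."""
    report = {}
    for rel, expected in EXPECTED_MD5.items():
        path = Path(root) / rel
        if not path.is_file():
            report[rel] = "missing"
        else:
            report[rel] = "ok" if md5sum(path) == expected else "mismatch"
    return report

report = verify_assets("./decoupled_assets")
```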

Limitations & known facts

Single-pass length ≈ 200 frames, set by block_size 51 × 4 frames/token. Longer motion comes from DecoupledT2MEngine's sliding-window continuation at inference; that's why the windowed body finetune is important (it was trained for this).
Caption distribution sensitive. Stick to validated captions; terse/odd phrasings get short or poor motion.
Jog clips in the training data are shorter than walk clips, so even with the windowed finetune the jog gait emits EOS earlier in the natural-sampling path. Sliding-window inference compensates.
Don't re-introduce atan2(Δpath) in the recompose — that brings back the low-speed heading jitter that v17 + the head-facing rule fixed.
Bigger arch was a wash. A from-scratch block_size 128 / 768-d / 6-layer model gave the same sustained length as the 51-tok windowed finetune. The training-data EOS bias is the real bottleneck.
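For intuition about the sliding-window continuation mentioned in the first item, here is a generic sketch of the idea: keep sampling tokens, but only ever show the model the most recent window. This is not the engine's implementation. The `sample_next` callable is a hypothetical stand-in for the body GPT, EOS handling is omitted, and the context of block_size - 1 tokens assumes one position is taken by the caption condition.

```python
from typing import Callable, List

BLOCK_SIZE = 51          # from the model config above
FRAMES_PER_TOKEN = 4     # from the first limitation above

def sliding_window_generate(sample_next: Callable[[List[int]], int],
                            n_tokens: int,
                            context: int = BLOCK_SIZE - 1) -> List[int]:
    """Generate n_tokens ids, conditioning each step on at most `context` past tokens."""
    tokens: List[int] = []
    while len(tokens) < n_tokens:
        window = tokens[-context:]          # most recent tokens only
        tokens.append(sample_next(window))
    return tokens

# Toy sampler that just reports how much context it was given.
tokens = sliding_window_generate(lambda window: len(window), n_tokens=120)
n_frames = len(tokens) * FRAMES_PER_TOKEN   # 480 frames, well past one window
```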

Where to look in the repo

What	Where
Engine (glue: controller + body + recompose)	`demo/decoupled_engine.py` (class `DecoupledT2MEngine`)
Interactive viser demo	`demo/viser_decoupled_app.py`
Trusted joint extractor for render	`viz/render_topdown_mp4.py` (`_feats_369_to_xy_skeleton`)
Reference standalone merge eval	`viz/eval_merge_decoupled.py` (controller-only render, no demo loop)
Constraint list dataclass	`models/constraints/representation.py`
Controller code	`training/train_trajectory_controller.py` (RCP-owned)
Body GPT training (T2M)	`training/train_gpt.py`
Sliding-window training switch	env `CART2M_WINDOW=1` (in `data/dataset_TM_train.py`); uncapped token cache via `CART2M_LONG_CACHE_LEN=100000` (in `data/dataset_tokenize.py`). Defaults preserve original behavior.

Branch: refactor. Demo currently runs from this branch's HEAD.
