GR00T-N1.7 × GEAR-SONIC — prompt-conditioned whole-body motion for Unitree G1

⚠️ Proof-of-concept (derisk checkpoint) — read first

This model memorizes 7 motions, 1 episode each. Generalization to new prompts or new motions is untested. The headline MSE 0.0011 is train-set reconstruction error (no held-out split), i.e. evidence of memorization — not generalization. Every demo clip below was recorded with deviation/fall terminations disabled (RELAX=1, thresholds = 99), so the clips show prompt → token tracking, not validated balance or fall-resistance. It is not a deployable general humanoid VLA. Full caveats in Honest scope & limitations.

A single GR00T N1.7 vision-language-action model, fine-tuned to emit the 64-dim FSQ motion token of the GEAR-SONIC whole-body controller (WBC). Given one prompt, the model emits tokens that the SONIC WBC decodes into a full-body G1 motion; the WBC supplies balance and recovery. There is no per-joint behavior cloning.

One prompt-conditioned model, not 7 separate ones. prompt → GR00T → 64-d token → SONIC.decode → 29-DoF motion.

🔗 Code, training & validation scaffolding: vitorcen/isaaclab-experience (LeSONIC/scripts/gear_sonic_live.sh, gear_sonic/data/vla_live_injector.py)

🎥 Closed-loop demos (GR00T-in-the-loop, no C++/DDS)

Each clip below shows the live closed loop: at every control step the policy reads the G1's ego camera, proprioception and the prompt; GR00T predicts a token chunk over ZMQ; and SONIC's WBC decodes it to joint targets. The green skeleton is the reference target; the robot follows GR00T's token. The quoted string after each motion name is the prompt.

⚠️ All clips run with RELAX=1 — the deviation/fall terminations (anchor_pos, ee_body_pos, anchor_ori_full, foot_pos_xyz) are set to threshold 99, i.e. effectively off, so the full motion plays without the normal 1–2 s deviation reset. These clips therefore demonstrate prompt → token tracking under relaxed termination, not fall-resistance. "Self-sustaining" below means continues without bootstrap, not "guaranteed upright under strict termination" (that is untested — see scope).
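To make concrete why a threshold of 99 amounts to switching a termination off, here is a minimal Python sketch of threshold-based termination checks. The four term names are taken from the list above; the `STRICT` values, the `should_terminate` helper and the deviation numbers are illustrative assumptions, not the actual SONIC configuration.

```python
# Hypothetical strict thresholds (illustrative values only, not SONIC's real config).
STRICT = {
    "anchor_pos": 0.5,
    "ee_body_pos": 0.5,
    "anchor_ori_full": 1.0,
    "foot_pos_xyz": 0.3,
}

# RELAX=1 as described above: every threshold is raised to 99.
RELAXED = {name: 99.0 for name in STRICT}


def should_terminate(deviations, thresholds):
    """Return the names of all terms whose deviation exceeds its threshold."""
    return [name for name, dev in deviations.items() if dev > thresholds[name]]


# A made-up control step in which the robot has drifted far from the reference.
deviations = {
    "anchor_pos": 0.8,
    "ee_body_pos": 0.2,
    "anchor_ori_full": 1.4,
    "foot_pos_xyz": 0.1,
}

print(should_terminate(deviations, STRICT))   # ['anchor_pos', 'anchor_ori_full']
print(should_terminate(deviations, RELAXED))  # []
```

Under the relaxed thresholds no realistic deviation triggers a reset, which is why the clips cannot be read as evidence of stability.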

Self-sustaining motions (pure live GR00T, no bootstrap)

macarena — "dance the macarena"

dance — "dance"

squat — "squat"

lunge — "do a forward lunge"

One-shot / traverse motions — require bootstrap

These do not self-start from rest (see scope); the clips were produced with the bootstrap workaround: replay tokens from an open-loop dump to enter the motion, then hand over to live GR00T.
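The bootstrap hand-off can be sketched as a simple switch on the step counter. This is a hedged illustration only: `select_token`, `dummy_live_policy` and the stand-in dump are hypothetical names, and it assumes the bootstrap length (80 in the run command) counts control steps, which the source does not state explicitly.

```python
TOKEN_DIM = 64  # size of the SONIC motion token


def select_token(step, bootstrap_steps, dump_tokens, live_policy, obs):
    """Replay recorded tokens for the first `bootstrap_steps` steps, then go live."""
    if step < min(bootstrap_steps, len(dump_tokens)):
        return dump_tokens[step], "bootstrap"
    return live_policy(obs), "live"


def dummy_live_policy(obs):
    """Stand-in for a live GR00T query (hypothetical); returns a constant token."""
    return [1.0] * TOKEN_DIM


# Stand-in for an open-loop token dump: 120 recorded tokens of 64 dims each.
dump_tokens = [[0.0] * TOKEN_DIM for _ in range(120)]

sources = [
    select_token(t, 80, dump_tokens, dummy_live_policy, obs=None)[1]
    for t in range(100)
]
print(sources.count("bootstrap"), sources.count("live"))  # 80 20
```

The replay only carries the robot out of the ambiguous rest pose; from the hand-off step onward every token comes from the live policy.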

kick — "kick" (bootstrap)

walk — "walk and turn around" (bootstrap)

jump — "jump on one leg" (borderline; bootstrap)

What this is


| Item | Details |
|---|---|
| Base | NVIDIA Isaac GR00T N1.7 (`Gr00tN1d7`), VLM backbone `nvidia/Cosmos-Reason2-2B` (frozen), action_horizon 40 |
| Task | `prompt + ego-view + proprioception → action.motion_token (64-d FSQ)` |
| Embodiment | `unitree_g1_sonic` (29-DoF G1 + projected gravity) |
| Decoder | GEAR-SONIC `UniversalTokenModule.decode("g1_dyn", …)` (released WBC checkpoint) |
| Training | DiT action head + projector fine-tuned, Cosmos VLM frozen; bf16, adafactor, grad-ckpt, single RTX 4090, 8000 steps |
| Data | 7 SONIC deploy-demo motions (dance, macarena, kick, lunge, squat, jump, walk), 1 episode each, recorded as `(token, ego_rgb, proprio)` |
| Weights | currently stored fp32 (≈12 GB, 3 shards); training was bf16, and a bf16 re-save (≈6 GB) is planned |

Results — and how to read them

ℹ️ These numbers are measured on the 7 training trajectories with no held-out split. They quantify fit to the training set, not generalization. Read each with its caveat.

Open-loop token MSE = 0.0011 vs ground-truth tokens, ~35× below the predict-per-motion-mean baseline (0.039). Caveat: this is train-set reconstruction error — strong evidence the model memorized the 7 token sequences. The per-motion-mean baseline is weak for a 7-way set; a real generalization test (held-out frames / held-out motion / paraphrased prompt) is planned, not yet run.
Prompt is a control knob: with a fixed observation, swapping only the prompt shifts the predicted token by L2 3.2–5.4 (token std ≈ 0.18). Caveat: the observation dominates the prompt (obs L2 ≈ 11.8 vs prompt L2 ≈ 3–5) — cleanest prompt-driven selection is from a neutral rest pose; mid-motion the prompt's marginal influence is small.
Closed-loop self-sustains 4 of 7 motions (squat, lunge, dance, macarena) with pure live GR00T; kick, walk and jump need bootstrap, with jump borderline (hence "4–5 / 7"). Caveat: measured under RELAX=1 (fall/deviation termination disabled). The pelvis stays ≈ 0.79 m in these clips, but this is not a fall-resistance claim, because the fall criterion was switched off. Stability under strict termination, success-rate, fall-rate and phase-drift are untested (planned: strict-termination re-eval over multiple seeds, bootstrap / no-bootstrap reported separately).
Checkpoint-8000 was chosen by monotonically-decreasing open-loop MSE on the training trajectories — there is no validation curve, so the selection is on train-set fit.
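For readers who want to reproduce the style of the metrics above, the following sketch shows one way the open-loop token MSE, the per-motion-mean baseline and the prompt-swap L2 shift could be computed. It is an assumption about the procedure, not the repo's evaluation code: the function names are hypothetical and the data is synthetic noise, so the printed numbers will not match the reported ones.

```python
import numpy as np


def token_mse(pred, target):
    """Mean squared error over all frames and all 64 token dimensions."""
    return float(np.mean((pred - target) ** 2))


def per_motion_mean_baseline(targets_by_motion):
    """MSE of predicting each motion's mean token at every frame of that motion."""
    sq_err, count = 0.0, 0
    for tokens in targets_by_motion.values():  # tokens: (frames, 64)
        mean_token = tokens.mean(axis=0, keepdims=True)
        sq_err += float(np.sum((tokens - mean_token) ** 2))
        count += tokens.size
    return sq_err / count


def prompt_shift_l2(token_a, token_b):
    """L2 distance between tokens predicted for two prompts on the same observation."""
    return float(np.linalg.norm(token_a - token_b))


# Synthetic stand-in data: two "motions" with token std 0.18 and a small prediction error.
rng = np.random.default_rng(0)
targets = {name: rng.normal(0.0, 0.18, size=(50, 64)) for name in ("squat", "kick")}
preds = {name: t + rng.normal(0.0, 0.03, size=t.shape) for name, t in targets.items()}

model_mse = token_mse(
    np.concatenate(list(preds.values())),
    np.concatenate(list(targets.values())),
)
baseline_mse = per_motion_mean_baseline(targets)
print(f"model {model_mse:.4f}  baseline {baseline_mse:.4f}  "
      f"ratio {baseline_mse / model_mse:.1f}x")
```

With the reported values the ratio is 0.039 / 0.0011 ≈ 35, matching the "~35×" figure; because both numbers come from the training trajectories, the ratio measures fit, not generalization.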

⚠️ Honest scope & limitations

This is a derisk / proof-of-concept checkpoint, not a general humanoid VLA. Please read before using:

Training-set only — generalization untested. Trained on 7 motions, 1 episode each; new-motion / new-prompt held-out generalization has not been evaluated. Treat as memorization of these 7 skills. With 1.62 B trainable params over ~3815 frames, the action head can simply memorize the trajectories.
Memoryless single-frame policy (architectural). GR00T predicts a 40-step token chunk from a single frame of obs — no history, no phase/progress signal. At the standing rest pose it can't tell "about to kick" from "just finished", so single-shot return-to-rest motions (kick, walk, jump) settle into standing and need the bootstrap workaround (no retraining): BOOTSTRAP=80 bash LeSONIC/scripts/gear_sonic_live.sh kick. This is an observation-design limitation, not a tuning one — adding data alone will not fix it.
FSQ token modelling caveat. The 64-dim token is discrete (FSQ, 16 levels/dim); the action head regresses it as a continuous flow-matching target. On-distribution this lands near the grid; off-distribution predictions can fall between grid points. A discrete objective (per-dim CE / a pre-quantization latent passed through SONIC's quantizer) is the cleaner formulation (future work).
The balance is the WBC's, not this model's. "Doesn't fall" in the demos is largely the pretrained SONIC controller; the VLA's contribution is not yet decoupled by ablation (WBC-replay / mean-token / prompt-shuffle / proprio-only / vision-off all planned).
Ego camera looks at the ground. It is rigidly attached to the pelvis, faces forward, and sees mostly ground rather than the robot's own body, so the visual signal for dance/pose is weak; the policy leans on proprioception + prompt.
Not runnable standalone. Requires the GEAR-SONIC WBC checkpoint + the unitree_g1_sonic embodiment config + the live injector + a running GR00T ZMQ server (GR00T and SONIC can't share a process — transformers version conflict). See the repo for env lockfile, ZMQ wire schema, token extraction and eval commands.
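The FSQ caveat above can be illustrated with a small snapping check: a continuous prediction is compared against the nearest of 16 levels per dimension. The grid used here (16 uniformly spaced values on [-1, 1]) is an assumption made for illustration; SONIC's actual quantizer grid may differ, and `snap` and `off_grid_distance` are hypothetical helpers.

```python
LEVELS = 16  # levels per dimension, as stated above

# Assumed grid: 16 uniformly spaced values on [-1, 1] (SONIC's real grid may differ).
GRID = [-1.0 + 2.0 * i / (LEVELS - 1) for i in range(LEVELS)]


def snap(value):
    """Return the grid point nearest to a continuous prediction."""
    return min(GRID, key=lambda g: abs(g - value))


def off_grid_distance(token):
    """Largest per-dimension distance between a token and its snapped version."""
    return max(abs(v - snap(v)) for v in token)


# A short 3-dim example instead of the full 64 dims, for readability.
on_grid = [GRID[3], GRID[7], GRID[12]]
between = [(GRID[3] + GRID[4]) / 2, GRID[7], GRID[12]]

print(off_grid_distance(on_grid))            # 0.0
print(round(off_grid_distance(between), 4))  # 0.0667
```

Tracking this distance on live predictions would be one way to notice when the regression head drifts off the grid, which is the failure mode a discrete objective would rule out by construction.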

How to run (closed loop)

```bash
# 1. start the GR00T inference server (in the Isaac-GR00T venv)
python -m gr00t.eval.run_gr00t_server \
    --model_path <this_checkpoint> --embodiment_tag unitree_g1_sonic --port 5555

# 2. drive SONIC's WBC live, in the Isaac viewer (isaaclab env)
bash LeSONIC/scripts/gear_sonic_live.sh macarena          # self-sustaining: pure live (RELAX=1 by default)
BOOTSTRAP=80 bash LeSONIC/scripts/gear_sonic_live.sh kick  # one-shot: trigger bootstrap then live
# strict termination (real stability test): RELAX=0 bash LeSONIC/scripts/gear_sonic_live.sh macarena
```
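Because step 2 depends on the server from step 1 already listening, a small readiness check can save a confusing first run. The sketch below uses only the Python standard library and assumes the server listens on a TCP port (5555 in the command above) on the local machine; `wait_for_port` is a hypothetical helper, and a successful TCP connect only shows that something is listening, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=30.0, poll_s=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError while nothing is listening.
            with socket.create_connection((host, port), timeout=poll_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_s)


# Example use before launching gear_sonic_live.sh:
#   if not wait_for_port("127.0.0.1", 5555):
#       raise SystemExit("GR00T server not reachable on port 5555")
```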

Architecture

prompt ─┐
ego cam ─┼─► GR00T N1.7 ──(ZMQ)──► 64-d motion_token ──► SONIC WBC decode ──► 29-DoF G1 (balanced)
proprio ─┘     (one model)                                (frozen controller)

Skills live in the token space, not in GR00T. GR00T only learns prompt → token; the pretrained WBC supplies balance and recovery.
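The data flow in the diagram can be sketched as a chunked control loop. Everything model-related here is a stand-in: `fake_groot_predict` and `fake_sonic_decode` are hypothetical placeholders for the ZMQ query and the SONIC decoder, and `replan_every` is an assumed knob, since the source does not say how many tokens of each 40-step chunk are executed before GR00T is queried again.

```python
ACTION_HORIZON = 40  # tokens per predicted chunk
TOKEN_DIM = 64       # size of one motion token
NUM_JOINTS = 29      # G1 degrees of freedom


def fake_groot_predict(prompt, ego_rgb, proprio):
    """Stand-in for the ZMQ call to the GR00T server: returns a 40 x 64 token chunk."""
    return [[0.0] * TOKEN_DIM for _ in range(ACTION_HORIZON)]


def fake_sonic_decode(token, proprio):
    """Stand-in for SONIC's decoder: maps one 64-d token to 29 joint targets."""
    return [0.0] * NUM_JOINTS


def run_closed_loop(prompt, num_steps, replan_every=ACTION_HORIZON):
    """Run the loop and return how many times the policy was queried."""
    proprio, ego_rgb = [0.0] * NUM_JOINTS, None
    chunk, idx, queries = [], 0, 0
    for _ in range(num_steps):
        # Query GR00T when the chunk is used up or the replan interval is reached.
        if idx >= len(chunk) or idx >= replan_every:
            chunk = fake_groot_predict(prompt, ego_rgb, proprio)
            idx, queries = 0, queries + 1
        joint_targets = fake_sonic_decode(chunk[idx], proprio)
        idx += 1
        proprio = joint_targets  # placeholder for stepping the simulator
    return queries


print(run_closed_loop("dance the macarena", num_steps=100))                  # 3
print(run_closed_loop("dance the macarena", num_steps=100, replan_every=1))  # 100
```

Note that each query sees only the current observation, which is the memoryless behaviour described in the limitations: nothing in the loop carries motion phase from one query to the next.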

Citation / links

Validation & critique write-ups (single-file HTML, in the code repo): LeSONIC/doc/groot_sonic_wbc_route.html, LeSONIC/doc/sonic_vla_closeloop_validation.html, LeSONIC/doc/sonic_vla_critique_roadmap.html
GEAR-SONIC WBC: https://github.com/NVlabs/GR00T-WholeBodyControl
NVIDIA Isaac GR00T: https://github.com/NVIDIA/Isaac-GR00T