GR00T-N1.7 × GEAR-SONIC: prompt-conditioned whole-body motion for Unitree G1
⚠️ Proof-of-concept (derisk checkpoint): read first
This model memorizes 7 motions, 1 episode each. Generalization to new prompts or new motions is untested. The headline MSE 0.0011 is train-set reconstruction error (no held-out split), i.e. evidence of memorization, not generalization. Every demo clip below was recorded with deviation/fall terminations disabled (RELAX=1, thresholds = 99), so the clips show prompt → token tracking, not validated balance or fall-resistance. It is not a deployable general humanoid VLA. Full caveats in Honest scope & limitations.
A single GR00T N1.7 vision-language-action model fine-tuned to emit the 64-dim FSQ motion token of the GEAR-SONIC whole-body controller (WBC). One model, one prompt → the SONIC WBC decodes a full-body G1 motion (the WBC supplies balance/recovery). No per-joint behavior cloning.
One prompt-conditioned model, not 7 separate ones.
prompt → GR00T → 64-d token → SONIC.decode → 29-DoF motion.
Code, training & validation scaffolding: vitorcen/isaaclab-experience
(LeSONIC/scripts/gear_sonic_live.sh, gear_sonic/data/vla_live_injector.py)
Closed-loop demos (GR00T-in-the-loop, no C++/DDS)
Each clip below is the live closed loop: every control step the policy reads the G1's ego
camera + proprioception + the prompt, GR00T predicts a token chunk over ZMQ, and SONIC's WBC
decodes it to joint targets. The green skeleton is the reference target; the robot follows
GR00T's token. <motion> = the prompt.
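The loop described above can be sketched as follows. This is a minimal illustration, not the repo's injector: `query_policy` and `decode_token` are hypothetical stand-ins for the GR00T ZMQ round trip and the SONIC WBC decoder, the toy "simulation" simply assumes the robot reaches each target, and the `replan_every` parameter is an assumption added to show how a 40-step chunk could be consumed.

```python
import numpy as np

TOKEN_DIM = 64        # FSQ motion token width (from the model card)
ACTION_HORIZON = 40   # tokens predicted per GR00T call (from the model card)
NUM_JOINTS = 29       # G1 degrees of freedom (from the model card)


def query_policy(ego_rgb, proprio, prompt):
    """Hypothetical stand-in for the GR00T ZMQ call: returns a token chunk."""
    rng = np.random.default_rng(len(prompt))
    return rng.uniform(-1.0, 1.0, size=(ACTION_HORIZON, TOKEN_DIM))


def decode_token(token, proprio):
    """Hypothetical stand-in for the SONIC WBC decoder: token -> joint targets."""
    return np.tanh(token[:NUM_JOINTS])


def run_closed_loop(prompt, num_steps, replan_every=1):
    """Query the policy, decode one token per control step, feed the result back."""
    replan_every = min(replan_every, ACTION_HORIZON)  # cannot outrun the chunk
    proprio = np.zeros(NUM_JOINTS)
    ego_rgb = np.zeros((224, 224, 3), dtype=np.uint8)  # placeholder camera frame
    chunk, cursor, num_queries = None, 0, 0
    targets = []
    for _ in range(num_steps):
        if chunk is None or cursor >= replan_every:
            chunk = query_policy(ego_rgb, proprio, prompt)
            cursor, num_queries = 0, num_queries + 1
        joint_targets = decode_token(chunk[cursor], proprio)
        cursor += 1
        proprio = joint_targets  # toy dynamics: the robot hits the target exactly
        targets.append(joint_targets)
    return np.stack(targets), num_queries
```

With `replan_every=1` the policy is queried at every control step and only the first token of each chunk is used; larger values replay more of the chunk before asking again.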
⚠️ All clips run with RELAX=1: the deviation/fall terminations (anchor_pos, ee_body_pos, anchor_ori_full, foot_pos_xyz) are set to threshold 99, i.e. effectively off, so the full motion plays without the normal 1–2 s deviation reset. These clips therefore demonstrate prompt → token tracking under relaxed termination, not fall-resistance. "Self-sustaining" below means "continues without bootstrap", not "guaranteed upright under strict termination" (that is untested; see scope).
Self-sustaining motions (pure live GR00T, no bootstrap)
macarena: "dance the macarena"
dance: "dance"
squat: "squat"
lunge: "do a forward lunge"
One-shot / traverse motions: require bootstrap
These do not self-start from rest (see scope); the clips were produced with the bootstrap workaround (replay tokens from an open-loop dump to enter the motion, then hand over to live GR00T).
kick: "kick" (bootstrap)
walk: "walk and turn around" (bootstrap)
jump: "jump on one leg" (borderline; bootstrap)
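The bootstrap handoff can be pictured as a token source that switches once. The sketch below is an assumption about the control flow only: `token_stream`, `dump_tokens` and `live_policy` are hypothetical names, not functions from the repo, and the value 80 mirrors the `BOOTSTRAP=80` setting quoted in this card.

```python
import numpy as np


def token_stream(dump_tokens, live_policy, bootstrap_steps, total_steps):
    """Yield (source, token): replay recorded tokens first, then go live."""
    for step in range(total_steps):
        if step < bootstrap_steps and step < len(dump_tokens):
            yield "replay", dump_tokens[step]   # open-loop dump enters the motion
        else:
            yield "live", live_policy(step)     # live policy takes over from here


# Toy inputs: a recorded dump of 120 tokens and a dummy live policy.
dump = np.zeros((120, 64))
live = lambda step: np.ones(64)
sources = [source for source, _ in token_stream(dump, live, 80, 100)]
```

In this toy run the first 80 steps come from the replayed dump and the remaining 20 from the live policy, which is the point at which a memoryless policy already observes a mid-motion pose rather than the rest pose.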
What this is
| Base | NVIDIA Isaac GR00T N1.7 (Gr00tN1d7), VLM backbone nvidia/Cosmos-Reason2-2B (frozen), action_horizon 40 |
| Task | prompt + ego-view + proprioception → action.motion_token (64-d FSQ) |
| Embodiment | unitree_g1_sonic (29-DoF G1 + projected gravity) |
| Decoder | GEAR-SONIC UniversalTokenModule.decode("g1_dyn", …) (released WBC checkpoint) |
| Training | DiT action head + projector fine-tuned, Cosmos VLM frozen; bf16, adafactor, grad-ckpt, single RTX 4090, 8000 steps |
| Data | 7 SONIC deploy-demo motions (dance, macarena, kick, lunge, squat, jump, walk), 1 episode each, recorded as (token, ego_rgb, proprio) |
| Weights | currently stored fp32 (≈12 GB, 3 shards); training was bf16; a bf16 re-save (≈6 GB) is planned |
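The two checkpoint sizes in the Weights row follow from bytes per parameter. The arithmetic below is a rough sanity check only: the total parameter count is an assumption, rounded from the "3B" in the base model name nvidia/GR00T-N1.7-3B, and the helper name is hypothetical.

```python
def checkpoint_size_gb(num_params, bytes_per_param):
    """Approximate on-disk size in GB (10^9 bytes), ignoring file metadata."""
    return num_params * bytes_per_param / 1e9


TOTAL_PARAMS = 3.0e9  # assumption: rounded from "3B" in the base model name

fp32_gb = checkpoint_size_gb(TOTAL_PARAMS, 4)  # fp32 stores 4 bytes per parameter
bf16_gb = checkpoint_size_gb(TOTAL_PARAMS, 2)  # bf16 stores 2 bytes per parameter
```

Under that assumption fp32 comes to about 12 GB and bf16 to about 6 GB, consistent with the figures in the table.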
Results, and how to read them
ℹ️ These numbers are measured on the 7 training trajectories with no held-out split. They quantify fit to the training set, not generalization. Read each with its caveat.
- Open-loop token MSE = 0.0011 vs ground-truth tokens, ~35× below the predict-per-motion-mean baseline (0.039). Caveat: this is train-set reconstruction error, which is strong evidence that the model memorized the 7 token sequences. The per-motion-mean baseline is weak for a 7-way set; a real generalization test (held-out frames / held-out motion / paraphrased prompt) is planned, not yet run.
- Prompt is a control knob: with a fixed observation, swapping only the prompt shifts the predicted token by L2 3.2–5.4 (token std ≈ 0.18). Caveat: the observation dominates the prompt (obs L2 ≈ 11.8 vs prompt L2 ≈ 3–5); the cleanest prompt-driven selection is from a neutral rest pose, and mid-motion the prompt's marginal influence is small.
- Closed-loop self-sustains 4–5 / 7 motions (squat, lunge, dance, macarena) with pure live GR00T; kick/walk/jump need bootstrap. Caveat: measured under RELAX=1 (fall/deviation termination disabled). The pelvis stays ≈ 0.79 m in these clips, but this is not a fall-resistance claim, because the fall criterion was switched off. Stability under strict termination, success-rate, fall-rate and phase-drift are untested (planned: strict-termination re-eval over multiple seeds, bootstrap / no-bootstrap reported separately).
- Checkpoint-8000 was chosen by monotonically-decreasing open-loop MSE on the training trajectories; there is no validation curve, so the selection is on train-set fit.
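To make the first bullet concrete, here is a minimal sketch of an open-loop token MSE and a per-motion-mean baseline. How the card's baseline pools frames and dimensions is not stated, so the pooling here (one mean token per motion, squared error averaged over all frames and dims) is an assumption, and the tokens are synthetic, not the training data.

```python
import numpy as np


def token_mse(pred, target):
    """Mean squared error over all frames and token dimensions."""
    return float(np.mean((pred - target) ** 2))


def per_motion_mean_baseline(tokens_by_motion):
    """MSE of predicting each motion's mean token at every frame (pooled)."""
    sq_err, count = 0.0, 0
    for tokens in tokens_by_motion.values():
        mean_token = tokens.mean(axis=0, keepdims=True)
        sq_err += float(np.sum((tokens - mean_token) ** 2))
        count += tokens.size
    return sq_err / count


# Synthetic stand-in data: two motions, 50 frames each, 64-d tokens.
rng = np.random.default_rng(0)
tokens_by_motion = {
    name: rng.normal(scale=0.2, size=(50, 64)) for name in ("squat", "kick")
}
baseline = per_motion_mean_baseline(tokens_by_motion)

# Ratio of the two numbers reported in this card (baseline / model MSE).
reported_ratio = 0.039 / 0.0011
```

The reported ratio works out to roughly 35.5, matching the "~35×" figure. Note that this baseline is essentially the within-motion token variance, so beating it on training frames says little about held-out behaviour.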
⚠️ Honest scope & limitations
This is a derisk / proof-of-concept checkpoint, not a general humanoid VLA. Please read before using:
- Training-set only, generalization untested. Trained on 7 motions, 1 episode each; new-motion / new-prompt held-out generalization has not been evaluated. Treat as memorization of these 7 skills. With 1.62 B trainable params over ~3815 frames, the action head can simply memorize the trajectories.
- Memoryless single-frame policy (architectural). GR00T predicts a 40-step token chunk from a single frame of obs: no history, no phase/progress signal. At the standing rest pose it can't tell "about to kick" from "just finished", so single-shot return-to-rest motions (kick, walk, jump) settle into standing and need the bootstrap workaround (no retraining): BOOTSTRAP=80 bash LeSONIC/scripts/gear_sonic_live.sh kick. This is an observation-design limitation, not a tuning one; adding data alone will not fix it.
- FSQ token modelling caveat. The 64-dim token is discrete (FSQ, 16 levels/dim); the action head regresses it as a continuous flow-matching target. On-distribution this lands near the grid; off-distribution predictions can fall between grid points. A discrete objective (per-dim CE / a pre-quantization latent passed through SONIC's quantizer) is the cleaner formulation (future work).
- The balance is the WBC's, not this model's. "Doesn't fall" in the demos is largely the pretrained SONIC controller; the VLA's contribution is not yet decoupled by ablation (WBC-replay / mean-token / prompt-shuffle / proprio-only / vision-off all planned).
- Ego camera looks at the ground. It is rigidly attached to the pelvis facing forward and does not see the robot's own body, so the visual signal for dance/pose is weak; the policy leans on proprioception + prompt.
- Not runnable standalone. Requires the GEAR-SONIC WBC checkpoint + the unitree_g1_sonic embodiment config + the live injector + a running GR00T ZMQ server (GR00T and SONIC can't share a process because of a transformers version conflict). See the repo for env lockfile, ZMQ wire schema, token extraction and eval commands.
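The FSQ caveat above can be illustrated by measuring how far a continuous prediction sits from the nearest grid point. This is a sketch under a loud assumption: the 16 levels are taken to be evenly spaced on [-1, 1], which may not match SONIC's actual quantizer, and both function names are hypothetical.

```python
import numpy as np

LEVELS = 16  # levels per dimension (from the model card)


def snap_to_grid(x, levels=LEVELS):
    """Round each value to the nearest of `levels` evenly spaced points in [-1, 1].

    Assumption: a uniform grid on [-1, 1]; the real SONIC quantizer may differ.
    """
    x = np.clip(x, -1.0, 1.0)
    idx = np.rint((x + 1.0) / 2.0 * (levels - 1))
    return idx / (levels - 1) * 2.0 - 1.0


def off_grid_distance(x, levels=LEVELS):
    """Per-dimension distance between a continuous prediction and its grid point."""
    return np.abs(np.clip(x, -1.0, 1.0) - snap_to_grid(x, levels))


# A continuous 64-d "prediction" and its per-dimension quantization gap.
prediction = np.random.default_rng(0).uniform(-1.0, 1.0, size=64)
gap = off_grid_distance(prediction)
```

On this assumed grid the gap is at most half a grid step, 1/15 per dimension. A regression head is free to output any value in that range, whereas a per-dimension classification objective could only ever emit grid points.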
How to run (closed loop)
# 1. start the GR00T inference server (in the Isaac-GR00T venv)
python -m gr00t.eval.run_gr00t_server \
--model_path <this_checkpoint> --embodiment_tag unitree_g1_sonic --port 5555
# 2. drive SONIC's WBC live, in the Isaac viewer (isaaclab env)
bash LeSONIC/scripts/gear_sonic_live.sh macarena # self-sustaining: pure live (RELAX=1 by default)
BOOTSTRAP=80 bash LeSONIC/scripts/gear_sonic_live.sh kick # one-shot: trigger bootstrap then live
# strict termination (real stability test): RELAX=0 bash LeSONIC/scripts/gear_sonic_live.sh macarena
Architecture
prompt  ─┐
ego cam ─┼─► GR00T N1.7 ─(ZMQ)─► 64-d motion_token ─► SONIC WBC decode ─► 29-DoF G1 (balanced)
proprio ─┘   (one model)                              (frozen controller)
Skills live in the token space, not in GR00T. GR00T only learns prompt → token; the
pretrained WBC supplies balance and recovery.
Citation / links
- Validation & critique write-ups (single-file HTML, in the code repo): LeSONIC/doc/groot_sonic_wbc_route.html, LeSONIC/doc/sonic_vla_closeloop_validation.html, LeSONIC/doc/sonic_vla_critique_roadmap.html
- GEAR-SONIC WBC: https://github.com/NVlabs/GR00T-WholeBodyControl
- NVIDIA Isaac GR00T: https://github.com/NVIDIA/Isaac-GR00T
Model tree for wsagi/GR00T-N1.7-G1-SONIC
Base model
nvidia/GR00T-N1.7-3B