GR00T-N1.7 × GEAR-SONIC: prompt-conditioned whole-body motion for Unitree G1

⚠️ Proof-of-concept (derisk checkpoint): read first

This model memorizes 7 motions, 1 episode each. Generalization to new prompts or new motions is untested. The headline MSE 0.0011 is train-set reconstruction error (no held-out split), i.e. evidence of memorization, not generalization. Every demo clip below was recorded with deviation/fall terminations disabled (RELAX=1, thresholds = 99), so the clips show prompt → token tracking, not validated balance or fall-resistance. It is not a deployable general humanoid VLA. Full caveats in Honest scope & limitations.

A single GR00T N1.7 vision-language-action model fine-tuned to emit the 64-dim FSQ motion token of the GEAR-SONIC whole-body controller (WBC). One model takes one prompt and emits a token; the SONIC WBC decodes that token into a full-body G1 motion and supplies balance/recovery. No per-joint behavior cloning.

One prompt-conditioned model, not 7 separate ones. prompt → GR00T → 64-d token → SONIC.decode → 29-DoF motion.

🔗 Code, training & validation scaffolding: vitorcen/isaaclab-experience (LeSONIC/scripts/gear_sonic_live.sh, gear_sonic/data/vla_live_injector.py)


🎥 Closed-loop demos (GR00T-in-the-loop, no C++/DDS)

Each clip below is the live closed loop: at every control step the policy reads the G1's ego camera, proprioception and the prompt; GR00T predicts a token chunk over ZMQ; and SONIC's WBC decodes it to joint targets. The green skeleton is the reference target; the robot follows GR00T's token. <motion> = the prompt.

⚠️ All clips run with RELAX=1: the deviation/fall terminations (anchor_pos, ee_body_pos, anchor_ori_full, foot_pos_xyz) are set to threshold 99, i.e. effectively off, so the full motion plays without the normal 1–2 s deviation reset. These clips therefore demonstrate prompt → token tracking under relaxed termination, not fall-resistance. "Self-sustaining" below means "continues without bootstrap", not "guaranteed upright under strict termination" (that is untested; see scope).
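The closed loop described above can be sketched in a few lines. This is a minimal illustration with stubs, not the repo's injector: `run_closed_loop`, the three callables, and the `replan_every` parameter are hypothetical names, and how many tokens of each 40-step chunk are actually consumed before re-querying GR00T is an assumption.

```python
from typing import Callable, List, Sequence

Token = List[float]        # one 64-d motion token
ACTION_HORIZON = 40        # chunk length stated in this card
TOKEN_DIM = 64
NUM_JOINTS = 29


def run_closed_loop(
    get_observation: Callable[[], dict],
    predict_chunk: Callable[[dict, str], Sequence[Token]],
    decode_token: Callable[[Token], List[float]],
    prompt: str,
    num_steps: int,
    replan_every: int = 1,  # assumption: refresh interval is not stated in the card
) -> List[List[float]]:
    """Return the joint targets produced over num_steps control steps."""
    if replan_every < 1:
        raise ValueError("replan_every must be >= 1")
    joint_targets: List[List[float]] = []
    chunk: Sequence[Token] = []
    offset = 0
    for step in range(num_steps):
        if step % replan_every == 0:
            # Query the policy with the current observation and the prompt.
            chunk = predict_chunk(get_observation(), prompt)
            if len(chunk) < replan_every:
                raise ValueError("chunk is shorter than the replan interval")
            offset = 0
        # The controller decodes one token per control step.
        joint_targets.append(decode_token(chunk[offset]))
        offset += 1
    return joint_targets


# Stubs standing in for the camera/proprio reader, the GR00T server and SONIC.
def stub_observation() -> dict:
    return {"ego_rgb": None, "proprio": [0.0] * NUM_JOINTS}


def stub_policy(obs: dict, prompt: str) -> Sequence[Token]:
    return [[float(k)] * TOKEN_DIM for k in range(ACTION_HORIZON)]


def stub_decoder(token: Token) -> List[float]:
    return [token[0]] * NUM_JOINTS


targets = run_closed_loop(
    stub_observation, stub_policy, stub_decoder, "squat",
    num_steps=5, replan_every=4,
)
```

With `replan_every=4` the stub loop consumes chunk entries 0 to 3 and then re-queries the policy, so the fifth target comes from the first token of a fresh chunk.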

Self-sustaining motions (pure live GR00T, no bootstrap)

macarena: "dance the macarena"

dance: "dance"

squat: "squat"

lunge: "do a forward lunge"

One-shot / traverse motions (require bootstrap)

These do not self-start from rest (see scope); the clips were produced with the bootstrap workaround (replay open-loop dump tokens to enter the motion, then hand to live GR00T).

kick: "kick" (bootstrap)

walk: "walk and turn around" (bootstrap)

jump: "jump on one leg" (borderline; bootstrap)
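The bootstrap workaround used for these clips (replay recorded tokens to enter the motion, then hand over to live GR00T) could be sketched as a simple token-source switch. `make_token_source` and its arguments are hypothetical names, and reading `BOOTSTRAP=80` as a count of control steps is an assumption; the actual logic lives in the repo's live injector.

```python
from typing import Callable, List, Sequence

Token = List[float]


def make_token_source(
    dump_tokens: Sequence[Token],
    live_predict: Callable[[int], Token],
    bootstrap_steps: int,
) -> Callable[[int], Token]:
    """Replay dump_tokens for the first bootstrap_steps steps, then go live."""
    if bootstrap_steps < 0:
        raise ValueError("bootstrap_steps must be non-negative")
    if bootstrap_steps > len(dump_tokens):
        raise ValueError("not enough recorded tokens for the bootstrap window")

    def token_at(step: int) -> Token:
        if step < bootstrap_steps:
            return dump_tokens[step]   # open-loop replay of recorded tokens
        return live_predict(step)      # live policy output from here on

    return token_at


# Toy example: recorded tokens are all ones, the "live" policy returns zeros.
recorded = [[1.0] * 64 for _ in range(80)]
source = make_token_source(recorded, lambda step: [0.0] * 64, bootstrap_steps=80)
first_live_step = next(s for s in range(200) if source(s)[0] == 0.0)
```

Setting `bootstrap_steps=0` reduces this to pure live inference, which corresponds to the self-sustaining case.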


What this is

Base: NVIDIA Isaac GR00T N1.7 (Gr00tN1d7), VLM backbone nvidia/Cosmos-Reason2-2B (frozen), action_horizon 40
Task: prompt + ego-view + proprioception → action.motion_token (64-d FSQ)
Embodiment: unitree_g1_sonic (29-DoF G1 + projected gravity)
Decoder: GEAR-SONIC UniversalTokenModule.decode("g1_dyn", …) (released WBC checkpoint)
Training: DiT action head + projector fine-tuned, Cosmos VLM frozen; bf16, adafactor, grad-ckpt, single RTX 4090, 8000 steps
Data: 7 SONIC deploy-demo motions (dance, macarena, kick, lunge, squat, jump, walk), 1 episode each, recorded as (token, ego_rgb, proprio)
Weights: currently stored fp32 (≈12 GB, 3 shards); training was bf16, and a bf16 re-save (≈6 GB) is planned
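The two weight sizes quoted above follow from bytes per parameter. The short check below assumes roughly 3 billion stored parameters, a figure chosen to match the ≈12 GB fp32 size rather than one measured from the shards.

```python
# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2}


def checkpoint_size_gb(num_params: float, dtype: str) -> float:
    """Approximate on-disk size in GB (10^9 bytes), ignoring file metadata."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 3e9  # assumption: about 3B parameters in the checkpoint

fp32_gb = checkpoint_size_gb(NUM_PARAMS, "fp32")
bf16_gb = checkpoint_size_gb(NUM_PARAMS, "bf16")
```

Under that assumption the fp32 checkpoint comes to about 12 GB and the bf16 re-save to about 6 GB, i.e. half the size.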

Results and how to read them

ℹ️ These numbers are measured on the 7 training trajectories with no held-out split. They quantify fit to the training set, not generalization. Read each with its caveat.

  • Open-loop token MSE = 0.0011 vs ground-truth tokens, ~35× below the predict-per-motion-mean baseline (0.039). Caveat: this is train-set reconstruction error, which is strong evidence the model memorized the 7 token sequences. The per-motion-mean baseline is weak for a 7-way set; a real generalization test (held-out frames / held-out motion / paraphrased prompt) is planned, not yet run.
  • Prompt is a control knob: with a fixed observation, swapping only the prompt shifts the predicted token by L2 3.2–5.4 (token std ≈ 0.18). Caveat: the observation dominates the prompt (obs L2 ≈ 11.8 vs prompt L2 ≈ 3–5). The cleanest prompt-driven selection is from a neutral rest pose; mid-motion the prompt's marginal influence is small.
  • Closed-loop self-sustains 4–5 / 7 motions (squat, lunge, dance and macarena; jump is borderline) with pure live GR00T; the kick, walk and jump clips use bootstrap. Caveat: measured under RELAX=1 (fall/deviation termination disabled). The pelvis stays at ≈ 0.79 m in these clips, but this is not a fall-resistance claim, because the fall criterion was switched off. Stability under strict termination, success rate, fall rate and phase drift are untested (planned: strict-termination re-eval over multiple seeds, with bootstrap / no-bootstrap reported separately).
  • Checkpoint-8000 was chosen by monotonically decreasing open-loop MSE on the training trajectories. There is no validation curve, so the selection is on train-set fit.
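The open-loop numbers above can be reproduced in spirit with a few NumPy helpers. This is a sketch on synthetic data: the function names are hypothetical, and pooling the squared error over all frames and dimensions is an assumption about how the reported MSE and baseline were computed.

```python
from typing import Dict

import numpy as np


def token_mse(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error over all frames and token dimensions."""
    return float(np.mean((pred - target) ** 2))


def per_motion_mean_baseline_mse(tokens_by_motion: Dict[str, np.ndarray]) -> float:
    """MSE of predicting each motion's mean token at every frame of that motion."""
    squared_error_sum = 0.0
    count = 0
    for tokens in tokens_by_motion.values():      # tokens: (frames, 64)
        mean_token = tokens.mean(axis=0, keepdims=True)
        squared_error_sum += float(((tokens - mean_token) ** 2).sum())
        count += tokens.size
    return squared_error_sum / count


def prompt_shift_l2(token_a: np.ndarray, token_b: np.ndarray) -> float:
    """L2 distance between tokens predicted for two prompts, same observation."""
    return float(np.linalg.norm(token_a - token_b))


# Synthetic stand-in data (not the real trajectories).
rng = np.random.default_rng(0)
ground_truth = {
    "squat": rng.normal(scale=0.2, size=(50, 64)),
    "kick": rng.normal(scale=0.2, size=(30, 64)),
}
stacked = np.concatenate(list(ground_truth.values()))
predicted = stacked + rng.normal(scale=0.01, size=stacked.shape)

model_mse = token_mse(predicted, stacked)
baseline_mse = per_motion_mean_baseline_mse(ground_truth)
reported_ratio = 0.039 / 0.0011  # ratio of the two figures quoted in the card
```

The ratio of the two reported figures is about 35, matching the "~35×" in the first bullet. A held-out split would use the same helpers with `predicted` computed on frames the model never saw.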

⚠️ Honest scope & limitations

This is a derisk / proof-of-concept checkpoint, not a general humanoid VLA. Please read before using:

  • Training-set only; generalization untested. Trained on 7 motions, 1 episode each; new-motion / new-prompt held-out generalization has not been evaluated. Treat as memorization of these 7 skills. With 1.62 B trainable params over ~3815 frames, the action head can simply memorize the trajectories.
  • Memoryless single-frame policy (architectural). GR00T predicts a 40-step token chunk from a single frame of obs, with no history and no phase/progress signal. At the standing rest pose it can't tell "about to kick" from "just finished", so single-shot return-to-rest motions (kick, walk, jump) settle into standing and need the bootstrap workaround (no retraining): BOOTSTRAP=80 bash LeSONIC/scripts/gear_sonic_live.sh kick. This is an observation-design limitation, not a tuning one; adding data alone will not fix it.
  • FSQ token modelling caveat. The 64-dim token is discrete (FSQ, 16 levels/dim); the action head regresses it as a continuous flow-matching target. On-distribution this lands near the grid; off-distribution predictions can fall between grid points. A discrete objective (per-dim CE / a pre-quantization latent passed through SONIC's quantizer) is the cleaner formulation (future work).
  • The balance is the WBC's, not this model's. "Doesn't fall" in the demos is largely the pretrained SONIC controller; the VLA's contribution is not yet decoupled by ablation (WBC-replay / mean-token / prompt-shuffle / proprio-only / vision-off ablations are all planned).
  • Ego camera looks at the ground. It is rigidly attached to the pelvis facing forward and does not see the robot's own body, so the visual signal for dance/pose is weak; the policy leans on proprioception + prompt.
  • Not runnable standalone. Requires the GEAR-SONIC WBC checkpoint + the unitree_g1_sonic embodiment config + the live injector + a running GR00T ZMQ server (GR00T and SONIC can't share a process because of a transformers version conflict). See the repo for env lockfile, ZMQ wire schema, token extraction and eval commands.
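The FSQ caveat in the list above can be made concrete by measuring how far a regressed token sits from the nearest grid point. The sketch below assumes 16 uniformly spaced levels per dimension on [-1, 1]; the real level placement and value range of SONIC's quantizer are not stated here, and `snap_to_fsq_grid` / `off_grid_distance` are hypothetical helpers.

```python
import numpy as np


def snap_to_fsq_grid(token, levels: int = 16, low: float = -1.0, high: float = 1.0):
    """Round each dimension to the nearest of `levels` uniform grid values.

    Assumption: a uniform grid on [low, high]; the actual FSQ layout may differ.
    """
    token = np.asarray(token, dtype=np.float64)
    scaled = (token - low) / (high - low) * (levels - 1)
    index = np.clip(np.rint(scaled), 0, levels - 1)
    return low + index / (levels - 1) * (high - low)


def off_grid_distance(token, **grid_kwargs) -> float:
    """L2 distance between a continuous prediction and its snapped version."""
    token = np.asarray(token, dtype=np.float64)
    return float(np.linalg.norm(token - snap_to_fsq_grid(token, **grid_kwargs)))


# A 64-d token that lies exactly on the assumed grid, and a shifted copy.
on_grid_token = np.tile(np.linspace(-1.0, 1.0, 16), 4)
shifted_token = on_grid_token + 0.03

on_grid_error = off_grid_distance(on_grid_token)
shifted_error = off_grid_distance(shifted_token)
```

Tracking this distance during closed-loop runs would be one way to flag off-distribution predictions before they reach the decoder.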

How to run (closed loop)

# 1. start the GR00T inference server (in the Isaac-GR00T venv)
python -m gr00t.eval.run_gr00t_server \
    --model_path <this_checkpoint> --embodiment_tag unitree_g1_sonic --port 5555

# 2. drive SONIC's WBC live, in the Isaac viewer (isaaclab env)
bash LeSONIC/scripts/gear_sonic_live.sh macarena          # self-sustaining: pure live (RELAX=1 by default)
BOOTSTRAP=80 bash LeSONIC/scripts/gear_sonic_live.sh kick  # one-shot: trigger bootstrap then live
# strict termination (real stability test): RELAX=0 bash LeSONIC/scripts/gear_sonic_live.sh macarena

Architecture

prompt  ─┐
ego cam ─┼─► GR00T N1.7 ──(ZMQ)──► 64-d motion_token ──► SONIC WBC decode ──► 29-DoF G1 (balanced)
proprio ─┘     (one model)                                (frozen controller)

Skills live in the token space, not in GR00T. GR00T only learns prompt → token; the pretrained WBC supplies balance and recovery.

Citation / links
