GR00T-N1.7 · RoboCasa OpenCabinet (MimicGen two-stage)

A single-task fine-tune of NVIDIA GR00T-N1.7-3B on the RoboCasa OpenCabinet task, trained on a single RTX 4090 (24 GB).

The main branch holds the best checkpoint (stage-2 step 14000) of a two-stage MimicGen recipe. The earlier human-only model (ckpt-11000) is preserved on the branch human500-ckpt-11000.

Headline

The two-stage MimicGen recipe lifts N1.7 on OpenCabinet from the human-only baseline (43–50%) to **53%** on a fair 30/30 benchmark. It does not overtake the much larger multi-task GR00T-N1.5 (~70%); that gap is reported honestly below.

Two-stage recipe (what worked)

  1. Stage 1 — MimicGen-mix pretrain. Warm-start from nvidia/GR00T-N1.7-3B, train on a native mix of 8644 MimicGen-generated episodes + 500 human demos (mix_ratio human 1.0 : MG 3.0). No physical dataset merge — GR00T's data factory mixes the two LeRobot sets by weight, so v2.0/v2.1 format and task-index differences never matter. Peak ~step 34000.
  2. Stage 2 — pure-human finetune. Continue from the stage-1 ckpt on pure human data so the final policy re-aligns to the human evaluation distribution. Peak at step 14000 (this model).
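
The stage-1 mix weights can be read as sampling probabilities. The sketch below is not GR00T's data factory; it assumes the weights are simply normalized and each training sample first picks a source by that probability. The names `mix_probabilities`, `sample_source`, `MIX` and `EPISODES` are illustrative only.

```python
import random

def mix_probabilities(weights):
    """Normalize raw mix weights into per-source sampling probabilities."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def sample_source(weights, rng):
    """Pick which dataset the next sample comes from (assumed behaviour)."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

# Values taken from the recipe above: human 1.0 : MG 3.0, 500 vs 8644 episodes.
MIX = {"human": 1.0, "mimicgen": 3.0}
EPISODES = {"human": 500, "mimicgen": 8644}

probs = mix_probabilities(MIX)  # human 0.25, mimicgen 0.75

# Chance that one specific episode is the source of a given sample.
per_episode = {name: probs[name] / EPISODES[name] for name in MIX}

# How much more often a single human episode is drawn than a single MG episode.
human_upweight = per_episode["human"] / per_episode["mimicgen"]

print(probs, round(human_upweight, 2))
```

Under this reading, human demos supply 25% of samples, whereas plain concatenation would give them 500 / 9144, about 5.5%. Each human episode is therefore drawn roughly 5.8 times as often as each MimicGen episode.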

Key lessons (the recipe is the point, not just the weights):

  • A from-scratch 75% MimicGen mix peaks ~53% then overfits — MimicGen alone dilutes the in-distribution human signal. The pure-human stage-2 is what recovers eval performance.
  • Eval at 1200 sim-steps, not 400 — successes here average ~510 steps; a 400-step cap reports near-0 for slow-but-successful rollouts.
  • Continuing past the peak with the original LR schedule collapses the policy (LR re-injection shock); a gentle low-LR continuation holds but does not improve — step 14000 is the knee.
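
The step-cap lesson comes down to simple counting. In the sketch below the rollout lengths are hypothetical, made up for illustration rather than measured; only the two caps (400 and 1200) come from this card.

```python
def success_rate(steps_to_success, max_steps):
    """Fraction of episodes counted as successes under a step cap.

    steps_to_success holds, per episode, the env step at which the task was
    completed, or None if the policy never completes it.
    """
    solved = [s for s in steps_to_success if s is not None and s <= max_steps]
    return len(solved) / len(steps_to_success)

# HYPOTHETICAL rollouts (not real results): five eventual successes, three failures.
rollouts = [380, 450, 510, 620, 700, None, None, None]

rate_400 = success_rate(rollouts, 400)    # only the 380-step episode counts
rate_1200 = success_rate(rollouts, 1200)  # all five completed episodes count

print(rate_400, rate_1200)
```

The same set of rollouts scores 12.5% under a 400-step cap and 62.5% under a 1200-step cap, which is why the cap must exceed the typical time to success.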

Demo rollouts

Direct links: demo-1 · demo-2

Fair benchmark — RoboCasa OpenCabinet (target split)

30 rounds · 1200 max env-steps · n_action_steps=16 · seed_base=0 (the same 30 seed-locked scenes, one fixed cabinet layout, for every policy). DNF (did-not-finish, i.e. sim-crashed) episodes were retried to completion (EVAL_EP_RETRIES=10), so every policy is scored on a full 30/30 with no exclusion bias.
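
A minimal sketch of the retry-to-completion protocol follows. It rests on assumptions not confirmed by this card: that episode `i` uses seed `seed_base + i`, that a retry reuses the same seed, and that `EVAL_EP_RETRIES=10` means up to 10 extra attempts. `evaluate` and `run_episode` are hypothetical names, not the repo's eval API.

```python
def evaluate(run_episode, n_episodes=30, seed_base=0, max_retries=10):
    """Score every episode, retrying crashes so none is silently excluded.

    run_episode(seed) is a hypothetical callable returning True (success),
    False (failure), or None when the simulator crashed (DNF).
    Returns (successes, episodes_scored).
    """
    outcomes = []
    for i in range(n_episodes):
        seed = seed_base + i  # assumed seed scheme: one fixed seed per scene
        result = None
        for _attempt in range(1 + max_retries):
            result = run_episode(seed)
            if result is not None:  # a real success/failure, not a crash
                break
        if result is None:
            # Fail loudly instead of shrinking the denominator.
            raise RuntimeError(f"episode {i} (seed {seed}) never completed")
        outcomes.append(result)
    return sum(outcomes), len(outcomes)
```

Raising when retries run out, rather than skipping the episode, is what keeps the denominator at 30 for every policy.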

| # | Policy | Success rate | Successes | Mean steps (success) |
|---|--------|--------------|-----------|----------------------|
| 1 | GR00T-N1.5-multitask (downloaded reference) | 70.0% | 21/30 | 516 |
| 2 | GR00T-N1.7-MG2stage (this model, ckpt-14000) | 53.3% | 16/30 | 514 |
| 3 | pi0.5-pretrain-human300 (downloaded reference) | 23.3% | 7/30 | 448 |

Why this differs from earlier "69%" numbers. An earlier run excluded crashed (sim-DNF) episodes, which flattered policies unevenly: it inflated this model to 69% while deflating N1.5 to 64%. Retrying every episode to a real 30/30 removes that bias, and N1.5 comes out clearly ahead. Closed-loop variance is high (roughly ±8–9 percentage points per 30-episode run; this model spanned 53–69% across runs), so treat these as single-run point estimates.
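
The size of that variance is consistent with plain binomial sampling noise. The sketch below assumes the 30 episodes behave as independent Bernoulli trials; the card does not say how its ±8–9% figure was derived, so this is a cross-check rather than the original calculation.

```python
import math

def binomial_se(successes, n):
    """Standard error of a success rate estimated from n independent trials."""
    p = successes / n
    return math.sqrt(p * (1 - p) / n)

# Success counts from the benchmark table (out of 30 episodes each).
for name, k in [("N1.5-multitask", 21), ("N1.7-MG2stage", 16), ("pi0.5", 7)]:
    print(f"{name}: {k / 30:.1%} ± {binomial_se(k, 30):.1%}")
```

For 16/30 the standard error is about 9.1 points and for 21/30 about 8.4 points, so a single 30-episode run cannot resolve differences of a few points.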

Honesty

  • N1.5 (70%) leads. It is a downloaded multi-task model with large-scale pretraining (120k steps); this is a single-task RoboCasa fine-tune. MimicGen narrows but does not close that gap.
  • MimicGen still helped: against the human-only ckpt-11000 (human500-ckpt-11000 branch, ~43% when counted over all 30 episodes), the two-stage recipe's 53.3% is a real gain of about 10 points.
  • Eval uses driver/worker subprocess isolation (one fresh subprocess per episode). Under CPython 3.11's adaptive interpreter, MuJoCo GL/EGL heap corruption surfaces as bizarre errors across repeated env.reset() calls; isolating each episode confines such a crash to that episode, so it cannot take down the whole run.
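
The driver/worker pattern can be sketched with the standard library alone. This is an illustration, not the repo's eval pipeline: the `WORKER` script is a stand-in that fakes an outcome instead of building a RoboCasa env, and `run_episode_isolated` is a hypothetical name.

```python
import json
import subprocess
import sys

# Stand-in worker: a real one would build the env, roll out the policy,
# and print one JSON line with the outcome. Here the result is faked.
WORKER = """
import json, sys
seed = int(sys.argv[1])
print(json.dumps({"seed": seed, "success": seed % 2 == 0}))
"""

def run_episode_isolated(seed, worker_source=WORKER, timeout_s=600):
    """Run one episode in a fresh interpreter; return its result dict,
    or None if the worker crashed, hung, or printed no valid JSON."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", worker_source, str(seed)],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:  # segfault, abort, or uncaught exception
        return None
    try:
        return json.loads(proc.stdout.strip().splitlines()[-1])
    except (IndexError, json.JSONDecodeError):
        return None

if __name__ == "__main__":
    print(run_episode_isolated(2))
```

Because each worker starts with a clean heap and GL context, any corruption dies with that process, and the driver sees only a `None` that it can retry.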

Training recipe

| Setting | Value |
|---------|-------|
| Base model | nvidia/GR00T-N1.7-3B |
| Task | RoboCasa OpenCabinet (target split) |
| Data | Stage 1: 8644 MimicGen ep + 500 human (native mix). Stage 2: 500 human |
| Hardware | 1× RTX 4090 24 GB, bf16 |
| Optimizer | adafactor + gradient checkpointing |
| Batch | global 4 / grad-accum 4 (micro-batch 1), num_workers=0 |
| Best step | 14000 (stage-2) |

Trainable: action-head DiT + projector + linear heads + VL-LN (~600 M). Frozen: Cosmos vision encoder + LLM backbone (tune_top_llm_layers=0).

Dataset

Trained on RoboCasa OpenCabinet data generated from the RoboCasa simulator / assets (robocasa/robocasa-assets): the OpenCabinet subset only — 500 human teleoperation demos + 8644 MimicGen-generated episodes (LeRobot format, 3 cameras @ 256×256, 20 fps). This is a small slice of the full multi-task RoboCasa data.

Download & run eval yourself

# 1. checkpoint (this repo, main = best)
huggingface-cli download wsagi/GR00T-N1.7-RoboCasa-OpenCabinet \
    --local-dir ./ckpt --revision main

# 2. RoboCasa env + assets (sim is pure — no demo dataset needed to *eval*)
pip install robocasa   # see github.com/robocasa for full setup
python -m robocasa.scripts.download_kitchen_assets

# 3. seed-locked eval, fair 30/30 (retry DNFs to completion)
SEED_BASE=0 EVAL_EP_RETRIES=10 \
  python scripts/eval_gr00t_n17.py \
    --env-name OpenCabinet --split target \
    --n-episodes 30 --max-steps 1200 --n-action-steps 16 \
    --ckpt ./ckpt/checkpoint-14000 \
    --results-path results.json

Project repos

  • robocasa-training — training scripts, eval pipeline, benchmark, MimicGen recipe
  • mujoco-experience — project hub (scene library, previews, install scripts)