GR00T-N1.7 · RoboCasa OpenCabinet (MimicGen two-stage)
A single-task fine-tune of NVIDIA GR00T-N1.7-3B on the RoboCasa OpenCabinet task, trained on a single RTX 4090 (24 GB).
This main checkpoint (step 14000) is the best of a two-stage MimicGen recipe. The
previous human-only model (ckpt-11000) is preserved on the branch
human500-ckpt-11000.
Headline
The MimicGen two-stage recipe lifts N1.7 on OpenCabinet from the human-only baseline (43–50%) to
**53%** on a fair 30/30 benchmark. It does not overtake the much larger multi-task
GR00T-N1.5 (~70%); that gap is reported honestly below.
Two-stage recipe (what worked)
- Stage 1: MimicGen-mix pretrain. Warm-start from nvidia/GR00T-N1.7-3B, then train on a native mix of 8644 MimicGen-generated episodes + 500 human demos (mix_ratio human 1.0 : MG 3.0). No physical dataset merge is needed: GR00T's data factory mixes the two LeRobot sets by weight, so v2.0/v2.1 format and task-index differences never matter. Peak ~step 34000.
- Stage 2: pure-human finetune. Continue from the stage-1 ckpt on pure human data so the final policy re-aligns to the human evaluation distribution. Peak at step 14000 (this model).
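To make the stage-1 mix ratio concrete, here is a minimal sketch of weighted dataset mixing. It assumes the weights act per dataset (each training sample first picks a source by weight), which is one plausible reading of mix_ratio and not a description of GR00T's actual data factory. The names `MIX_WEIGHTS`, `mix_probabilities`, and `sample_sources` are hypothetical.

```python
import random
from collections import Counter

# Weights copied from the mix_ratio quoted above (human 1.0 : MG 3.0).
# ASSUMPTION: weights apply per dataset, not per episode.
MIX_WEIGHTS = {"human": 1.0, "mimicgen": 3.0}


def mix_probabilities(weights):
    """Normalize per-dataset weights into sampling probabilities."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def sample_sources(weights, n_draws, seed=0):
    """Draw which dataset each training sample comes from, by weight."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=n_draws)


probs = mix_probabilities(MIX_WEIGHTS)
counts = Counter(sample_sources(MIX_WEIGHTS, n_draws=10_000))
print(probs)   # {'human': 0.25, 'mimicgen': 0.75}
print(counts)  # empirical split, close to 1:3
```

Under this reading, MimicGen supplies about three quarters of the stage-1 samples regardless of how many episodes each set contains, which is why no physical merge of the two datasets is required.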
Key lessons (the recipe is the point, not just the weights):
- A from-scratch 75% MimicGen mix peaks ~53% then overfits — MimicGen alone dilutes the in-distribution human signal. The pure-human stage-2 is what recovers eval performance.
- Eval at 1200 sim-steps, not 400 — successes here average ~510 steps; a 400-step cap reports near-0 for slow-but-successful rollouts.
- Continuing past the peak with the original LR schedule collapses the policy (LR re-injection shock); a gentle low-LR continuation holds but does not improve — step 14000 is the knee.
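The effect of the step cap can be illustrated with a small sketch. The rollout lengths below are hypothetical, chosen only so that the successful ones average 510 steps like the figure quoted above; they are not measured data, and `success_rate` is an illustrative helper rather than part of the eval pipeline.

```python
def success_rate(steps_to_success, max_steps):
    """Fraction of rollouts counted as successes under a step cap.

    steps_to_success holds one entry per rollout: an int is the step at
    which the task was solved, None means it was never solved.
    """
    solved = sum(
        1 for s in steps_to_success if s is not None and s <= max_steps
    )
    return solved / len(steps_to_success)


# HYPOTHETICAL rollouts (not measured): five successes averaging 510 steps,
# five failures.
rollouts = [380, 450, 510, 560, 650, None, None, None, None, None]

for cap in (400, 1200):
    print(cap, success_rate(rollouts, cap))
# 400 0.1
# 1200 0.5
```

With a 400-step cap, every success slower than 400 steps is scored as a failure, so the same policy looks far worse than it is.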
Fair benchmark — RoboCasa OpenCabinet (target split)
30 rounds · 1200 max env-steps · n_action_steps=16 · seed_base=0 (the same 30
seed-locked scenes, one fixed cabinet layout, for every policy). DNF episodes were retried to
completion (EVAL_EP_RETRIES=10), so all policies count a full 30/30 — no exclusion bias.
| # | Policy | Success rate | Successes | Mean steps (success) |
|---|---|---|---|---|
| 1 | GR00T-N1.5-multitask (downloaded reference) | 70.0% | 21/30 | 516 |
| 2 | GR00T-N1.7-MG2stage (this model, ckpt-14000) | 53.3% | 16/30 | 514 |
| 3 | pi0.5-pretrain-human300 (downloaded reference) | 23.3% | 7/30 | 448 |
Why this differs from earlier "69%" numbers. An earlier run excluded crashed (sim-DNF) episodes, which affected policies unevenly: it inflated this model to 69% while deflating N1.5 to 64%. Retrying every episode to a real 30/30 removes that bias, and N1.5 comes out clearly ahead. Closed-loop variance is high (roughly ±8–9 percentage points per 30-round run; this model spanned 53–69% across runs), so treat these as single-run point estimates.
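As a rough check on that spread, the sketch below computes the binomial standard error and a 95% Wilson score interval for the three rows of the table. It assumes the 30 rounds are independent Bernoulli trials, which is a simplification, and the helper names are illustrative. The standard errors come out at about 8 to 9 percentage points for the two GR00T rows, which appears consistent with the variance quoted above.

```python
import math


def binomial_se(successes, n):
    """Standard error of a success rate estimated from n independent trials."""
    p = successes / n
    return math.sqrt(p * (1 - p) / n)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Success counts taken from the benchmark table above (30 rounds each).
table = [
    ("GR00T-N1.5-multitask", 21),
    ("GR00T-N1.7-MG2stage", 16),
    ("pi0.5-pretrain-human300", 7),
]

for name, k in table:
    se = binomial_se(k, 30)
    lo, hi = wilson_interval(k, 30)
    print(f"{name}: {k}/30  SE={se:.3f}  95% Wilson=({lo:.2f}, {hi:.2f})")
```

Intervals this wide are the reason a single 30-round run should be read as a point estimate rather than a precise ranking.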
Honesty
- N1.5 (70%) leads. It is a downloaded multi-task model with large-scale pretraining (120k steps); this is a single-task RoboCasa fine-tune. MimicGen narrows but does not close that gap.
- MimicGen still helped: vs the human-only ckpt-11000 (human500-ckpt-11000 branch, ~43% on the same full-30 basis), the two-stage recipe's 53.3% is a real ~10-point gain.
- Eval uses driver/worker subprocess isolation (one fresh subprocess per episode): CPython 3.11's adaptive interpreter surfaces MuJoCo GL/EGL heap corruption as bizarre errors across repeated env.reset(); isolating episodes means a crash costs one episode rather than the whole run.
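The driver/worker pattern can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's eval script: `run_episode_isolated` is a hypothetical helper, the worker here is a stand-in one-liner instead of a real rollout, and treating `retries` as "extra attempts after the first" is an assumed reading of EVAL_EP_RETRIES.

```python
import subprocess
import sys


def run_episode_isolated(worker_cmd, retries=10, timeout_s=None):
    """Run one episode in a fresh subprocess, retrying on crash (DNF).

    Returns the worker's stdout on the first clean exit, or None if every
    attempt crashed or timed out. ASSUMPTION: retries counts extra attempts
    after the first one.
    """
    for _attempt in range(1 + retries):
        try:
            proc = subprocess.run(
                worker_cmd, capture_output=True, text=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            continue  # treat a hang like a crash and retry
        if proc.returncode == 0:
            return proc.stdout.strip()
    return None


# Stand-in worker: a real one would build the env, run the policy for one
# episode, and print the outcome.
worker = [sys.executable, "-c", "print('success')"]
results = [run_episode_isolated(worker, retries=2) for _ in range(3)]
print(results)  # ['success', 'success', 'success']
```

Because each episode gets its own interpreter and GL context, memory corruption in one rollout cannot leak into the next, and the driver only ever sees an exit code and the worker's output.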
Training recipe
| Setting | Value |
|---|---|
| Base model | nvidia/GR00T-N1.7-3B |
| Task | RoboCasa OpenCabinet (target split) |
| Data | Stage 1: 8644 MimicGen ep + 500 human (native mix). Stage 2: 500 human |
| Hardware | 1× RTX 4090 24 GB, bf16 |
| Optimizer | adafactor + gradient checkpointing |
| Batch | global 4 / grad-accum 4 (micro-batch 1), num_workers=0 |
| Best step | 14000 (stage-2) |
Trainable: action-head DiT + projector + linear heads + VL-LN (~600 M). Frozen: Cosmos vision
encoder + LLM backbone (tune_top_llm_layers=0).
Dataset
Trained on RoboCasa OpenCabinet data generated with the RoboCasa simulator and assets
(robocasa/robocasa-assets). Only the OpenCabinet subset is used: 500 human teleoperation
demos + 8644 MimicGen-generated episodes (LeRobot format, 3 cameras @ 256×256, 20 fps).
This is a small slice of the full multi-task RoboCasa data.
Download & run eval yourself
```bash
# 1. checkpoint (this repo, main = best)
huggingface-cli download wsagi/GR00T-N1.7-RoboCasa-OpenCabinet \
  --local-dir ./ckpt --revision main

# 2. RoboCasa env + assets (sim is pure, no demo dataset needed to *eval*)
pip install robocasa # see github.com/robocasa for full setup
python -m robocasa.scripts.download_kitchen_assets

# 3. seed-locked eval, fair 30/30 (retry DNFs to completion)
SEED_BASE=0 EVAL_EP_RETRIES=10 \
python scripts/eval_gr00t_n17.py \
  --env-name OpenCabinet --split target \
  --n-episodes 30 --max-steps 1200 --n-action-steps 16 \
  --ckpt ./ckpt/checkpoint-14000 \
  --results-path results.json
```
Project repos
- robocasa-training — training scripts, eval pipeline, benchmark, MimicGen recipe
- mujoco-experience — project hub (scene library, previews, install scripts)