GR00T-N1.7 · RoboCasa OpenCabinet (MimicGen two-stage)

A single-task fine-tune of NVIDIA GR00T-N1.7-3B on the RoboCasa OpenCabinet task, trained on a single RTX 4090 (24 GB).

The checkpoint on main (step 14000) is the best one produced by a two-stage MimicGen recipe. The previous human-only model (ckpt-11000) is preserved on the branch human500-ckpt-11000.

Headline

The two-stage MimicGen recipe lifts N1.7 on OpenCabinet from the human-only baseline (43–50%) to **53%** on a fair 30/30 benchmark (all 30 episodes counted). It does not overtake the much larger multi-task GR00T-N1.5 (~70%); both results are reported honestly below.

Two-stage recipe (what worked)

Stage 1 — MimicGen-mix pretrain. Warm-start from nvidia/GR00T-N1.7-3B, train on a native mix of 8644 MimicGen-generated episodes + 500 human demos (mix_ratio human 1.0 : MG 3.0). No physical dataset merge — GR00T's data factory mixes the two LeRobot sets by weight, so v2.0/v2.1 format and task-index differences never matter. Peak ~step 34000.
Stage 2 — pure-human finetune. Continue from the stage-1 ckpt on pure human data so the final policy re-aligns to the human evaluation distribution. Peak at step 14000 (this model).
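
The 1.0 : 3.0 mix ratio can be read as per-sample draw probabilities. The sketch below is a minimal illustration, assuming the weights are normalized into per-dataset sampling probabilities and that episodes are drawn uniformly within each dataset; both are assumptions about the data factory, not confirmed behavior. `mix_probabilities` is a hypothetical helper.

```python
def mix_probabilities(weights):
    """Normalize per-dataset mix weights into sampling probabilities."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("mix weights must sum to a positive value")
    return {name: w / total for name, w in weights.items()}


# Stage-1 values from the recipe above.
weights = {"human": 1.0, "mimicgen": 3.0}
episodes = {"human": 500, "mimicgen": 8644}

probs = mix_probabilities(weights)

# Chance that one specific episode is drawn, assuming uniform sampling
# inside each dataset (an assumption, see the lead-in).
per_episode = {name: probs[name] / episodes[name] for name in probs}
upweight = per_episode["human"] / per_episode["mimicgen"]

print(probs)               # share of samples coming from each dataset
print(round(upweight, 2))  # how much more often a human episode is seen
```

Under these assumptions 75% of samples come from MimicGen, yet each individual human episode is drawn about 5.8 times as often as each MimicGen episode. A plain concatenation would instead give the human demos only 500/9144 of the samples.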

Key lessons (the recipe is the point, not just the weights):

A from-scratch 75% MimicGen mix peaks ~53% then overfits — MimicGen alone dilutes the in-distribution human signal. The pure-human stage-2 is what recovers eval performance.
Eval at 1200 sim-steps, not 400 — successes here average ~510 steps; a 400-step cap reports near-0 for slow-but-successful rollouts.
Continuing past the peak with the original LR schedule collapses the policy (LR re-injection shock); a gentle low-LR continuation holds but does not improve — step 14000 is the knee.
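
The step-cap lesson can be made concrete with a small sketch. The rollout lengths below are made-up illustrative values, not measured data; they are chosen only so that successes average 510 steps. `success_rate_under_cap` is a hypothetical helper.

```python
def success_rate_under_cap(steps_to_success, cap):
    """Fraction of episodes that count as solved under a step cap.

    steps_to_success holds the step at which each episode succeeds,
    or None if the episode never succeeds.
    """
    solved = sum(1 for s in steps_to_success if s is not None and s <= cap)
    return solved / len(steps_to_success)


# Illustrative only: five slow successes (mean 510 steps) and five failures.
rollouts = [430, 470, 510, 550, 590, None, None, None, None, None]

for cap in (400, 1200):
    print(cap, success_rate_under_cap(rollouts, cap))
```

With these toy values the same policy scores 0.0 under a 400-step cap and 0.5 under a 1200-step cap, because every success lands after step 400.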

Demo rollouts

Direct links: demo-1 · demo-2

Fair benchmark — RoboCasa OpenCabinet (target split)

30 rounds · 1200 max env-steps · n_action_steps=16 · seed_base=0 (the same 30 seed-locked scenes, one fixed cabinet layout, for every policy). DNF episodes (rollouts that crashed in the simulator) were retried to completion (EVAL_EP_RETRIES=10), so every policy is scored on a full 30/30, with no exclusion bias.

| # | Policy | Success rate | Successes | Mean steps (success) |
|---|---|---|---|---|
| 1 | GR00T-N1.5-multitask (downloaded reference) | 70.0% | 21/30 | 516 |
| 2 | GR00T-N1.7-MG2stage (this model, ckpt-14000) | 53.3% | 16/30 | 514 |
| 3 | pi0.5-pretrain-human300 (downloaded reference) | 23.3% | 7/30 | 448 |

Why this differs from earlier "69%" numbers. An earlier run excluded crashed (sim-DNF) episodes, which flattered policies unevenly: it inflated this model to 69% while deflating N1.5 to 64%. Retrying every episode to a real 30/30 removes that bias, and N1.5 comes out clearly ahead. Closed-loop variance is high (roughly ±8–9 percentage points per 30-round run; this model spanned 53–69% across runs), so treat these as single-run point estimates.
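
The ±8–9 point figure matches the binomial standard error for 30 episodes. The sketch below recomputes it from the table's success counts and adds a 95% Wilson score interval. Treating each episode as an independent Bernoulli trial is a simplifying assumption, and the function names are illustrative.

```python
import math


def binomial_se(successes, n):
    """Standard error of a success rate estimated from n independent trials."""
    p = successes / n
    return math.sqrt(p * (1 - p) / n)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Success counts out of 30, taken from the benchmark table.
results = {
    "GR00T-N1.5-multitask": 21,
    "GR00T-N1.7-MG2stage": 16,
    "pi0.5-pretrain-human300": 7,
}

for name, k in results.items():
    lo, hi = wilson_interval(k, 30)
    print(f"{name}: {k}/30  se={binomial_se(k, 30):.3f}  95% CI=({lo:.2f}, {hi:.2f})")
```

For 16/30 the standard error is about 0.09, and the Wilson interval spans roughly 36–70%. The interval for 21/30 is roughly 52–83%, so the two overlap, which is why a single 30-round run should be read as a point estimate.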

Honesty

N1.5 (70%) leads. It is a downloaded multi-task model with large-scale pretraining (120k steps); this is a single-task RoboCasa fine-tune. MimicGen narrows but does not close that gap.
MimicGen still helped: against the human-only ckpt-11000 (human500-ckpt-11000 branch, ~43% with the same 30-episode denominator), the two-stage recipe's 53.3% is a real gain of about 10 points.
Eval uses driver/worker subprocess isolation (one fresh subprocess per episode). Under CPython 3.11's adaptive interpreter, MuJoCo GL/EGL heap corruption surfaces as bizarre errors across repeated env.reset() calls; isolating each episode means a crash costs only that episode and cannot take down the whole run.
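
The driver/worker pattern with retries can be sketched as follows. This is not the project's eval script: the worker is a toy stand-in that simulates a crash, all names are hypothetical, and it assumes EVAL_EP_RETRIES means the number of extra attempts after the first.

```python
import subprocess
import sys

# Toy worker: a real one would build the env, run one rollout, and print the
# outcome. Here odd seeds crash on their first attempt to mimic a sim DNF.
WORKER_CODE = """
import sys
seed, attempt = int(sys.argv[1]), int(sys.argv[2])
if attempt == 0 and seed % 2 == 1:
    sys.exit(1)  # simulated simulator crash
print("success" if seed % 3 else "failure")
"""


def run_episode(seed, max_retries=10, timeout_s=60):
    """Run one episode in a fresh interpreter, retrying if the worker dies."""
    for attempt in range(max_retries + 1):
        try:
            proc = subprocess.run(
                [sys.executable, "-c", WORKER_CODE, str(seed), str(attempt)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            continue  # a hung worker is treated like a crash
        if proc.returncode == 0:
            return proc.stdout.strip()
    return "dnf"  # only reached if every attempt crashed


def run_benchmark(seed_base, n_episodes):
    """Seed-locked benchmark: episode i always uses seed_base + i."""
    return [run_episode(seed_base + i) for i in range(n_episodes)]


if __name__ == "__main__":
    outcomes = run_benchmark(seed_base=0, n_episodes=6)
    print(outcomes, f"{outcomes.count('success')}/{len(outcomes)}")
```

Because each retry reuses the same seed, a crashed episode is replayed on the same scene rather than dropped, which keeps the denominator at the full episode count.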

Training recipe


| Setting | Value |
|---|---|
| Base model | `nvidia/GR00T-N1.7-3B` |
| Task | RoboCasa OpenCabinet (target split) |
| Data | Stage 1: 8644 MimicGen ep + 500 human (native mix). Stage 2: 500 human |
| Hardware | 1× RTX 4090 24 GB, bf16 |
| Optimizer | adafactor + gradient checkpointing |
| Batch | global 4 / grad-accum 4 (micro-batch 1), `num_workers=0` |
| Best step | 14000 (stage-2) |

Trainable: action-head DiT + projector + linear heads + VL-LN (~600 M parameters). Frozen: Cosmos vision encoder + LLM backbone (tune_top_llm_layers=0).

Dataset

Trained on RoboCasa OpenCabinet data generated from the RoboCasa simulator / assets (robocasa/robocasa-assets): the OpenCabinet subset only — 500 human teleoperation demos + 8644 MimicGen-generated episodes (LeRobot format, 3 cameras @ 256×256, 20 fps). This is a small slice of the full multi-task RoboCasa data.

Download & run eval yourself

```bash
# 1. checkpoint (this repo, main = best)
huggingface-cli download wsagi/GR00T-N1.7-RoboCasa-OpenCabinet \
    --local-dir ./ckpt --revision main

# 2. RoboCasa env + assets (sim is pure: no demo dataset needed to *eval*)
pip install robocasa   # see github.com/robocasa for full setup
python -m robocasa.scripts.download_kitchen_assets

# 3. seed-locked eval, fair 30/30 (retry DNFs to completion)
SEED_BASE=0 EVAL_EP_RETRIES=10 \
  python scripts/eval_gr00t_n17.py \
    --env-name OpenCabinet --split target \
    --n-episodes 30 --max-steps 1200 --n-action-steps 16 \
    --ckpt ./ckpt/checkpoint-14000 \
    --results-path results.json
```

Project repos

robocasa-training — training scripts, eval pipeline, benchmark, MimicGen recipe
mujoco-experience — project hub (scene library, previews, install scripts)
