SmolVLA fine-tuned on RoboTwin stack_bowls_two

SmolVLA policy (450M params) fine-tuned on the dual-arm bowl-stacking task (stack_bowls_two) in the RoboTwin 2.0 simulator.

Training data

  • Source: RoboTwin stack_bowls_two task, demo_clean config (D435 cameras at 320x240, width x height).
  • Episodes: 223 (collected via collect_data.sh stack_bowls_two demo_clean).
  • Frames: ~47k @ effective 16.67 Hz.
  • Images: native 240x320 (height x width; no offline resize; aspect-preserving letterbox via the model's resize_with_pad=[512,512]).
  • State / Action: 16-D dual-arm EEF (pos3 + quat4 + grip1 per arm).
  • Language instruction: fixed "stack the bowls" for all episodes (strategy A: a single shared instruction, so the policy cannot learn a spurious 1:1 mapping from instruction text to episode).
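
The 16-D state/action layout above can be unpacked as in the following minimal sketch. The arm ordering (which arm comes first) and the quaternion component order are not stated in this card, so the fields are named generically; ArmState and split_dual_arm are hypothetical helper names, not part of LeRobot or RoboTwin.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

ARM_DIM = 8  # pos3 + quat4 + grip1, as listed above


@dataclass
class ArmState:
    position: List[float]    # 3 values
    quaternion: List[float]  # 4 values; component order is an assumption left open
    gripper: float           # 1 value


def _split_arm(values: Sequence[float]) -> ArmState:
    # Slice one 8-D arm block into position, orientation, and gripper.
    return ArmState(
        position=list(values[0:3]),
        quaternion=list(values[3:7]),
        gripper=float(values[7]),
    )


def split_dual_arm(state: Sequence[float]) -> Tuple[ArmState, ArmState]:
    """Split a 16-D vector into two per-arm blocks (first arm, second arm)."""
    if len(state) != 2 * ARM_DIM:
        raise ValueError(f"expected {2 * ARM_DIM} values, got {len(state)}")
    return _split_arm(state[:ARM_DIM]), _split_arm(state[ARM_DIM:])
```

Checking the vector length up front makes a mismatched state format fail loudly rather than silently shifting fields between arms.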

Training config

  • Batch size 16, 20000 steps, bf16, cosine LR schedule (1000 warmup steps, decay over 20000 steps).
  • Base: lerobot/smolvla_base (full fine-tune, vision encoder unfrozen).
  • chunk_size=50, n_action_steps=50.
  • rename_map: dual_cam_global -> camera1, cam_wrist_65 -> camera2, cam_wrist_75 -> camera3.
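
For intuition about the learning-rate schedule listed above, here is a generic sketch of a cosine schedule with warmup, expressed as a multiplier on the peak learning rate. It assumes linear warmup and a cosine phase measured from the end of warmup; the exact formula in the training code may differ (for example, the cosine phase may be measured from step 0, or decay to a nonzero floor). lr_scale is a hypothetical helper name.

```python
import math


def lr_scale(step: int,
             warmup_steps: int = 1000,
             decay_steps: int = 20000,
             min_scale: float = 0.0) -> float:
    """Multiplier on the peak learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Assumed linear warmup from 0 to 1.
        return step / warmup_steps
    # Fraction of the cosine phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / (decay_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_scale + (1.0 - min_scale) * cosine
```

With these defaults the multiplier reaches 1.0 at step 1000 and falls to min_scale at step 20000, the last training step.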

Evaluation (RoboTwin sim, max_steps=400, 10 episodes)

Success rate: 5/10 (50%) with task_text="stack the bowls" and --skip_resize.

Episode  Result   Steps
0        FAIL     400 (timeout)
1        SUCCESS  290
2        FAIL     400
3        SUCCESS  290
4        FAIL     400
5        SUCCESS  290
6        FAIL     400
7        SUCCESS  270
8        FAIL     400
9        SUCCESS  299
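
The per-episode table can be summarized with a short script. This is a sketch under two assumptions not stated in the card: each simulator step consumes exactly one action, and the policy predicts a fresh chunk only after all n_action_steps=50 queued actions have been executed. chunks_needed is a hypothetical helper name.

```python
import math

# (episode, success, steps) copied from the table above.
results = [
    (0, False, 400), (1, True, 290), (2, False, 400), (3, True, 290),
    (4, False, 400), (5, True, 290), (6, False, 400), (7, True, 270),
    (8, False, 400), (9, True, 299),
]

N_ACTION_STEPS = 50  # from the training config


def chunks_needed(steps: int, n_action_steps: int = N_ACTION_STEPS) -> int:
    """Number of chunk predictions needed to cover `steps` actions (assumed 1 action per step)."""
    return math.ceil(steps / n_action_steps)


success_steps = [steps for _, ok, steps in results if ok]
success_rate = len(success_steps) / len(results)
mean_success_steps = sum(success_steps) / len(success_steps)

print(f"success rate: {success_rate:.0%}")
print(f"mean steps over successful episodes: {mean_success_steps:.1f}")
print(f"chunk predictions for a 400-step episode: {chunks_needed(400)}")
```

Under these assumptions, a full 400-step episode involves 8 chunk predictions, so the policy runs open-loop for 50 steps at a time between observations.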

Usage

from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Load the fine-tuned checkpoint from the Hugging Face Hub.
policy = SmolVLAPolicy.from_pretrained("arrow-hf/smolvla-robotwin-stack-bowls-two-50pct")

At inference, feed native-resolution images (e.g., 240x320 from the RoboTwin D435 cameras). The model's internal resize_with_pad letterboxes them to the target shape, so no external resize is needed.
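
To see what the letterbox does to a native frame, the geometry can be worked out by hand. The sketch below only computes the resized size and total padding for an aspect-preserving fit; it does not reproduce the model's resize_with_pad implementation, and it makes no claim about which sides receive the padding. letterbox_geometry is a hypothetical helper name.

```python
from typing import Tuple


def letterbox_geometry(height: int, width: int,
                       target_height: int = 512,
                       target_width: int = 512) -> Tuple[int, int, int, int]:
    """Return (new_height, new_width, pad_height, pad_width) for an aspect-preserving fit."""
    # Compare aspect ratios with integer math to avoid floating-point rounding.
    if width * target_height >= height * target_width:
        # Width is the limiting side: fill the target width.
        new_w = target_width
        new_h = height * target_width // width
    else:
        # Height is the limiting side: fill the target height.
        new_h = target_height
        new_w = width * target_height // height
    return new_h, new_w, target_height - new_h, target_width - new_w


new_h, new_w, pad_h, pad_w = letterbox_geometry(240, 320)
print(new_h, new_w, pad_h, pad_w)  # 384 512 128 0
```

A 240x320 frame is scaled by 1.6 to 384x512, leaving 128 rows of padding to fill the 512x512 target.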
