SmolVLA fine-tuned on RoboTwin stack_bowls_two

A SmolVLA policy (450M parameters) fine-tuned on the dual-arm bowl-stacking task (stack_bowls_two) in the RoboTwin 2.0 simulator.

Training data

Source: RoboTwin stack_bowls_two task, demo_clean config (D435 cameras at 320x240, width x height).
Episodes: 223 (collected via collect_data.sh stack_bowls_two demo_clean).
Frames: ~47k @ effective 16.67 Hz.
Images: native 240x320 (height x width). No offline resize; the model's resize_with_pad=[512,512] applies an aspect-preserving letterbox.
State / Action: 16-D dual-arm EEF (pos3 + quat4 + grip1 per arm).
Language instruction: fixed "stack the bowls" for all episodes (strategy A: one shared instruction, so the policy cannot learn a spurious 1:1 mapping from instruction text to episode).
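
The 16-D state/action layout can be illustrated with a small sketch. It assumes the left arm occupies the first 8 values and the right arm the last 8, with each arm ordered as position, quaternion, gripper. The card states only the per-arm composition, so the arm order and the quaternion component order are assumptions, and `split_state` is a hypothetical helper rather than part of RoboTwin or LeRobot.

```python
from typing import Dict, List, Sequence

ARM_DIM = 8  # pos3 + quat4 + grip1 per arm, as stated in this card


def split_arm(vec: Sequence[float]) -> Dict[str, List[float]]:
    """Split one 8-D arm vector into position, quaternion, and gripper parts."""
    if len(vec) != ARM_DIM:
        raise ValueError(f"expected {ARM_DIM} values per arm, got {len(vec)}")
    return {
        "pos": list(vec[0:3]),   # end-effector position (3 values)
        "quat": list(vec[3:7]),  # end-effector orientation (4 values, order assumed)
        "grip": list(vec[7:8]),  # gripper command (1 value)
    }


def split_state(state: Sequence[float]) -> Dict[str, Dict[str, List[float]]]:
    """Split a 16-D dual-arm vector into per-arm components.

    ASSUMPTION: left arm first, right arm second. Check the dataset
    metadata before relying on this ordering.
    """
    if len(state) != 2 * ARM_DIM:
        raise ValueError(f"expected {2 * ARM_DIM} values, got {len(state)}")
    return {
        "left": split_arm(state[:ARM_DIM]),
        "right": split_arm(state[ARM_DIM:]),
    }


example = [float(i) for i in range(16)]
parts = split_state(example)
print(parts["right"]["quat"])  # [11.0, 12.0, 13.0, 14.0]
```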

Training config

Batch size 16, 20000 steps, bf16, cosine learning-rate schedule with 1000 warmup steps and 20000 decay steps.
Base: lerobot/smolvla_base (full fine-tune, vision encoder unfrozen).
chunk_size=50, n_action_steps=50.
rename_map: dual_cam_global -> camera1, cam_wrist_65 -> camera2, cam_wrist_75 -> camera3.
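
As a rough illustration of the training budget and schedule shape, the sketch below computes how many samples the run sees and evaluates one common formulation of linear warmup followed by cosine decay. The peak and final learning rates are hypothetical placeholders (the card does not state them), and `lr_at` is an illustrative function, not the LeRobot scheduler.

```python
import math

# Values taken from this card
BATCH_SIZE = 16
TOTAL_STEPS = 20_000
WARMUP_STEPS = 1_000
DECAY_STEPS = 20_000
NUM_FRAMES = 47_000  # approximate ("~47k")

# ASSUMPTION: placeholder learning rates, not reported in the card
PEAK_LR = 1e-4
FINAL_LR = 2.5e-6


def lr_at(step: int,
          peak_lr: float = PEAK_LR,
          final_lr: float = FINAL_LR,
          warmup_steps: int = WARMUP_STEPS,
          decay_steps: int = DECAY_STEPS) -> float:
    """Linear warmup to peak_lr, then cosine decay to final_lr at decay_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1]
    span = max(1, decay_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine


samples_seen = BATCH_SIZE * TOTAL_STEPS
approx_epochs = samples_seen / NUM_FRAMES

print(samples_seen)             # 320000
print(round(approx_epochs, 1))  # 6.8
for step in (0, 500, 1_000, 10_000, 20_000):
    print(step, lr_at(step))
```

Under these numbers the run makes roughly seven passes over the ~47k frames.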

Evaluation (RoboTwin sim, max_steps=400, 10 episodes)

Success rate: 5/10 (50%) with task_text="stack the bowls" and --skip_resize.

Episode	Result	Steps
0	FAIL	400 (timeout)
1	SUCCESS	290
2	FAIL	400 (timeout)
3	SUCCESS	290
4	FAIL	400 (timeout)
5	SUCCESS	290
6	FAIL	400 (timeout)
7	SUCCESS	270
8	FAIL	400 (timeout)
9	SUCCESS	299
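
The summary figures can be recomputed from the per-episode table. The sketch below transcribes the table by hand and derives the success rate and the mean episode length over successful runs; the variable names are illustrative only.

```python
# (episode, success, steps) transcribed from the table above
results = [
    (0, False, 400),
    (1, True, 290),
    (2, False, 400),
    (3, True, 290),
    (4, False, 400),
    (5, True, 290),
    (6, False, 400),
    (7, True, 270),
    (8, False, 400),
    (9, True, 299),
]

success_steps = [steps for _, ok, steps in results if ok]
failure_steps = [steps for _, ok, steps in results if not ok]

success_rate = len(success_steps) / len(results)
mean_success_steps = sum(success_steps) / len(success_steps)

print(f"success rate: {success_rate:.0%}")                    # success rate: 50%
print(f"mean steps on success: {mean_success_steps:.1f}")     # mean steps on success: 287.8
print(f"failures at step limit: {failure_steps.count(400)}")  # failures at step limit: 5
```

With only 10 episodes, the 50% figure carries wide uncertainty and is best read as a coarse estimate.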

Usage

from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy
policy = SmolVLAPolicy.from_pretrained("arrow-hf/smolvla-robotwin-stack-bowls-two-50pct")

At inference, feed native-resolution images (e.g., 240x320 from the RoboTwin D435 cameras). The model's internal resize_with_pad letterboxes them to the target shape, so no external resize is needed.
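
To make the letterbox geometry concrete, here is a minimal sketch of the arithmetic behind an aspect-preserving resize into a 512x512 target. `letterbox_geometry` is a hypothetical helper written for illustration. It only computes the resized size and the total padding, and makes no claim about which side of the image the model's resize_with_pad places the padding on.

```python
from typing import Tuple


def letterbox_geometry(height: int, width: int,
                       target_h: int = 512,
                       target_w: int = 512) -> Tuple[int, int, int, int]:
    """Return (new_h, new_w, pad_h, pad_w) for an aspect-preserving fit.

    The image is scaled by a single factor so that it fits inside the
    target; the remaining rows or columns are filled with padding.
    """
    # The more constraining dimension decides the scale factor
    ratio = max(width / target_w, height / target_h)
    new_h = int(height / ratio)
    new_w = int(width / ratio)
    pad_h = max(0, target_h - new_h)
    pad_w = max(0, target_w - new_w)
    return new_h, new_w, pad_h, pad_w


# Native RoboTwin D435 frame: 240 rows x 320 columns
print(letterbox_geometry(240, 320))  # (384, 512, 128, 0)
```

For a 240x320 frame, the image is scaled by 1.6 to 384x512 and 128 rows of padding fill the remaining height.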
