SmolVLA fine-tuned on RoboTwin stack_bowls_two
SmolVLA policy (450M params) fine-tuned on the dual-arm bowl-stacking task in the RoboTwin 2.0 simulator.
Training data
- Source: RoboTwin `stack_bowls_two` task, `demo_clean` config (D435 cameras at 320x240).
- Episodes: 223 (collected via `collect_data.sh stack_bowls_two demo_clean`).
- Frames: ~47k @ effective 16.67 Hz.
- Images: native 240x320 (no offline resize; aspect-preserving letterbox via model's `resize_with_pad=[512,512]`).
- State / Action: 16-D dual-arm EEF (pos3 + quat4 + grip1 per arm).
- Language instruction: fixed `"stack the bowls"` for all episodes (strategy A: single instruction to avoid spurious 1:1 hash).
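The 16-D state / action layout (pos3 + quat4 + grip1 per arm) can be unpacked along these lines. This is a minimal sketch: the field order within each arm, the first-arm-then-second-arm ordering, and the names `ArmState` and `split_dual_arm` are assumptions for illustration, so check the dataset metadata before relying on them.

```python
from typing import NamedTuple, Sequence, Tuple

ARM_DIM = 3 + 4 + 1  # pos3 + quat4 + grip1, as listed in the card


class ArmState(NamedTuple):
    position: Tuple[float, ...]    # 3 values
    quaternion: Tuple[float, ...]  # 4 values (component order is not stated in the card)
    gripper: float                 # 1 value


def split_dual_arm(vec: Sequence[float]) -> Tuple[ArmState, ArmState]:
    """Split a 16-D vector into two per-arm records (assumed layout)."""
    if len(vec) != 2 * ARM_DIM:
        raise ValueError(f"expected {2 * ARM_DIM} values, got {len(vec)}")
    arms = []
    for start in (0, ARM_DIM):
        chunk = vec[start:start + ARM_DIM]
        arms.append(ArmState(tuple(chunk[0:3]), tuple(chunk[3:7]), chunk[7]))
    return arms[0], arms[1]


first_arm, second_arm = split_dual_arm(list(range(16)))
print(first_arm)
print(second_arm)
```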
Training config
- Batch size 16, 20000 steps, bf16, cosine warmup 1000 / decay 20000.
- Base: `lerobot/smolvla_base` (full fine-tune, vision encoder unfrozen).
- chunk_size=50, n_action_steps=50.
- rename_map: `dual_cam_global -> camera1, cam_wrist_65 -> camera2, cam_wrist_75 -> camera3`.
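When feeding your own observations to the policy, the camera names need the same renaming that was used in training. A minimal sketch of applying the rename_map to a dictionary of observations is shown below; `rename_keys` is a hypothetical helper, and the handling of a dotted prefix such as `observation.images.` is an assumption about how the keys may be named in your pipeline.

```python
# Camera renaming taken from the training config above.
RENAME_MAP = {
    "dual_cam_global": "camera1",
    "cam_wrist_65": "camera2",
    "cam_wrist_75": "camera3",
}


def rename_keys(batch: dict, rename_map: dict = RENAME_MAP) -> dict:
    """Return a copy of `batch` with camera names replaced.

    Only the last dotted component of each key is matched, so both
    "cam_wrist_65" and "observation.images.cam_wrist_65" are renamed
    (the prefixed form is an assumption, not confirmed by the card).
    """
    renamed = {}
    for key, value in batch.items():
        prefix, _, leaf = key.rpartition(".")
        new_leaf = rename_map.get(leaf, leaf)
        renamed[f"{prefix}.{new_leaf}" if prefix else new_leaf] = value
    return renamed


example = {
    "observation.images.dual_cam_global": "global frame",
    "observation.images.cam_wrist_65": "wrist frame A",
    "observation.images.cam_wrist_75": "wrist frame B",
    "observation.state": "16-D state",
}
print(sorted(rename_keys(example)))
```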
Evaluation (RoboTwin sim, max_steps=400, 10 episodes)
Success rate: 5/10 (50%) with task_text="stack the bowls" and --skip_resize.
| Episode | Result | Steps |
|---|---|---|
| 0 | FAIL | 400 (timeout) |
| 1 | SUCCESS | 290 |
| 2 | FAIL | 400 |
| 3 | SUCCESS | 290 |
| 4 | FAIL | 400 |
| 5 | SUCCESS | 290 |
| 6 | FAIL | 400 |
| 7 | SUCCESS | 270 |
| 8 | FAIL | 400 |
| 9 | SUCCESS | 299 |
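The table can be summarized with a few lines of arithmetic. The sketch below recomputes the success rate and the mean step count of successful episodes, and adds a Wilson score interval as a rough indication of how uncertain a 10-episode estimate is; the interval is an illustrative addition, not something reported by the evaluation itself.

```python
import math

# (episode, success, steps) copied from the evaluation table.
EPISODES = [
    (0, False, 400), (1, True, 290), (2, False, 400), (3, True, 290),
    (4, False, 400), (5, True, 290), (6, False, 400), (7, True, 270),
    (8, False, 400), (9, True, 299),
]


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


success_steps = [steps for _, ok, steps in EPISODES if ok]
success_rate = len(success_steps) / len(EPISODES)
mean_success_steps = sum(success_steps) / len(success_steps)
low, high = wilson_interval(len(success_steps), len(EPISODES))

print(f"success rate: {success_rate:.0%}")
print(f"mean steps over successful episodes: {mean_success_steps:.1f}")
print(f"approx. 95% interval on success rate: {low:.2f} to {high:.2f}")
```

With only 10 episodes the interval is wide, so a longer evaluation run would be needed to compare this checkpoint against others with confidence.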
Usage
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy
policy = SmolVLAPolicy.from_pretrained("arrow-hf/smolvla-robotwin-stack-bowls-two-50pct")
At inference, feed native-resolution images (e.g., 240x320 from RoboTwin D435). The model's internal `resize_with_pad` letterboxes them to the target shape, so no external resize is needed.
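To see what letterboxing does to a native 240x320 frame, the geometry can be worked out directly. The sketch below assumes an aspect-preserving resize followed by padding up to the target size; `letterbox_geometry` is a hypothetical helper written for this illustration, and the placement and value of the padding are left to the model's own implementation.

```python
def letterbox_geometry(height: int, width: int,
                       target_height: int, target_width: int):
    """Return (resized_h, resized_w, pad_h, pad_w) for an aspect-preserving fit.

    The image is scaled by the largest factor that keeps both sides within
    the target, and the remaining rows / columns are filled with padding.
    """
    ratio = max(width / target_width, height / target_height)
    resized_h = int(height / ratio)
    resized_w = int(width / ratio)
    pad_h = max(0, target_height - resized_h)
    pad_w = max(0, target_width - resized_w)
    return resized_h, resized_w, pad_h, pad_w


# Native RoboTwin frame (240 rows x 320 columns) into the 512x512 target.
print(letterbox_geometry(240, 320, 512, 512))
```

Under these assumptions a 240x320 frame is scaled by 1.6 to 384x512, and 128 rows of padding fill the rest of the 512x512 input.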