smolVLA-UR7e-CaP_arrange_block_10fps

This repository contains a LeRobot SmolVLA policy fine-tuned for a UR7e block-arrangement task. The policy was trained on demonstrations from CoRL2026-CSI/UR7e-CaP_arrange_block_100epi_10fps, where the robot arranges red, green, and blue blocks along a purple line from left to right.

The checkpoint is intended for research use with LeRobot-compatible inference pipelines. No real-robot or offline success-rate evaluation is included in this model card; the reported metrics are training logs only.

Model Details

  • Model type: SmolVLA vision-language-action policy
  • Base policy: lerobot/smolvla_base
  • VLM backbone: HuggingFaceTB/SmolVLM2-500M-Video-Instruct
  • Robot: UR7e
  • Task: Arrange red, green, blue blocks along a purple line from left to right
  • Training framework: LeRobot
  • Checkpoint format: safetensors
  • License: Apache 2.0

Dataset

The policy was trained on CoRL2026-CSI/UR7e-CaP_arrange_block_100epi_10fps, a LeRobot dataset collected for the UR7e block-arrangement task.

Dataset summary:

| Field | Value |
| --- | --- |
| Robot type | ur7e |
| Episodes | 100 |
| Frames | 47,116 |
| Dataset FPS | 10 |
| Tasks | 1 |
| Split | train: 0:100 |
| Cameras | RealSense wrist and top-view RGB video |
| Camera resolution | 480 x 640 RGB video |
| Dataset state/action vectors | 7D joint/gripper vector |
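
The frame and episode counts in the summary above imply an average episode length. A minimal sketch of that arithmetic, using only the numbers from the summary (the variable names are illustrative and not part of the dataset):

```python
# Values copied from the dataset summary; variable names are illustrative.
total_frames = 47_116
num_episodes = 100
fps = 10

frames_per_episode = total_frames / num_episodes  # average frames per episode
seconds_per_episode = frames_per_episode / fps    # average episode duration
total_minutes = total_frames / fps / 60           # total recorded time

print(f"{frames_per_episode:.2f} frames per episode")  # 471.16
print(f"{seconds_per_episode:.1f} s per episode")      # 47.1
print(f"{total_minutes:.1f} min of demonstrations")    # 78.5
```

These are averages only; individual episode lengths are not listed in this card.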

The dataset includes additional skill annotations such as skill.type, skill.progress, target joint positions, target Cartesian poses, and natural-language skill text. The policy checkpoint uses the LeRobot preprocessing pipeline saved in this repository.

Policy Inputs and Outputs

The saved policy configuration expects the following model features after preprocessing.

Inputs:

  • observation.state: 6D state feature
  • observation.images.camera1: wrist camera, resized/padded for SmolVLA
  • observation.images.camera2: top-view camera, resized/padded for SmolVLA
  • observation.images.camera3: additional visual input slot; the preprocessor mapping listed in this section assigns no dataset camera to it
  • observation.images.empty_camera_0: empty camera placeholder

Output:

  • action: 7D joint/gripper action vector

The included policy_preprocessor.json maps dataset camera names to model camera names:

  • observation.images.realsense_wrist -> observation.images.camera1
  • observation.images.realsense_topview -> observation.images.camera2
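
The renaming above can be pictured as a plain key substitution on the observation dictionary. The sketch below is illustrative only: `rename_observation_keys` is a hypothetical helper, not the LeRobot processor step, and the placeholder strings stand in for real image tensors and state vectors.

```python
# Mapping copied from the list above (dataset key -> model key).
RENAME_MAP = {
    "observation.images.realsense_wrist": "observation.images.camera1",
    "observation.images.realsense_topview": "observation.images.camera2",
}


def rename_observation_keys(observation: dict) -> dict:
    """Return a copy of `observation` with dataset camera keys renamed.

    Hypothetical helper for illustration; keys not in RENAME_MAP pass through.
    """
    return {RENAME_MAP.get(key, key): value for key, value in observation.items()}


# Placeholder values stand in for image tensors and the state vector.
raw = {
    "observation.images.realsense_wrist": "wrist_frame",
    "observation.images.realsense_topview": "top_frame",
    "observation.state": "state_vector",
}
renamed = rename_observation_keys(raw)
for key in sorted(renamed):
    print(key, "<-", renamed[key])
```

When loading the policy through the saved preprocessor, this renaming is applied by the pipeline itself; a manual version like this is only useful for checking that a local camera setup produces the expected keys.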

State and action features use mean/std normalization. Visual features use identity normalization. The postprocessor unnormalizes the action output and moves it back to CPU.
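
As a rough illustration of mean/std normalization and the matching unnormalization of actions, the sketch below round-trips a 7D vector. The statistics and the action are made-up placeholders (the real values are stored in the normalizer safetensors files in this repository), and the small `EPS` term is an assumption, not a confirmed detail of the saved pipeline.

```python
EPS = 1e-8  # ASSUMPTION: small constant guarding against division by zero


def normalize(x, mean, std):
    """Map raw values to zero-mean, unit-variance space."""
    return [(xi - m) / (s + EPS) for xi, m, s in zip(x, mean, std)]


def unnormalize(z, mean, std):
    """Inverse of `normalize`: map model-space values back to raw units."""
    return [zi * (s + EPS) + m for zi, m, s in zip(z, mean, std)]


# Made-up statistics for a 7D joint/gripper vector (placeholders only).
mean = [0.0, -1.2, 1.5, -1.0, -1.5, 0.1, 0.5]
std = [0.3, 0.2, 0.4, 0.25, 0.1, 0.35, 0.5]
raw_action = [0.1, -1.0, 1.7, -0.9, -1.45, 0.0, 1.0]

normalized = normalize(raw_action, mean, std)
restored = unnormalize(normalized, mean, std)

# The round trip should reproduce the raw action up to floating-point error.
max_error = max(abs(a - b) for a, b in zip(raw_action, restored))
print(f"max round-trip error: {max_error:.2e}")
```

A round-trip check of this kind is one way to confirm that action unnormalization is wired correctly before sending commands to the robot.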

Training Details

The final uploaded checkpoint is from step 9203.

| Setting | Value |
| --- | --- |
| Training steps | 9,203 |
| Approx. epochs | 50 |
| Batch size | 128 |
| Gradient accumulation | 1 |
| Seed | 1000 |
| Optimizer | AdamW |
| Peak learning rate | 1e-4 |
| Weight decay | 1e-10 |
| Gradient clipping | 10.0 |
| Scheduler | Cosine decay with warmup |
| Warmup steps | 1,000 |
| Decay steps | 30,000 |
| Final decay LR | 2.5e-6 |
| AMP | Disabled |
| PEFT | Disabled |
| Vision encoder | Frozen |
| Expert-only training | Enabled |
| State projection training | Enabled |
| Action chunk size | 50 |
| Observation steps | 1 |
| Action steps | 50 |
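
The step count, epoch count, and action-step setting above can be related with simple arithmetic. The sketch below rests on two assumptions that this card does not state explicitly: an effective global batch size of 256 (inferred from the `gbs256` tag in the checkpoint path under Provenance, for example two processes with batch size 128 each), and action execution at the dataset rate of 10 FPS.

```python
import math

# Values from the dataset summary and the training settings.
total_frames = 47_116
epochs = 50
per_process_batch = 128

# ASSUMPTION: two processes, giving a global batch of 256 ("gbs256" in the path).
assumed_num_processes = 2
global_batch = per_process_batch * assumed_num_processes

# Optimizer steps needed to see every frame `epochs` times.
steps = math.ceil(total_frames * epochs / global_batch)
print(steps)  # 9203

# ASSUMPTION: actions are executed at the dataset FPS during rollout.
n_action_steps = 50
fps = 10
seconds_per_chunk = n_action_steps / fps
print(seconds_per_chunk)  # 5.0
```

Under these assumptions the result matches the reported 9,203 steps, and each predicted chunk covers about 5 seconds of motion before the policy is queried again.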

Image augmentation was enabled during training with up to two randomly ordered transforms per sample:

  • brightness jitter: [0.8, 1.2]
  • contrast jitter: [0.8, 1.2]
  • saturation jitter: [0.5, 1.5]
  • hue jitter: [-0.05, 0.05]
  • sharpness jitter: [0.5, 1.5]
  • random affine rotation: [-5, 5] degrees
  • random affine translation: 0.05
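
To make "up to two randomly ordered transforms per sample" concrete, the sketch below samples a per-image augmentation plan from the ranges listed above. It only selects transform names and parameter values; it does not touch image data and is not the LeRobot implementation. Uniform selection and uniform parameter sampling are assumptions, and the affine translation is left out because its sampling convention is not stated here.

```python
import random

# Ranges copied from the list above; key names are illustrative.
JITTER_RANGES = {
    "brightness": (0.8, 1.2),
    "contrast": (0.8, 1.2),
    "saturation": (0.5, 1.5),
    "hue": (-0.05, 0.05),
    "sharpness": (0.5, 1.5),
    "affine_rotation_deg": (-5.0, 5.0),
}
MAX_TRANSFORMS = 2  # "up to two" transforms per sample


def sample_augmentation_plan(rng: random.Random) -> list:
    """Pick distinct transforms in random order with a sampled strength each.

    ASSUMPTION: transforms are chosen uniformly without replacement and
    strengths are drawn uniformly from the listed ranges.
    """
    k = min(MAX_TRANSFORMS, len(JITTER_RANGES))
    names = rng.sample(list(JITTER_RANGES), k)
    return [(name, rng.uniform(*JITTER_RANGES[name])) for name in names]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
plan = sample_augmentation_plan(rng)
for name, value in plan:
    print(f"{name}: {value:.3f}")
```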

Training logs:

| Metric | Value |
| --- | --- |
| Final logged training loss | 0.010 |
| Mean training loss over last 20 logged points | 0.01045 |
| Final logged gradient norm | 0.101 |
| Final logged learning rate | 2.5e-6 |

These values are training-loop logs only and should not be interpreted as task success rates.
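
The final logged learning rate equals the configured final decay LR, even though training stopped at step 9,203 and the configured decay length is 30,000 steps. One possible reading, stated here as an assumption rather than a confirmed detail of this run, is that the scheduler shortened its decay horizon to the actual number of training steps. The generic cosine-with-warmup sketch below (not the exact LeRobot scheduler) compares the two horizons.

```python
import math


def lr_at(step, peak_lr=1e-4, final_lr=2.5e-6, warmup_steps=1_000, decay_steps=30_000):
    """Generic linear warmup followed by cosine decay from peak_lr to final_lr.

    Illustrative only; the scheduler used in training may differ in detail.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (decay_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine


last_step = 9_203
lr_full_horizon = lr_at(last_step)                         # decay over 30,000 steps
lr_short_horizon = lr_at(last_step, decay_steps=last_step)  # ASSUMED rescaled horizon

print(f"30,000-step horizon: {lr_full_horizon:.2e}")
print(f"9,203-step horizon:  {lr_short_horizon:.2e}")
```

With the full 30,000-step horizon this sketch is still at about 8e-5 at the last step, while the shortened horizon lands on the 2.5e-6 floor, which is the value in the training logs.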

How to Use

Install LeRobot and load the policy from the Hub:

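import torch
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Download the policy weights and configuration from the Hub.
policy = SmolVLAPolicy.from_pretrained(
    "CoRL2026-CSI/smolVLA-UR7e-CaP_arrange_block_10fps"
)
# Use the GPU when one is available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
policy.to(device)
policy.eval()  # switch to inference mode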

For robot rollout or evaluation, use the LeRobot CLI or your existing UR7e control stack with --policy.path pointing to this repository:

lerobot-record \
  --policy.path=CoRL2026-CSI/smolVLA-UR7e-CaP_arrange_block_10fps \
  --dataset.repo_id=CoRL2026-CSI/eval_smolVLA-UR7e-CaP_arrange_block_10fps

Adjust the robot, camera, and dataset arguments to match the local UR7e deployment setup.

Files

This repository contains:

  • model.safetensors: policy weights
  • config.json: policy configuration
  • train_config.json: LeRobot training configuration
  • policy_preprocessor.json: saved inference preprocessing pipeline
  • policy_preprocessor_step_5_normalizer_processor.safetensors: normalization state
  • policy_postprocessor.json: saved inference postprocessing pipeline
  • policy_postprocessor_step_0_unnormalizer_processor.safetensors: action unnormalization state

Evaluation

No evaluation run is reported for this checkpoint. The training configuration had eval_freq=0, so no offline evaluation videos, simulated rollouts, or real-robot success metrics were produced as part of the training job.

Recommended evaluation before deployment:

  • Run held-out demonstrations or manually selected validation episodes if available.
  • Run short supervised sanity checks to confirm camera mapping, state dimensions, and action unnormalization.
  • Start with low-speed, closely supervised real-robot rollouts.
  • Report success rate, number of trials, reset conditions, and failure modes separately from training loss.
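
When reporting a success rate from a small number of rollouts, an interval alongside the point estimate helps convey how uncertain it is. The sketch below computes a Wilson score interval; it is one common choice and not something this card prescribes. The counts in the example are placeholders for illustration and are not results for this checkpoint.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# PLACEHOLDER counts, not measured results for this policy.
example_successes = 14
example_trials = 20
low, high = wilson_interval(example_successes, example_trials)
print(f"success rate {example_successes / example_trials:.2f}, "
      f"95% interval [{low:.2f}, {high:.2f}] over {example_trials} trials")
```

Reporting the number of trials and the interval together makes it easier to compare runs with different trial counts.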

Limitations and Safety

  • This policy is specialized to the recorded UR7e setup, camera placement, workspace geometry, block colors, and purple-line arrangement task.
  • Performance may degrade if camera extrinsics, lighting, object appearance, workspace layout, robot calibration, or control frequency differ from the training data.
  • The model card does not claim real-robot success rate. Validate the policy in the target environment before autonomous operation.
  • Use appropriate robot safety limits, emergency stop procedures, workspace supervision, and conservative speed/force settings during rollout.

Provenance

Training completed on 2026-05-10 at step 9203. The model weights were uploaded to this repository on 2026-05-11. The final checkpoint used for upload was:

lerobot/outputs/train/smolvla_ur7e_arrange_block_100epi_10fps_gbs256_ep50_20260510_112403/checkpoints/009203/pretrained_model

The first automatic push at the end of distributed training did not finish because the upload stalled while other ranks were waiting at a distributed barrier. The final repository upload was completed separately from the training process.

Citation

If you use this checkpoint, cite LeRobot and SmolVLA where appropriate:

@software{lerobot,
  title = {LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch},
  author = {Hugging Face},
  url = {https://github.com/huggingface/lerobot},
  year = {2024}
}
@misc{smolvla,
  title = {SmolVLA: A compact vision-language-action model for robotics},
  url = {https://huggingface.co/papers/2506.01844}
}