smolVLA-UR7e-CaP_arrange_block_10fps

This repository contains a LeRobot SmolVLA policy fine-tuned for a UR7e block-arrangement task. The policy was trained on demonstrations from CoRL2026-CSI/UR7e-CaP_arrange_block_100epi_10fps, where the robot arranges red, green, and blue blocks along a purple line from left to right.

The checkpoint is intended for research use with LeRobot-compatible inference pipelines. No real-robot or offline success-rate evaluation is included in this model card; the reported metrics are training logs only.

Model Details

Model type: SmolVLA vision-language-action policy
Base policy: lerobot/smolvla_base
VLM backbone: HuggingFaceTB/SmolVLM2-500M-Video-Instruct
Robot: UR7e
Task: Arrange red, green, and blue blocks along a purple line from left to right
Training framework: LeRobot
Checkpoint format: safetensors
License: Apache 2.0

Dataset

The policy was trained on CoRL2026-CSI/UR7e-CaP_arrange_block_100epi_10fps, a LeRobot dataset collected for the UR7e block-arrangement task.

Dataset summary:

Field	Value
Robot type	`ur7e`
Episodes	100
Frames	47,116
Dataset FPS	10
Tasks	1
Split	`train: 0:100`
Cameras	RealSense wrist and top-view RGB video
Camera resolution	480 x 640 RGB video
Dataset state/action vectors	7D joint/gripper vector

The dataset includes additional skill annotations such as skill.type, skill.progress, target joint positions, target Cartesian poses, and natural-language skill text. The policy checkpoint uses the LeRobot preprocessing pipeline saved in this repository.

Policy Inputs and Outputs

After preprocessing, the saved policy configuration expects the following model features.

Inputs:

observation.state: 6D state feature
observation.images.camera1: wrist camera, resized/padded for SmolVLA
observation.images.camera2: top-view camera, resized/padded for SmolVLA
observation.images.camera3: visual input slot defined in the config; the camera mapping listed below does not assign a dataset camera to it
observation.images.empty_camera_0: empty camera placeholder

Output:

action: 7D joint/gripper action vector

The included policy_preprocessor.json maps dataset camera names to model camera names:

observation.images.realsense_wrist -> observation.images.camera1
observation.images.realsense_topview -> observation.images.camera2

State and action features use mean/std normalization. Visual features use identity normalization. The postprocessor unnormalizes the action output and moves it back to CPU.
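
The rename and normalization steps described above can be illustrated with a minimal sketch in plain Python. This is not the LeRobot processor implementation: the helper names `rename_cameras`, `normalize`, and `unnormalize` are hypothetical, and the statistics are made-up placeholders (the real ones are stored in the normalizer safetensors files in this repository).

```python
# Camera renames as listed in this model card.
CAMERA_RENAMES = {
    "observation.images.realsense_wrist": "observation.images.camera1",
    "observation.images.realsense_topview": "observation.images.camera2",
}


def rename_cameras(observation):
    """Return a copy of the observation dict with dataset camera keys renamed."""
    return {CAMERA_RENAMES.get(key, key): value for key, value in observation.items()}


def normalize(values, mean, std, eps=1e-8):
    """Mean/std normalization, element-wise: (x - mean) / (std + eps)."""
    return [(x - m) / (s + eps) for x, m, s in zip(values, mean, std)]


def unnormalize(values, mean, std, eps=1e-8):
    """Inverse of normalize, element-wise: x * (std + eps) + mean."""
    return [x * (s + eps) + m for x, m, s in zip(values, mean, std)]


# Placeholder statistics for a 3-element vector (NOT the checkpoint's real stats).
mean = [0.1, -0.5, 1.2]
std = [0.2, 0.4, 0.1]

raw_action = [0.3, -0.1, 1.0]
roundtrip = unnormalize(normalize(raw_action, mean, std), mean, std)
renamed = rename_cameras({"observation.images.realsense_wrist": "wrist_frame"})
```

A round trip through `normalize` and `unnormalize` should return the original values, which is a quick way to confirm that the same statistics are used on both sides of the policy.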

Training Details

The final uploaded checkpoint is from step 9203.

Setting	Value
Training steps	9,203
Approx. epochs	50
Batch size	128
Gradient accumulation	1
Seed	1000
Optimizer	AdamW
Peak learning rate	`1e-4`
Weight decay	`1e-10`
Gradient clipping	`10.0`
Scheduler	Cosine decay with warmup
Warmup steps	1,000
Decay steps	30,000
Final decay LR	`2.5e-6`
AMP	Disabled
PEFT	Disabled
Vision encoder	Frozen
Expert-only training	Enabled
State projection training	Enabled
Action chunk size	50
Observation steps	1
Action steps	50
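
The step count and chunk duration implied by the settings above can be checked with a short calculation. The sketch assumes that the listed batch size of 128 is per process and that two processes were used, giving a global batch size of 256; this is inferred from the `gbs256` fragment of the checkpoint directory name and the mention of distributed training under Provenance, not stated explicitly in the table. It also assumes actions are executed at the dataset rate of 10 FPS.

```python
import math

FRAMES = 47_116          # dataset frames, from the dataset summary
EPOCHS = 50              # approximate epochs, from the training table
PER_PROCESS_BATCH = 128  # batch size, from the training table
NUM_PROCESSES = 2        # ASSUMPTION: inferred from "gbs256" in the checkpoint path
DATASET_FPS = 10         # from the dataset summary
CHUNK_SIZE = 50          # action chunk size, from the training table

# Effective number of samples consumed per optimizer step.
global_batch = PER_PROCESS_BATCH * NUM_PROCESSES

# Optimizer steps needed to see every frame EPOCHS times.
steps = math.ceil(FRAMES * EPOCHS / global_batch)

# Wall-clock span covered by one predicted action chunk.
chunk_seconds = CHUNK_SIZE / DATASET_FPS
```

Under these assumptions `steps` evaluates to 9203, matching the uploaded checkpoint step, and one 50-action chunk covers 5 seconds of motion.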

Image augmentation was enabled during training with up to two randomly ordered transforms per sample:

brightness jitter: [0.8, 1.2]
contrast jitter: [0.8, 1.2]
saturation jitter: [0.5, 1.5]
hue jitter: [-0.05, 0.05]
sharpness jitter: [0.5, 1.5]
random affine rotation: [-5, 5] degrees
random affine translation: 0.05
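
The sampling scheme can be sketched without touching any pixels. The ranges below are copied from the list above; everything else is an assumption made for illustration: the transforms are drawn with equal probability, exactly two are drawn per sample (one reading of "up to two"), parameters are sampled uniformly, and the translation value is treated as a symmetric fraction of the image size. The function name `sample_augmentation_plan` is hypothetical and is not part of LeRobot.

```python
import random

# Parameter ranges copied from the augmentation list in this model card.
TRANSFORM_RANGES = {
    "brightness": (0.8, 1.2),
    "contrast": (0.8, 1.2),
    "saturation": (0.5, 1.5),
    "hue": (-0.05, 0.05),
    "sharpness": (0.5, 1.5),
    "affine_rotation_deg": (-5.0, 5.0),
    "affine_translation": (-0.05, 0.05),  # ASSUMPTION: symmetric fraction of image size
}
MAX_TRANSFORMS = 2


def sample_augmentation_plan(rng):
    """Pick MAX_TRANSFORMS distinct transforms in random order, each with a uniform parameter."""
    # random.Random.sample returns the selection in random order, without repeats.
    names = rng.sample(list(TRANSFORM_RANGES), k=MAX_TRANSFORMS)
    return [(name, rng.uniform(*TRANSFORM_RANGES[name])) for name in names]


plan = sample_augmentation_plan(random.Random(0))
```

Each entry of `plan` is a `(transform_name, parameter)` pair that an image pipeline could then apply in the listed order.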

Training logs:

Metric	Value
Final logged training loss	`0.010`
Mean training loss over last 20 logged points	`0.01045`
Final logged gradient norm	`0.101`
Final logged learning rate	`2.5e-6`

These values are training-loop logs only and should not be interpreted as task success rates.
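
The final logged learning rate equals the configured final decay LR even though training stopped at step 9,203, well before the configured 30,000 decay steps. One explanation consistent with the log is that the schedule horizon was rescaled to the run length; this is an assumption, not something stated in this model card. The sketch below uses a generic linear-warmup plus cosine-decay formula, not LeRobot's exact implementation, to compare the two readings.

```python
import math

PEAK_LR = 1e-4     # peak learning rate, from the training table
FINAL_LR = 2.5e-6  # final decay LR, from the training table


def cosine_with_warmup(step, warmup_steps, decay_steps):
    """Generic schedule: linear warmup to PEAK_LR, then cosine decay to FINAL_LR."""
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    progress = min(step, decay_steps) - warmup_steps
    span = max(decay_steps - warmup_steps, 1)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress / span))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


TRAIN_STEPS = 9_203

# Reading 1: the schedule uses the configured 1,000 warmup / 30,000 decay steps.
lr_configured = cosine_with_warmup(TRAIN_STEPS, warmup_steps=1_000, decay_steps=30_000)

# Reading 2 (ASSUMPTION): warmup and decay are rescaled to the actual run length.
scale = TRAIN_STEPS / 30_000
lr_rescaled = cosine_with_warmup(
    TRAIN_STEPS, warmup_steps=int(1_000 * scale), decay_steps=TRAIN_STEPS
)
```

With this formula, only the rescaled reading reaches `2.5e-6` at the last step; under the configured horizon the learning rate at step 9,203 would still be close to the peak value.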

How to Use

Install LeRobot and load the policy from the Hub:

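import torch
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Download the fine-tuned weights and configuration from the Hub.
policy = SmolVLAPolicy.from_pretrained(
    "CoRL2026-CSI/smolVLA-UR7e-CaP_arrange_block_10fps"
)

# Use the GPU when one is available; otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
policy.to(device)
policy.eval()  # inference mode: disables dropout and similar training-only behavior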

For robot rollout or evaluation, either load the policy from your existing UR7e control stack or use the LeRobot CLI with --policy.path pointing to this repository:

lerobot-record \
  --policy.path=CoRL2026-CSI/smolVLA-UR7e-CaP_arrange_block_10fps \
  --dataset.repo_id=CoRL2026-CSI/eval_smolVLA-UR7e-CaP_arrange_block_10fps

Adjust the robot, camera, and dataset arguments to match the local UR7e deployment setup.

Files

This repository contains:

model.safetensors: policy weights
config.json: policy configuration
train_config.json: LeRobot training configuration
policy_preprocessor.json: saved inference preprocessing pipeline
policy_preprocessor_step_5_normalizer_processor.safetensors: normalization state
policy_postprocessor.json: saved inference postprocessing pipeline
policy_postprocessor_step_0_unnormalizer_processor.safetensors: action unnormalization state

Evaluation

No evaluation run is reported for this checkpoint. The training configuration had eval_freq=0, so no offline evaluation videos, simulated rollouts, or real-robot success metrics were produced as part of the training job.

Recommended evaluation before deployment:

Run held-out demonstrations or manually selected validation episodes if available.
Run short supervised sanity checks to confirm camera mapping, state dimensions, and action unnormalization.
Start with low-speed, closely supervised real-robot rollouts.
Report success rate, number of trials, reset conditions, and failure modes separately from training loss.
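
When reporting real-robot results, a success rate from a small number of trials is easier to interpret with an uncertainty range. The sketch below computes a Wilson score interval from a list of trial outcomes. The outcomes are hypothetical placeholders, not results for this checkpoint, and the helper name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# HYPOTHETICAL outcomes of 10 supervised rollouts (True = task completed).
outcomes = [True, True, False, True, True, True, False, True, False, True]

successes = sum(outcomes)
rate = successes / len(outcomes)
low, high = wilson_interval(successes, len(outcomes))
```

Reporting `successes`, `len(outcomes)`, and the interval together makes it clear how much the rate could move with more trials.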

Limitations and Safety

This policy is specialized to the recorded UR7e setup, camera placement, workspace geometry, block colors, and purple-line arrangement task.
Performance may degrade if camera extrinsics, lighting, object appearance, workspace layout, robot calibration, or control frequency differ from the training data.
The model card does not claim real-robot success rate. Validate the policy in the target environment before autonomous operation.
Use appropriate robot safety limits, emergency stop procedures, workspace supervision, and conservative speed/force settings during rollout.
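
One simple software-side guard for early rollouts is to limit how far each joint target may move in a single control step. The sketch below assumes the policy output is an absolute joint-position target in radians, which this model card does not state; the joint values and the `MAX_DELTA_RAD` limit are hypothetical, and the helper `clamp_joint_step` is illustrative. It complements, and does not replace, the controller's own safety limits.

```python
def clamp_joint_step(current, target, max_delta):
    """Limit each joint's commanded change to +/- max_delta per control step."""
    return [c + max(-max_delta, min(max_delta, t - c)) for c, t in zip(current, target)]


# HYPOTHETICAL joint positions in radians (arm joints only, gripper omitted).
current_joints = [0.00, -1.57, 1.57, 0.00, 1.57, 0.00]
policy_target = [0.30, -1.60, 1.57, -0.50, 1.57, 0.02]

MAX_DELTA_RAD = 0.05  # HYPOTHETICAL per-step limit; tune for the actual setup

safe_target = clamp_joint_step(current_joints, policy_target, MAX_DELTA_RAD)
```

Targets that are already within the limit pass through unchanged, while larger jumps are shortened to the allowed step size.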

Provenance

Training completed on 2026-05-10 at step 9203. The model weights were uploaded to this repository on 2026-05-11. The final checkpoint used for upload was:

lerobot/outputs/train/smolvla_ur7e_arrange_block_100epi_10fps_gbs256_ep50_20260510_112403/checkpoints/009203/pretrained_model

The first automatic push at the end of distributed training did not finish because the upload stalled while other ranks were waiting at a distributed barrier. The final repository upload was completed separately from the training process.

Citation

If you use this checkpoint, cite LeRobot and SmolVLA where appropriate:

@software{lerobot,
  title = {LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch},
  author = {Hugging Face},
  url = {https://github.com/huggingface/lerobot},
  year = {2024}
}

@misc{smolvla,
  title = {SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics},
  url = {https://huggingface.co/papers/2506.01844},
  year = {2025}
}
