Model Card for xvla

X-VLA is a soft-prompted, flow-matching Vision-Language-Action framework that treats each robot or hardware setup as a "task" encoded with a small set of learnable Soft Prompt embeddings, letting a single model reconcile diverse robot morphologies, sensors, and action spaces.
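The soft-prompt idea described above can be pictured with a toy sketch. This is not the X-VLA implementation: the sizes, the `init_soft_prompts` and `prepend_soft_prompt` helpers, the `other_robot` name, and the choice to prepend the prompt to the token sequence are all assumptions made for illustration. It only shows that each hardware setup owns a small table of vectors while the rest of the model is shared.

```python
import random

PROMPT_LEN = 4   # hypothetical number of soft-prompt vectors per embodiment
EMBED_DIM = 8    # hypothetical embedding width


def init_soft_prompts(embodiments, seed=0):
    """Create one small table of vectors per embodiment (stand-ins for learnable parameters)."""
    rng = random.Random(seed)
    return {
        name: [[rng.gauss(0.0, 0.02) for _ in range(EMBED_DIM)] for _ in range(PROMPT_LEN)]
        for name in embodiments
    }


def prepend_soft_prompt(soft_prompts, embodiment, token_embeddings):
    """Return the embodiment's prompt vectors followed by the input token embeddings."""
    return soft_prompts[embodiment] + token_embeddings


soft_prompts = init_soft_prompts(["panda", "other_robot"])
# Placeholder for the vision, language, and state tokens a shared backbone would consume.
tokens = [[0.0] * EMBED_DIM for _ in range(10)]
sequence = prepend_soft_prompt(soft_prompts, "panda", tokens)
print(len(sequence), len(sequence[0]))
```

In a real model the prompt tables would be trained parameters, so adding a new robot would mean learning a new table rather than a new network.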

xvla architecture

This policy has been trained and pushed to the Hub using LeRobot.

Learn how to train and run it in the LeRobot xvla guide, or browse the full documentation.

Model Details

License: apache-2.0
Fine-tuned from: lerobot/xvla-base
Robot type: panda
Cameras: image, image2

Inputs & Outputs

The policy consumes the observation features listed under Inputs and produces the action features listed under Outputs.

Inputs

Feature	Type	Shape
`observation.images.image`	VISUAL	`(3, 256, 256)`
`observation.images.image2`	VISUAL	`(3, 256, 256)`
`observation.state`	STATE	`(8,)`
`observation.images.image3`	VISUAL	`(3, 224, 224)`

Outputs

Feature	Type	Shape
`action`	ACTION	`(7,)`
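Before feeding data to the policy, it can help to compare your own feature shapes against the tables above. The sketch below copies the shapes from this card; `find_shape_mismatches` is a hypothetical helper, not part of LeRobot, and it assumes the shapes are per-sample (no batch dimension).

```python
# Feature shapes copied from the Inputs and Outputs tables of this card.
INPUT_SHAPES = {
    "observation.images.image": (3, 256, 256),
    "observation.images.image2": (3, 256, 256),
    "observation.state": (8,),
    "observation.images.image3": (3, 224, 224),
}
OUTPUT_SHAPES = {"action": (7,)}


def find_shape_mismatches(batch_shapes, expected=INPUT_SHAPES):
    """Hypothetical helper: map each missing or mis-shaped key to (expected, actual)."""
    problems = {}
    for key, shape in expected.items():
        actual = batch_shapes.get(key)  # None when the key is absent
        if actual != shape:
            problems[key] = (shape, actual)
    return problems


# Example: the second image has the wrong resolution and the third image is missing.
candidate = {
    "observation.images.image": (3, 256, 256),
    "observation.images.image2": (3, 224, 224),
    "observation.state": (8,),
}
mismatches = find_shape_mismatches(candidate)
for key, (want, got) in mismatches.items():
    print(f"{key}: expected {want}, got {got}")
```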

Training Dataset

Repository: HuggingFaceVLA/libero
Episodes: 1693
Frames: 273465
Frame rate: 10.0 FPS
Task(s): "put the white mug on the left plate and put the yellow and white mug on the right plate", "put the white mug on the plate and put the chocolate pudding to the right of the plate", "put the yellow and white mug in the microwave and close it", "turn on the stove and put the moka pot on it", "put both the alphabet soup and the cream cheese box in the basket", "put both the alphabet soup and the tomato sauce in the basket", "put both moka pots on the stove", "put both the cream cheese box and the butter in the basket", "put the black bowl in the bottom drawer of the cabinet and close it", "pick up the book and place it in the back compartment of the caddy", "put the bowl on the plate", "put the wine bottle on the rack", "open the top drawer and put the bowl inside", "put the cream cheese in the bowl", "put the wine bottle on top of the cabinet", "push the plate to the front of the stove", "turn on the stove", "put the bowl on the stove", "put the bowl on top of the cabinet", "open the middle drawer of the cabinet", "pick up the orange juice and place it in the basket", "pick up the ketchup and place it in the basket", "pick up the cream cheese and place it in the basket", "pick up the bbq sauce and place it in the basket", "pick up the alphabet soup and place it in the basket", "pick up the milk and place it in the basket", "pick up the salad dressing and place it in the basket", "pick up the butter and place it in the basket", "pick up the tomato sauce and place it in the basket", "pick up the chocolate pudding and place it in the basket", "pick up the black bowl next to the cookie box and place it on the plate", "pick up the black bowl in the top drawer of the wooden cabinet and place it on the plate", "pick up the black bowl on the ramekin and place it on the plate", "pick up the black bowl on the stove and place it on the plate", "pick up the black bowl between the plate and the ramekin and place it on the plate", "pick up the black bowl on the cookie box and place it on the plate", "pick up the black bowl next to the plate and place it on the plate", "pick up the black bowl next to the ramekin and place it on the plate", "pick up the black bowl from table center and place it on the plate", "pick up the black bowl on the wooden cabinet and place it on the plate"
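The episode, frame, and frame-rate figures above imply a few derived quantities. The short calculation below uses only the numbers reported in this section; the averages are simple ratios and say nothing about how episode lengths are distributed.

```python
# Figures reported in the Training Dataset section of this card.
episodes = 1693
frames = 273465
fps = 10.0

frames_per_episode = frames / episodes          # average episode length in frames
seconds_per_episode = frames_per_episode / fps  # average episode length in seconds
total_hours = frames / fps / 3600               # total recorded time in hours

print(f"{frames_per_episode:.1f} frames per episode on average")
print(f"{seconds_per_episode:.1f} s per episode on average")
print(f"{total_hours:.2f} h of data in total")
```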

Training Configuration

Setting	Value
Training steps	1
Batch size	1
Optimizer	xvla-adamw
Learning rate	0.0001
Seed	1000
LeRobot version	0.5.2
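To put the step count and batch size in context, the sketch below estimates how much of the dataset this run could have seen. It assumes one training sample corresponds to one dataset frame, which is a simplification; the frame count comes from the Training Dataset section.

```python
# Values from the Training Configuration table and the Training Dataset section.
training_steps = 1
batch_size = 1
dataset_frames = 273465

samples_seen = training_steps * batch_size
# Assumption: one sample is one frame, so this is the share of frames visited.
fraction_of_dataset = samples_seen / dataset_frames

print(f"samples seen: {samples_seen}")
print(f"fraction of dataset: {fraction_of_dataset:.2e}")
```

With these settings the weights should be expected to stay very close to those of lerobot/xvla-base.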

How to Get Started with the Model

New to LeRobot? These guides cover the full workflow:

Install LeRobot — set up the lerobot package.
Hardware setup — assemble, wire, and calibrate your robot and cameras.
Record data & train a policy — the end-to-end imitation-learning walkthrough.
CLI cheat-sheet — quick reference for the lerobot-* commands.

The short version to run and train this policy:

Run the policy on your robot

lerobot-rollout \
  --strategy.type=base \
  --robot.type=panda \
  --robot.port=<your_robot_port> \
  --robot.cameras="{ <camera_1>: {type: opencv, index_or_path: <index_or_path>, width: 640, height: 480, fps: 30}, <camera_2>: {type: opencv, index_or_path: <index_or_path>, width: 640, height: 480, fps: 30}}" \
  --policy.path=danielyeh2026/xvla-cpu-smoke \
  --task="put the white mug on the left plate and put the yellow and white mug on the right plate" \
  --duration=60

Replace the remaining <...> placeholders with your own values: --robot.port and the camera names/indices are specific to your machine, and the camera names must match the observation keys this policy was trained on.

With --strategy.type=base, the script does not record episodes. Omitting --duration makes the policy run indefinitely. See the rollout documentation for more information.
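Because the camera names must match what the policy was trained on, it can be convenient to generate the --robot.cameras value rather than type it by hand. The sketch below is a hypothetical helper, not part of LeRobot: it takes the expected names from the "Cameras" line of this card, and the device indices 0 and 1 are placeholders for your own machine.

```python
EXPECTED_CAMERAS = ("image", "image2")  # from the "Cameras" line of this card


def build_cameras_arg(cameras):
    """Hypothetical helper: format camera settings like the --robot.cameras value shown above."""
    missing = [name for name in EXPECTED_CAMERAS if name not in cameras]
    if missing:
        raise ValueError(f"missing cameras expected by the policy: {missing}")
    parts = []
    for name, cfg in cameras.items():
        fields = ", ".join(f"{key}: {value}" for key, value in cfg.items())
        parts.append(f"{name}: {{{fields}}}")
    return "{ " + ", ".join(parts) + "}"


# Placeholder indices; replace them with the devices on your machine.
cameras = {
    "image": {"type": "opencv", "index_or_path": 0, "width": 640, "height": 480, "fps": 30},
    "image2": {"type": "opencv", "index_or_path": 1, "width": 640, "height": 480, "fps": 30},
}
cameras_arg = build_cameras_arg(cameras)
print(f'--robot.cameras="{cameras_arg}"')
```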

Train your own policy

This policy type is usually fine-tuned from the pretrained base model lerobot/xvla-base:

lerobot-train \
  --dataset.repo_id=${HF_USER}/<dataset> \
  --policy.path=lerobot/xvla-base \
  --output_dir=outputs/train/<policy_repo_id> \
  --job_name=lerobot_training \
  --policy.device=cuda \
  --policy.repo_id=${HF_USER}/<policy_repo_id> \
  --wandb.enable=true

Training writes checkpoints to outputs/train/<policy_repo_id>/checkpoints/.
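If you launch training from a script, the command above can be assembled as an argument list. This is a minimal sketch: `build_train_command` is a hypothetical helper, the user, dataset, and policy names are placeholders, and the flags are exactly those shown in the command above.

```python
import shlex


def build_train_command(hf_user, dataset, policy_repo_id, device="cuda"):
    """Hypothetical helper: return the lerobot-train command from this card as an argv list."""
    return [
        "lerobot-train",
        f"--dataset.repo_id={hf_user}/{dataset}",
        "--policy.path=lerobot/xvla-base",
        f"--output_dir=outputs/train/{policy_repo_id}",
        "--job_name=lerobot_training",
        f"--policy.device={device}",
        f"--policy.repo_id={hf_user}/{policy_repo_id}",
        "--wandb.enable=true",
    ]


# Placeholder names; substitute your own Hub user, dataset, and policy repository.
argv = build_train_command("my_user", "my_dataset", "my_xvla_policy")
print(shlex.join(argv))
```

Passing `argv` to `subprocess.run(argv, check=True)` would start the run, assuming lerobot is installed and on your PATH.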

Evaluation

No evaluation results have been provided for this policy yet.

Citation

If you use this policy, please cite the X-VLA paper (arXiv:2510.10274) along with LeRobot:

@misc{cadene2024lerobot,
    author = {Cadene, Remi and Alibert, Simon and Soare, Alexander and Gallouedec, Quentin and Zouitine, Adil and Palma, Steven and Kooijmans, Pepijn and Aractingi, Michel and Shukor, Mustafa and Aubakirova, Dana and Russi, Martino and Capuano, Francesco and Pascal, Caroline and Choghari, Jade and Moss, Jess and Wolf, Thomas},
    title = {LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch},
    howpublished = "\url{https://github.com/huggingface/lerobot}",
    year = {2024}
}

Safetensors

Model size: 0.9B params
Tensor type: F32

Model tree for danielyeh2026/xvla-cpu-smoke

Base model: lerobot/xvla-base
Finetuned: this model (one of 18 fine-tunes of the base model)

Dataset used to train danielyeh2026/xvla-cpu-smoke: HuggingFaceVLA/libero

Paper for danielyeh2026/xvla-cpu-smoke

X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model (arXiv:2510.10274, published Oct 11, 2025)