Model Card for vla_jepa

VLA-JEPA is a Vision-Language-Action model that combines a Qwen3-VL language backbone with a self-supervised video world model (V-JEPA2) and a flow-matching DiT action head.

This policy has been trained and pushed to the Hub using LeRobot.

Learn how to train and run it in the LeRobot vla_jepa guide, or browse the full documentation.

Model Details

License: apache-2.0
Robot type: so101_follower
Cameras: arm_camera, overhead_camera

Inputs & Outputs

The policy consumes the observation features listed under Inputs and produces the action feature listed under Outputs.

Inputs

Feature	Type	Shape
`observation.state`	STATE	`(6,)`
`observation.images.arm_camera`	VISUAL	`(3, 240, 320)`
`observation.images.overhead_camera`	VISUAL	`(3, 240, 320)`

Outputs

Feature	Type	Shape
`action`	ACTION	`(6,)`
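As a rough illustration of how these tables can be used, the sketch below encodes the listed shapes and checks an observation dictionary against them before it is handed to the policy. It assumes unbatched features and values that expose a `.shape` attribute (NumPy arrays and PyTorch tensors both do). `find_shape_problems` and the stand-in objects are hypothetical helpers for this example, not part of LeRobot.

```python
from types import SimpleNamespace

# Feature shapes copied from the Inputs and Outputs tables above.
INPUT_SHAPES = {
    "observation.state": (6,),
    "observation.images.arm_camera": (3, 240, 320),
    "observation.images.overhead_camera": (3, 240, 320),
}
OUTPUT_SHAPES = {"action": (6,)}


def find_shape_problems(batch, expected):
    """Return a list of readable problems; an empty list means the batch matches.

    `batch` maps feature names to objects exposing a `.shape` attribute.
    """
    problems = []
    for name, shape in expected.items():
        if name not in batch:
            problems.append(f"missing feature: {name}")
            continue
        actual = tuple(batch[name].shape)
        if actual != shape:
            problems.append(f"{name}: expected {shape}, got {actual}")
    return problems


# Stand-ins with only a .shape attribute, so the sketch runs without NumPy or PyTorch.
good = {name: SimpleNamespace(shape=shape) for name, shape in INPUT_SHAPES.items()}
bad = dict(good)
# Hypothetical mistake: a camera frame whose shape differs from the table.
bad["observation.images.arm_camera"] = SimpleNamespace(shape=(3, 480, 640))

print(find_shape_problems(good, INPUT_SHAPES))
print(find_shape_problems(bad, INPUT_SHAPES))
```

If your pipeline adds a leading batch dimension, the expected shapes would need that dimension prepended before comparing.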

Training Dataset

Repository: binhpham/naive-bench
Episodes: 312
Frames: 115378
Frame rate: 30 FPS
Task(s): "put the blue bar into the white bin", "put the blue bar into the yellow bin", "put the blue bar into the blue bin", "put the red bar into the yellow bin", "put the red bar into the orange bin", "put the red bar into the blue bin", "put the yellow bar into the white bin", "put the yellow bar into the yellow bin", "put the yellow bar into the orange bin", "put the green bar into the white bin", "put the green bar into the yellow bin", "put the green bar into the blue bin", "put the purple bar into the white bin", "put the purple bar into the yellow bin", "put the purple bar into the orange bin", "put the purple bar into the blue bin", "put the orange bar into the white bin", "put the orange bar into the yellow bin", "put the orange bar into the orange bin", "put the orange bar into the blue bin"
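For a sense of scale, the figures above can be turned into recording time and mean episode length. This is a back-of-the-envelope sketch that assumes every frame was recorded at the stated 30 FPS; the variable names are illustrative.

```python
# Values copied from the Training Dataset section.
EPISODES = 312
FRAMES = 115_378
FPS = 30

total_seconds = FRAMES / FPS
frames_per_episode = FRAMES / EPISODES
seconds_per_episode = frames_per_episode / FPS

print(f"total recording time: {total_seconds / 60:.1f} min")
print(f"mean episode length: {frames_per_episode:.1f} frames ({seconds_per_episode:.1f} s)")
```

Under that assumption the dataset holds roughly 64 minutes of demonstrations, at about 12 seconds per episode.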

Training Configuration

Setting	Value
Training steps	30000
Batch size	16
Optimizer	adamw
Learning rate	0.0001
Seed	1000
LeRobot version	0.5.2
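The step count and batch size give a rough idea of how much data the policy saw during training. The sketch below assumes a single training process with no gradient accumulation, and treats each dataset frame as one training sample; neither assumption is stated in this card.

```python
# Values copied from the Training Configuration and Training Dataset sections.
TRAINING_STEPS = 30_000
BATCH_SIZE = 16
DATASET_FRAMES = 115_378

samples_seen = TRAINING_STEPS * BATCH_SIZE
approx_epochs = samples_seen / DATASET_FRAMES

print(f"samples seen: {samples_seen}")
print(f"approximate passes over the dataset: {approx_epochs:.2f}")
```

With those assumptions, training covered 480,000 samples, or a little over four passes through the dataset.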

How to Get Started with the Model

New to LeRobot? These guides cover the full workflow:

Install LeRobot — set up the lerobot package.
Hardware setup — assemble, wire, and calibrate your robot and cameras.
Record data & train a policy — the end-to-end imitation-learning walkthrough.
CLI cheat-sheet — quick reference for the lerobot-* commands.

The short version to run and train this policy:

Run the policy on your robot

lerobot-rollout \
  --strategy.type=base \
  --robot.type=so101_follower \
  --robot.port=<your_robot_port> \
  --robot.cameras="{ <camera_1>: {type: opencv, index_or_path: <index_or_path>, width: 640, height: 480, fps: 30}, <camera_2>: {type: opencv, index_or_path: <index_or_path>, width: 640, height: 480, fps: 30}}" \
  --policy.path=binhpham/naive-bench-vla-jepa \
  --task="put the blue bar into the white bin" \
  --duration=60

Replace the <...> placeholders with your own values. --robot.port and the camera indices are specific to your machine. The camera names must match the observation keys this policy was trained on (arm_camera and overhead_camera).
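One way to avoid typos in the --robot.cameras value is to generate it. The sketch below builds the string in the same layout as the command above, using the camera names from Model Details. The helper `cameras_arg` is hypothetical, and the device indices 0 and 2 are placeholders that will differ on your machine.

```python
def cameras_arg(index_by_name, width=640, height=480, fps=30):
    """Build the value for --robot.cameras from a {camera_name: index_or_path} mapping."""
    entries = [
        f"{name}: {{type: opencv, index_or_path: {index}, width: {width}, height: {height}, fps: {fps}}}"
        for name, index in index_by_name.items()
    ]
    return "{ " + ", ".join(entries) + "}"


# Camera names come from this card; the indices are placeholders (assumption).
value = cameras_arg({"arm_camera": 0, "overhead_camera": 2})
print(f'--robot.cameras="{value}"')
```

The printed line can be pasted in place of the --robot.cameras line in the command above.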

With --strategy.type=base, the script does not record episodes. If you omit --duration, the policy runs indefinitely. For more information, see the rollout documentation.

Train your own policy

lerobot-train \
  --dataset.repo_id=${HF_USER}/<dataset> \
  --policy.type=vla_jepa \
  --output_dir=outputs/train/<policy_repo_id> \
  --job_name=lerobot_training \
  --policy.device=cuda \
  --policy.repo_id=${HF_USER}/<policy_repo_id> \
  --wandb.enable=true

Training writes checkpoints to outputs/train/<policy_repo_id>/checkpoints/.

Evaluation

No evaluation results have been provided for this policy yet.

Citation

If you use this policy, please cite the VLA-JEPA paper (2602.10098), along with LeRobot:

@misc{cadene2024lerobot,
    author = {Cadene, Remi and Alibert, Simon and Soare, Alexander and Gallouedec, Quentin and Zouitine, Adil and Palma, Steven and Kooijmans, Pepijn and Aractingi, Michel and Shukor, Mustafa and Aubakirova, Dana and Russi, Martino and Capuano, Francesco and Pascal, Caroline and Choghari, Jade and Moss, Jess and Wolf, Thomas},
    title = {LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch},
    howpublished = "\url{https://github.com/huggingface/lerobot}",
    year = {2024}
}

Safetensors

Model size: 3B params
Tensor types: F32, BF16

Paper for binhpham/naive-bench-vla-jepa

VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model (paper 2602.10098, published Feb 10)