Model Card for TBD-VLA
Project Webpage: https://tbd-vla.github.io/
TBD-VLA is a Vision-Language-Action policy based on Block Discrete Denoising Diffusion. It uses a Qwen3-VL vision-language backbone and predicts robot action chunks through temporal-level block diffusion.
This policy was trained and pushed to the Hub using a fork of LeRobot. See the full documentation here.
How to Get Started with the Model
Installation
git clone https://github.com/TBD-VLA/lerobot.git
cd lerobot
uv python install 3.12
uv venv --python 3.12
source .venv/bin/activate
uv pip install -e ".[libero]"
uv pip install -U transformers
uv pip install -U accelerate
Train from scratch
python src/lerobot/scripts/lerobot_train.py \
--policy.type=tbdvla \
--output_dir=$OUTPUT_DIR \
--dataset.repo_id=sean1295/libero_all \
--job_name=tbdvla_experiment \
--steps=150000 \
--batch_size=4 \
--save_freq=20000 \
--log_freq=1000 \
--policy.device=cuda \
--policy.n_bins=512 \
--policy.block_temporal_size=4 \
--policy.n_diffusion_steps=2 \
--policy.gripper_dims=[-1] \
--policy.chunk_size=16 \
--policy.n_action_steps=16 \
--policy.gradient_checkpointing=true \
--policy.push_to_hub=false \
--wandb.enable=false
Writes checkpoints to the configured output directory.
Multi-GPU training
accelerate launch --multi_gpu --num_processes=4 \
src/lerobot/scripts/lerobot_train.py \
--wandb.enable=false \
--num_workers=4 \
--policy.type=tbdvla \
--policy.push_to_hub=false \
--dataset.repo_id=sean1295/libero_all \
--steps=150000 \
--save_freq=20000 \
--log_freq=1000 \
--batch_size=16 \
--policy.device=cuda \
--policy.n_bins=512 \
--policy.block_temporal_size=4 \
--policy.n_diffusion_steps=2 \
--policy.gripper_dims=[-1] \
--policy.chunk_size=16 \
--policy.n_action_steps=16
Evaluate the policy/run inference
uv run python src/lerobot/scripts/lerobot_eval.py \
--policy.path=$CKPT_DIR \
--env.type=libero \
--env.task=libero_10 \
--eval.n_episodes=50 \
--eval.batch_size=1 \
--eval.use_async_envs=false \
--policy.device=cuda \
--policy.n_action_steps=12 \
--policy.n_diffusion_steps=2 \
--policy.compile_model=true
Use --policy.path to point to a local or Hub checkpoint.
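The evaluation command above executes 12 actions per inference call (--policy.n_action_steps=12) from a predicted chunk of 16. The sketch below illustrates how that setting determines the number of policy calls in an episode. It is an illustration only, not code from the LeRobot fork: the function name plan_inference_calls and the episode length used in the example are hypothetical.

```python
def plan_inference_calls(episode_len, chunk_size=16, n_action_steps=12):
    """Return (start_step, n_executed) for each policy call in an episode.

    Hypothetical helper: each call predicts chunk_size actions, executes the
    first n_action_steps of them, then the policy is queried again.
    """
    if not 0 < n_action_steps <= chunk_size:
        raise ValueError("n_action_steps must be in (0, chunk_size]")
    calls = []
    for start in range(0, episode_len, n_action_steps):
        # The final call may execute fewer actions if the episode ends early.
        executed = min(n_action_steps, episode_len - start)
        calls.append((start, executed))
    return calls


if __name__ == "__main__":
    for start, executed in plan_inference_calls(30):
        print(f"call at step {start}: execute {executed} actions")
```

A smaller n_action_steps means more frequent replanning at the cost of more inference calls; a value equal to chunk_size executes every predicted action before querying the policy again.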
Model Details
- Model type: Vision-Language-Action policy
- Architecture: Block Diffusion VLA
- VLM backbone: Qwen/Qwen3-VL-2B-Instruct
- License: apache-2.0
Architecture
TBD-VLA contains the following main components:
| Component | Description |
|---|---|
| VLM backbone | Qwen3-VL model used for vision-language conditioning |
| Action tokenizer | Discretizes continuous robot actions into token bins |
| Block denoising module | Performs block-temporal denoising over action chunks |
| Pre/post-processors | Handle normalization, device transfer, and action conversion |
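To make the action tokenizer row concrete, here is a minimal sketch of discretizing a continuous action value into one of n_bins token bins and mapping it back. It assumes uniform bins over a normalized range of [-1, 1]; the card does not state the actual binning scheme, so the functions discretize and undiscretize and the range are assumptions, not the TBD-VLA implementation.

```python
def discretize(value, n_bins=512, low=-1.0, high=1.0):
    """Map a continuous value to a bin index in [0, n_bins - 1].

    Assumption: uniform bins over [low, high]; out-of-range values are clipped.
    """
    clipped = min(max(value, low), high)
    idx = int((clipped - low) / (high - low) * n_bins)
    # The upper edge (value == high) would land on n_bins, so clamp it.
    return min(idx, n_bins - 1)


def undiscretize(idx, n_bins=512, low=-1.0, high=1.0):
    """Map a bin index back to the center of that bin."""
    width = (high - low) / n_bins
    return low + (idx + 0.5) * width


if __name__ == "__main__":
    token = discretize(0.3)
    print(token, undiscretize(token))
```

With uniform bins, the round-trip error is at most half a bin width, which is (high - low) / (2 * n_bins).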
Files
| File | Description |
|---|---|
| configuration_tbdvla.py | TBDVLAConfig dataclass with policy hyperparameters |
| modeling_tbdvla.py | TBDVLAPolicy implementation, including model, training loss, and inference |
| processor_tbdvla.py | Pre/post-processing pipelines for normalization and device transfer |
| __init__.py | Exports TBDVLAConfig, TBDVLAPolicy, and make_tbdvla_pre_post_processors |
Key Parameters
TBD-VLA Parameters
Model Architecture
| Parameter | Description | Default |
|---|---|---|
| --policy.vlm_checkpoint | Qwen3-VL model ID | Qwen/Qwen3-VL-2B-Instruct |
| --policy.num_vlm_layers | Number of VLM layers to use (-1 = all) | -1 |
Diffusion / Block Denoising
| Parameter | Description | Default |
|---|---|---|
| --policy.block_temporal_size | Temporal steps per block | 4 |
| --policy.n_diffusion_steps | Number of denoising steps at inference | 2 |
| --policy.chunk_size | Action chunk length (must be a multiple of block_temporal_size) | 16 |
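The relationship between chunk_size and block_temporal_size can be sketched as follows. The example assumes a chunk is split into contiguous, equally sized temporal blocks, which is one natural reading of the parameters above; the helper block_layout is hypothetical and not part of the TBD-VLA code.

```python
def block_layout(chunk_size=16, block_temporal_size=4):
    """Split timestep indices 0..chunk_size-1 into contiguous temporal blocks.

    Assumption: blocks are contiguous and equally sized, so chunk_size must
    divide evenly by block_temporal_size.
    """
    if chunk_size % block_temporal_size != 0:
        raise ValueError("chunk_size must be a multiple of block_temporal_size")
    return [
        list(range(start, start + block_temporal_size))
        for start in range(0, chunk_size, block_temporal_size)
    ]


if __name__ == "__main__":
    for i, block in enumerate(block_layout()):
        print(f"block {i}: timesteps {block}")
```

With the defaults, a 16-step chunk yields 4 blocks of 4 timesteps each.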
Training Hyperparameters
| Parameter | Description | Default |
|---|---|---|
| --policy.n_bins | Number of action discretization bins | 512 |
| --policy.n_obs_steps | Number of observation steps (only 1 supported) | 1 |
| --policy.max_task_tokens | Max task/language tokens fed to the VLM | 64 |
| --policy.use_state | Include proprioceptive state input | true |
| --policy.state_dropout_p | Dropout probability for state input | 0.0 |
| --policy.image_resolution | Resize images to this resolution before cropping (skipped if already that size) | 256,256 |
| --policy.crop_shape | Image crop dimensions (e.g., 224,224) | None |
| --policy.gradient_checkpointing | Enable gradient checkpointing (saves VRAM) | false |
| --policy.precision | Training precision (float16, bfloat16, float32) | bfloat16 |
| --policy.attn_implementation | Attention backend (eager, sdpa, flex_attention) | sdpa |
| --policy.optimizer_lr | AdamW learning rate (applied to all parameters) | 1e-4 |
| --policy.optimizer_betas | Adam betas | (0.95, 0.999) |
| --policy.optimizer_weight_decay | Weight decay | 0.01 |
| --policy.scheduler_name | LR scheduler type | cosine |
| --policy.scheduler_warmup_steps | Warmup steps | 500 |
| --policy.grad_clip_norm | Gradient clipping norm | 1.0 |
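As a rough illustration of how the learning-rate defaults interact, the sketch below computes a cosine schedule with linear warmup using optimizer_lr=1e-4, scheduler_warmup_steps=500, and the 150000 training steps from the commands above. The exact schedule used by the LeRobot fork is not described in this card, so the formula (linear warmup from 0, cosine decay to 0) and the function name lr_at are assumptions.

```python
import math


def lr_at(step, base_lr=1e-4, warmup_steps=500, total_steps=150_000):
    """Learning rate at a given step under linear warmup plus cosine decay.

    Assumption: warmup rises linearly from 0 to base_lr, then the rate decays
    to 0 along a half cosine over the remaining steps.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 250, 500, 75_250, 150_000):
        print(step, lr_at(step))
```

Under these assumptions the rate peaks at the end of warmup and falls to half its peak at the midpoint of the decay phase.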
Inference Hyperparameters
| Parameter | Description | Default |
|---|---|---|
| --policy.n_action_steps | Steps executed per inference (must be <= chunk_size) | 12 |
| --policy.gripper_dims | Indices of gripper dimensions, treated as sticky (binary) grippers; their values become either -1 or 1 | [-1] |
| --policy.expectation_sample | Use expectation-based sampling | true |
| --policy.compile_model | Wrap the VLM forward in torch.compile (faster inference, one-time compile cost) | false |
| --policy.latency_timestep | Latency compensation timestep using Real-Time Chunking | 0 |
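The sketch below illustrates two of the inference options above under stated assumptions. For expectation_sample, it assumes the decoded action is the probability-weighted mean of bin centers, as opposed to the center of the most likely bin. For gripper_dims, it shows only the mapping of the listed dimensions to -1 or 1, with a threshold at 0 chosen for illustration. All function names are hypothetical, and none of this is taken from the TBD-VLA source.

```python
import math


def bin_centers(n_bins, low=-1.0, high=1.0):
    """Centers of n_bins uniform bins over [low, high] (assumed binning)."""
    width = (high - low) / n_bins
    return [low + (i + 0.5) * width for i in range(n_bins)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits, expectation_sample=True):
    """Turn per-bin logits for one action dimension into a continuous value."""
    centers = bin_centers(len(logits))
    probs = softmax(logits)
    if expectation_sample:
        # Assumed meaning: expected value of the bin centers.
        return sum(p * c for p, c in zip(probs, centers))
    # Otherwise take the center of the most likely bin.
    best = max(range(len(probs)), key=probs.__getitem__)
    return centers[best]


def binarize_gripper(action, gripper_dims=(-1,)):
    """Snap the listed dimensions to -1 or 1 (threshold at 0 is an assumption)."""
    out = list(action)
    for dim in gripper_dims:
        out[dim] = 1.0 if out[dim] >= 0 else -1.0
    return out


if __name__ == "__main__":
    print(decode([0.0, 0.0, 0.0, 5.0]))
    print(decode([0.0, 0.0, 0.0, 5.0], expectation_sample=False))
    print(binarize_gripper([0.2, -0.4, 0.3]))
```

An expectation over bins gives values between bin centers, which can smooth out quantization, while taking the most likely bin always returns one of the n_bins discrete centers.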
VLM Backbones
Set any Qwen3-VL checkpoint with:
--policy.vlm_checkpoint=<checkpoint_id>
The default checkpoint is:
Qwen/Qwen3-VL-2B-Instruct
Larger Qwen3-VL variants may increase model capacity but require more VRAM.
External Links
- Project Webpage: https://tbd-vla.github.io/
- TBD-VLA: Temporal Block Diffusion Vision Language Action Model
- TBD-VLA LeRobot 🤗
BibTeX
@article{lee2026tbdvlatemporalblockdiffusion,
title={TBD-VLA: Temporal Block Diffusion Vision Language Action Model},
author={Lee, Sung-Wook and Kang, Xuhui and Kuo, Yen-Ling},
journal={arXiv preprint},
year={2026},
}