Model Card for TBD-VLA

Project Webpage: https://tbd-vla.github.io/

TBD-VLA is a Vision-Language-Action policy based on Block Discrete Denoising Diffusion. It uses a Qwen3-VL vision-language backbone and predicts robot action chunks through temporal-level block diffusion.

This policy was trained and pushed to the Hub using a fork of LeRobot. See the full documentation here.

How to Get Started with the Model

Installation

git clone https://github.com/TBD-VLA/lerobot.git
cd lerobot
uv python install 3.12
uv venv --python 3.12
source .venv/bin/activate
uv pip install -e ".[libero]"
uv pip install -U transformers
uv pip install -U accelerate

Train from scratch

python src/lerobot/scripts/lerobot_train.py \
  --policy.type=tbdvla \
  --output_dir=$OUTPUT_DIR \
  --dataset.repo_id=sean1295/libero_all \
  --job_name=tbdvla_experiment \
  --steps=150000 \
  --batch_size=4 \
  --save_freq=20000 \
  --log_freq=1000 \
  --policy.device=cuda \
  --policy.n_bins=512 \
  --policy.block_temporal_size=4 \
  --policy.n_diffusion_steps=2 \
  --policy.gripper_dims=[-1] \
  --policy.chunk_size=16 \
  --policy.n_action_steps=16 \
  --policy.gradient_checkpointing=true \
  --policy.push_to_hub=false \
  --wandb.enable=false

Checkpoints are written to the directory given by `--output_dir`, at the interval set by `--save_freq`.

Multi-GPU training

accelerate launch --multi_gpu --num_processes=4 \
  src/lerobot/scripts/lerobot_train.py \
  --wandb.enable=false \
  --num_workers=4 \
  --policy.type=tbdvla \
  --policy.push_to_hub=false \
  --dataset.repo_id=sean1295/libero_all \
  --steps=150000 \
  --save_freq=20000 \
  --log_freq=1000 \
  --batch_size=16 \
  --policy.device=cuda \
  --policy.n_bins=512 \
  --policy.block_temporal_size=4 \
  --policy.n_diffusion_steps=2 \
  --policy.gripper_dims=[-1] \
  --policy.chunk_size=16 \
  --policy.n_action_steps=16
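
As a quick sanity check on the multi-GPU settings above, the sketch below computes the global batch size and the total number of samples processed. It assumes `--batch_size` is a per-process value, which is common under `accelerate launch` but is not confirmed for this fork. The function name `effective_batch_size` is hypothetical.

```python
def effective_batch_size(per_process_batch: int, num_processes: int) -> int:
    """Global batch size, ASSUMING --batch_size is applied per process."""
    if per_process_batch <= 0 or num_processes <= 0:
        raise ValueError("both arguments must be positive")
    return per_process_batch * num_processes


# Values taken from the multi-GPU command above.
global_batch = effective_batch_size(per_process_batch=16, num_processes=4)
total_samples = global_batch * 150_000  # --steps=150000

print(global_batch)   # 64
print(total_samples)  # 9600000
```

If the fork instead treats `--batch_size` as a global value split across processes, the global batch size would simply be 16.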

Evaluate the policy/run inference

uv run python src/lerobot/scripts/lerobot_eval.py \
  --policy.path=$CKPT_DIR \
  --env.type=libero \
  --env.task=libero_10 \
  --eval.n_episodes=50 \
  --eval.batch_size=1 \
  --eval.use_async_envs=false \
  --policy.device=cuda \
  --policy.n_action_steps=12 \
  --policy.n_diffusion_steps=2 \
  --policy.compile_model=true

Use --policy.path to point to a local or Hub checkpoint.
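
The evaluation command above executes 12 actions (`--policy.n_action_steps=12`) out of each predicted chunk. The sketch below illustrates what that implies for the number of policy calls per episode. It assumes the policy predicts a fresh chunk every `n_action_steps` environment steps and discards the unexecuted tail of the previous chunk. The helper names and the 500-step episode length are hypothetical.

```python
import math


def count_policy_calls(episode_len: int, n_action_steps: int) -> int:
    """Number of chunk predictions needed to cover an episode.

    ASSUMES a new chunk is predicted every `n_action_steps` environment steps.
    """
    if n_action_steps <= 0:
        raise ValueError("n_action_steps must be positive")
    return math.ceil(episode_len / n_action_steps)


def executed_indices(chunk_size: int, n_action_steps: int) -> list[int]:
    """Indices within a predicted chunk that are actually executed."""
    if n_action_steps > chunk_size:
        raise ValueError("n_action_steps must be <= chunk_size")
    return list(range(n_action_steps))


print(count_policy_calls(episode_len=500, n_action_steps=12))    # 42
print(len(executed_indices(chunk_size=16, n_action_steps=12)))  # 12
```

A smaller `n_action_steps` replans more often at the cost of more forward passes per episode.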

Model Details

Model type: Vision-Language-Action policy
Architecture: Block Diffusion VLA
VLM backbone: Qwen/Qwen3-VL-2B-Instruct
License: apache-2.0

Architecture

TBD-VLA contains the following main components:

Component	Description
VLM backbone	Qwen3-VL model used for vision-language conditioning
Action tokenizer	Discretizes continuous robot actions into token bins
Block denoising module	Performs block-temporal denoising over action chunks
Pre/post-processors	Handle normalization, device transfer, and action conversion
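
To make the action tokenizer row concrete, here is a minimal sketch of discretizing a continuous action value into one of `n_bins` tokens and mapping it back. It assumes uniform bins over actions normalized to [-1, 1]; the actual tokenizer in `modeling_tbdvla.py` may use a different range or binning scheme. The function names are hypothetical.

```python
def discretize(action: float, n_bins: int = 512,
               low: float = -1.0, high: float = 1.0) -> int:
    """Map a continuous value to a bin index in [0, n_bins - 1] (uniform bins)."""
    clipped = min(max(action, low), high)
    width = (high - low) / n_bins
    # The upper edge of the range is assigned to the last bin.
    return min(int((clipped - low) / width), n_bins - 1)


def undiscretize(token: int, n_bins: int = 512,
                 low: float = -1.0, high: float = 1.0) -> float:
    """Map a bin index back to the center of its bin."""
    width = (high - low) / n_bins
    return low + (token + 0.5) * width


token = discretize(0.25)
recovered = undiscretize(token)
print(token)      # 320
print(recovered)  # within half a bin width of 0.25
```

Under this uniform scheme, the round-trip error is bounded by half a bin width, which is 1/512 for 512 bins over [-1, 1].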

Files

File	Description
`configuration_tbdvla.py`	`TBDVLAConfig` dataclass with policy hyperparameters
`modeling_tbdvla.py`	`TBDVLAPolicy` implementation, including model, training loss, and inference
`processor_tbdvla.py`	Pre/post-processing pipelines for normalization and device transfer
`__init__.py`	Exports `TBDVLAConfig`, `TBDVLAPolicy`, and `make_tbdvla_pre_post_processors`

Key Parameters

TBD-VLA Parameters

Model Architecture

Parameter	Description	Default
`--policy.vlm_checkpoint`	Qwen3-VL model ID	`Qwen/Qwen3-VL-2B-Instruct`
`--policy.num_vlm_layers`	Number of VLM layers to use (-1 = all)	-1

Diffusion / Block Denoising

Parameter	Description	Default
`--policy.block_temporal_size`	Temporal steps per block	4
`--policy.n_diffusion_steps`	Number of denoising steps at inference	2
`--policy.chunk_size`	Action chunk length (must be a multiple of `block_temporal_size`)	16
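
The relationship between `chunk_size` and `block_temporal_size` can be sketched as a partition of the chunk's timesteps into consecutive blocks. The helper below is hypothetical and only illustrates the divisibility constraint. The final lines assume that blocks are denoised one after another, each for `n_diffusion_steps` iterations, which is a plausible reading of block diffusion rather than a confirmed detail of this implementation.

```python
def split_into_blocks(chunk_size: int, block_temporal_size: int) -> list[range]:
    """Partition timestep indices 0..chunk_size-1 into consecutive blocks."""
    if block_temporal_size <= 0:
        raise ValueError("block_temporal_size must be positive")
    if chunk_size % block_temporal_size != 0:
        raise ValueError("chunk_size must be a multiple of block_temporal_size")
    return [
        range(start, start + block_temporal_size)
        for start in range(0, chunk_size, block_temporal_size)
    ]


blocks = split_into_blocks(chunk_size=16, block_temporal_size=4)
print(len(blocks))      # 4
print(list(blocks[1]))  # [4, 5, 6, 7]

# ASSUMPTION: each block is denoised for n_diffusion_steps iterations in turn.
n_diffusion_steps = 2
print(len(blocks) * n_diffusion_steps)  # 8
```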

Training Hyperparameters

Parameter	Description	Default
`--policy.n_bins`	Number of action discretization bins	512
`--policy.n_obs_steps`	Number of observation steps (only 1 supported)	1
`--policy.max_task_tokens`	Max task/language tokens fed to the VLM	64
`--policy.use_state`	Include proprioceptive state input	true
`--policy.state_dropout_p`	Dropout probability for state input	0.0
`--policy.image_resolution`	Resize images to this resolution before cropping (skipped if already that size)	256,256
`--policy.crop_shape`	Image crop dimensions (e.g., `224,224`)	None
`--policy.gradient_checkpointing`	Enable gradient checkpointing (saves VRAM)	false
`--policy.precision`	Training precision (`float16`, `bfloat16`, `float32`)	`bfloat16`
`--policy.attn_implementation`	Attention backend (`eager`, `sdpa`, `flex_attention`)	`sdpa`
`--policy.optimizer_lr`	AdamW learning rate (applied to all parameters)	1e-4
`--policy.optimizer_betas`	Adam betas	(0.95, 0.999)
`--policy.optimizer_weight_decay`	Weight decay	0.01
`--policy.scheduler_name`	LR scheduler type	`cosine`
`--policy.scheduler_warmup_steps`	Warmup steps	500
`--policy.grad_clip_norm`	Gradient clipping norm	1.0
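
For intuition about the optimizer defaults, the sketch below evaluates a learning-rate schedule with linear warmup followed by cosine decay, using the default `optimizer_lr` and `scheduler_warmup_steps` and the `--steps=150000` value from the training commands. The exact shape (linear warmup, decay to zero) is an assumption; the scheduler selected by `scheduler_name=cosine` in the fork may differ. `lr_at` is a hypothetical helper.

```python
import math


def lr_at(step: int, base_lr: float = 1e-4, warmup_steps: int = 500,
          total_steps: int = 150_000) -> float:
    """Learning rate under linear warmup, then cosine decay to zero (assumed)."""
    if step < warmup_steps:
        # Ramp linearly up to base_lr over the warmup period.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))


for step in (0, 499, 500, 75_250, 150_000):
    print(step, lr_at(step))
```

Under these assumptions the rate peaks at 1e-4 when warmup ends, is about 5e-5 at the midpoint of the decay (step 75,250), and reaches zero at the final step.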

Inference Hyperparameters

Parameter	Description	Default
`--policy.n_action_steps`	Steps executed per inference (must be <= chunk_size)	12
`--policy.gripper_dims`	Indices of gripper dimensions, for sticky (binary) grippers; gripper values become either -1 or 1	[-1]
`--policy.expectation_sample`	Use expectation-based sampling	true
`--policy.compile_model`	Wrap the VLM forward in `torch.compile` (faster inference, one-time compile cost)	false
`--policy.latency_timestep`	Latency compensation timestep used with Real-Time Chunking	0
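
The sketch below illustrates one plausible reading of `expectation_sample` and `gripper_dims`: decoding an action as the probability-weighted average of bin centers rather than the single most likely bin, then snapping gripper dimensions to -1 or 1. The uniform bins over [-1, 1], the zero threshold for the gripper, and all function names are assumptions for illustration, not the fork's actual implementation.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bin_centers(n_bins: int, low: float = -1.0, high: float = 1.0) -> list[float]:
    """Centers of uniform bins over [low, high] (assumed binning scheme)."""
    width = (high - low) / n_bins
    return [low + (i + 0.5) * width for i in range(n_bins)]


def expectation_decode(logits: list[float]) -> float:
    """Probability-weighted average of bin centers."""
    probs = softmax(logits)
    return sum(p * c for p, c in zip(probs, bin_centers(len(logits))))


def argmax_decode(logits: list[float]) -> float:
    """Center of the single most likely bin (first one on ties)."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return bin_centers(len(logits))[best]


def binarize_gripper(action: list[float], gripper_dims=(-1,)) -> list[float]:
    """Snap the listed dimensions to -1.0 or 1.0 (threshold at 0 is assumed)."""
    out = list(action)
    for d in gripper_dims:
        out[d] = 1.0 if out[d] >= 0.0 else -1.0
    return out


logits = [0.0, 5.0, 5.0, 0.0]      # two equally likely middle bins
print(argmax_decode(logits))       # -0.25
print(expectation_decode(logits))  # close to 0.0, between the two bins
print(binarize_gripper([0.1, -0.3, 0.4]))  # [0.1, -0.3, 1.0]
```

With probability mass split across neighboring bins, the expectation lands between them, which gives a smoother continuous action than picking a single bin.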

VLM Backbones

Set any Qwen3-VL checkpoint with:

--policy.vlm_checkpoint=<checkpoint_id>

The default checkpoint is:

Qwen/Qwen3-VL-2B-Instruct

Larger Qwen3-VL variants may increase model capacity but require more VRAM.

External Links

Project Webpage: https://tbd-vla.github.io/

TBD-VLA: Temporal Block Diffusion Vision Language Action Model

TBD-VLA LeRobot 🤗

LeRobot Fork: https://github.com/TBD-VLA/lerobot

BibTeX

@article{lee2026tbdvlatemporalblockdiffusion,
      title={TBD-VLA: Temporal Block Diffusion Vision Language Action Model},
      author={Lee, Sung-Wook and Kang, Xuhui and Kuo, Yen-Ling},
      journal={arXiv preprint},
      year={2026},
}