
license: mit
tags:
  - Vision-Language-Action
  - OpenHelix Team
base_model:
  - Qwen/Qwen2.5-0.5B
language:
  - en
pipeline_tag: robotics

Model Card for VLA-Adapter Libero-Spatial

This checkpoint is the VLA-Adapter model ("VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model") trained on LIBERO-Spatial.

💬 Project page: https://vla-adapter.github.io/
🖥️ Dataset: https://huggingface.co/datasets/openvla/modified_libero_rlds/tree/main
🤗 HuggingFace: https://huggingface.co/VLA-Adapter
GitHub: https://github.com/OpenHelix-Team/VLA-Adapter

Model Details

We have developed and released the VLA-Adapter family of VLA models, a series of fine-tuned generative action models. The VLM in VLA-Adapter follows the Prismatic-VLM architecture and uses only a very small LLM backbone (Qwen2.5-0.5B). On the LIBERO and CALVIN benchmarks reported below, its average scores surpass those of the listed open-source VLA models with 8.5B, 7B, 4B, 3B, and 2B backbones.

Input: The models take images and text as input.

Output: The models generate actions only.

Model Architecture: VLA-Adapter consists of a VLM that receives and processes image and text inputs and a policy that generates actions. We systematically analyzed how much different types of VLM-derived conditions benefit the policy and settled on a unified framework. Our Bridge Attention module then fuses the conditions produced by the VLM with the initial action information in the policy, narrowing the gap between vision-language (VL) representations and actions (A) as far as possible. The result is a high-performance VLA model on a tiny-scale backbone.
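To give a rough feel for what fusing VLM conditions with action information can look like, the sketch below implements generic single-head cross-attention in NumPy, where action tokens act as queries over condition tokens. This is not the Bridge Attention module itself, whose details are not given in this card. The function names, token counts, feature size, residual connection, and random weights are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(action_tokens, condition_tokens, w_q, w_k, w_v):
    """Generic single-head cross-attention (illustrative only).

    action_tokens:    (num_actions, dim) queries from the policy side.
    condition_tokens: (num_conditions, dim) features from the VLM side.
    Returns the fused action tokens and the attention weights.
    """
    q = action_tokens @ w_q
    k = condition_tokens @ w_k
    v = condition_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    # Residual connection: keep the initial action information and add
    # what was read from the conditions (an assumption of this sketch).
    fused = action_tokens + weights @ v
    return fused, weights


# Hypothetical sizes: 8 action tokens, 32 condition tokens, 16 features.
rng = np.random.default_rng(0)
dim = 16
action_tokens = rng.normal(size=(8, dim))
condition_tokens = rng.normal(size=(32, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

fused, weights = cross_attention(action_tokens, condition_tokens, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Each row of `weights` sums to 1, so every action token receives a convex combination of the projected condition features on top of its own initial value.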

Success Rate Comparison

LIBERO	Methods	Scale	Spatial	Object	Goal	Long	Avg.
Large-scale	FlowVLA (Zhong et al., 2025)	8.5B	93.2	95.0	91.6	72.6	88.1
	UnifiedVLA (Wang et al., 2025)	8.5B	95.4	98.8*	93.6	94.0	95.5
	OpenVLA (Kim et al., 2024)	7B	84.7	88.4	79.2	53.7	76.5
	OpenVLA-OFT (Kim et al., 2025)	7B	97.6*	98.4	97.9	94.5*	97.1*
	UniVLA (Bu et al., 2025)	7B	96.5	96.8	95.6	92.0	95.2
	CoT-VLA (Zhao et al., 2025)	7B	87.5	91.6	87.6	69.0	81.1
	WorldVLA (Cen et al., 2025)	7B	87.6	96.2	83.4	60.0	81.8
	TraceVLA (Zheng et al., 2025)	7B	84.6	85.2	75.1	54.1	74.8
	MolmoAct (Lee et al., 2025)	7B	87.0	95.4	87.6	77.2	86.6
	ThinkAct (Huang et al., 2025)	7B	88.3	91.4	87.1	70.9	84.4
Small-scale	4D-VLA (Zhang et al., 2025)	4B	88.9	95.2	90.9	79.1	88.6
	SpatialVLA (Qu et al., 2025)	4B	88.2	89.9	78.6	55.5	78.1
	π0 (Black et al., 2024)	3B	96.8	98.8*	95.8	85.2	94.2
	π0-FAST (Pertsch et al., 2025)	3B	96.4	96.8	88.6	60.2	85.5
	NORA (Hung et al., 2025)	3B	92.2	95.4	89.4	74.6	87.9
	SmolVLA (Shukor et al., 2025)	2.2B	93.0	94.0	91.0	77.0	88.8
	GR00T N1 (NVIDIA et al., 2025)	2B	94.4	97.6	93.0	90.6	93.9
Tiny-scale	Seer (Tian et al., 2025)	0.57B	-	-	-	78.7	78.7
	VLA-OS (Gao et al., 2025)	0.5B	87.0	96.5	92.7	66.0	85.6
	Diffusion Policy (Chi et al., 2023)	-	78.3	92.5	68.3	50.5	72.4
	VLA-Adapter (Ours)	0.5B	97.8	99.2	97.2*	95.0	97.3
	VLA-Adapter-Pro (Ours)	0.5B	*99.6*	*99.6*	*98.2*	*96.4*	*98.5*
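
The Avg. column appears to be the unweighted mean of the four suite success rates (Spatial, Object, Goal, Long). This is inferred from the numbers in the table, not stated in the card. A small check on two rows copied from the table, with a hypothetical helper name:

```python
# Success rates (%) for Spatial, Object, Goal, Long, copied from the table.
libero_scores = {
    "OpenVLA-OFT": [97.6, 98.4, 97.9, 94.5],
    "VLA-Adapter": [97.8, 99.2, 97.2, 95.0],
}


def suite_average(scores):
    """Unweighted mean over the four LIBERO suites (assumed definition)."""
    return sum(scores) / len(scores)


for name, scores in libero_scores.items():
    print(f"{name}: {suite_average(scores):.1f}")
```

Under this assumption the script reproduces the reported averages of 97.1 and 97.3 for these two rows.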

CALVIN	Methods	Scale	1	2	3	4	5	Avg. len
Large-scale	UniVLA (Bu et al., 2025)	7B	95.5	85.8	75.4	66.9	56.5	3.80
	OpenVLA (Kim et al., 2024)	7B	91.3	77.8	62.0	52.1	43.5	3.27
	OpenVLA-OFT (Kim et al., 2025)	7B	96.3	89.1	82.4	75.8	66.5	4.10
	VLAS (Zhao et al., 2025b)	7B	87.2	64.2	40.9	28.1	19.6	2.40
	LCB (Shentu et al., 2024)	7B	73.6	50.2	28.5	16.0	9.9	1.78
	RoboDual (Bu et al., 2024a)	7B	94.4	82.7	72.1	62.4	54.4	3.66
	OpenHelix (Cui et al., 2025)	7B	97.1*	91.4	82.8	72.6	64.1	4.08
	ReconVLA (Song et al., 2025c)	7B	95.6	87.6	76.9	69.3	64.1	3.95
Small-scale	DeeR (Yue et al., 2024)	3B	86.2	70.1	51.8	41.5	30.4	2.82
	RoboFlamingo (Li et al., 2024b)	3B	82.4	61.9	46.6	33.1	23.5	2.48
	VPP (Hu et al., 2025)	1.5B	95.7	91.2	86.3*	81.0*	75.0*	4.33*
	SuSIE (Black et al., 2024)	1.3B	87.0	69.0	49.0	38.0	26.0	2.69
Tiny-scale	Seer-Large (Tian et al., 2025)	0.57B	96.3	91.6*	86.1	80.3	74.0	4.28
	MoDE (Reuss et al., 2025)	0.44B	96.2	88.9	81.1	71.8	63.5	4.01
	Seer (Tian et al., 2025)	0.32B	94.4	87.2	79.9	72.2	64.3	3.98
	VLA-Adapter (Ours)	0.5B	*99.1*	94.6	88.8	82.8	76.5	4.42
	VLA-Adapter-Pro (Ours)	0.5B	98.5	*95.0*	*90.5*	*85.3*	*80.0*	*4.50*
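
In CALVIN-style evaluation, columns 1 to 5 are usually read as the success rate (%) of completing at least that many tasks in a row, and Avg. len as the expected number of consecutively completed tasks. Under that reading, which is an assumption here, Avg. len equals the sum of the five success rates divided by 100. The sketch below checks this on two rows copied from the table; the helper name is hypothetical.

```python
# Success rates (%) for completing at least 1..5 tasks in a row,
# copied from the table.
calvin_rates = {
    "OpenVLA-OFT": [96.3, 89.1, 82.4, 75.8, 66.5],
    "VLA-Adapter": [99.1, 94.6, 88.8, 82.8, 76.5],
}


def average_length(rates_percent):
    """Expected number of consecutive tasks completed (assumed definition)."""
    return sum(rate / 100.0 for rate in rates_percent)


for name, rates in calvin_rates.items():
    print(f"{name}: {average_length(rates):.2f}")
```

For these rows the result agrees with the reported 4.10 and 4.42. Other rows can differ in the last digit, presumably because the per-column rates are themselves rounded.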

Citation instructions

@article{wang2025vlaadapter,
  author={Wang, Yihao and Ding, Pengxiang and Li, Lingxiao and Cui, Can and Ge, Zirui and Tong, Xinyang and Song, Wenxuan and Zhao, Han and Zhao, Wei and Hou, Pengxu and Huang, Siteng and Tang, Yifan and Wang, Wenhui and Zhang, Ru and Liu, Jianyi and Wang, Donglin},
  title={VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model},
  journal={arXiv preprint arXiv:2509.09372},
  year={2025}
}