arxiv:2606.12366

APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies

Published on Jun 10 · Submitted by Kechun Xu on Jun 15

Abstract

Researchers address poor generalization in Vision-Language-Action models by proposing APT, a two-stage training method that pretrains action experts using vision-action pairs before integrating language conditioning to improve out-of-distribution instruction performance.

Vision-Language-Action (VLA) models that couple pretrained Vision-Language Models (VLMs) with continuous action experts have achieved strong manipulation performance, yet generalization to out-of-distribution (OOD) language instructions remains poor. A known challenge is the structural imbalance in VLA data, where language is far less diverse than visual and action content, making policies prone to visual shortcuts. While discrete-action methods mitigate this through vision-language co-training, continuous action experts lack such protection: they start from random initialization and learn entirely from imbalanced data, producing noisy gradients that corrupt the VLM and fail to exploit its language capability. We address this from a Bayesian perspective, factorizing the policy into a language-agnostic Vision-Action (VA) prior and a language-conditioned VLA likelihood, and propose APT, a two-stage training method emphasizing Action expert PreTraining. In Stage 1, the action expert is pretrained as a VA prior on vision-action pairs from a frozen VLM, bypassing the language imbalance. In Stage 2, language tokens are injected through a gated fusion mechanism that integrates VLM features while preserving the learned visuomotor prior. APT applies to mainstream VLA architectures, including the π and GR00T-style architectures. Comprehensive experiments validate that APT achieves consistent gains on unseen instructions and compositional tasks. Project Page: https://xukechun.github.io/papers/APT/
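The abstract does not specify how the Stage 2 gated fusion is built, so the following is only a minimal sketch of one common way to inject new conditioning while preserving a pretrained pathway: a gated residual whose gate starts near zero. The class name `GatedFusion`, the parameter `gate_bias`, the tensor shapes, and the use of NumPy are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


class GatedFusion:
    """Hypothetical gated residual fusion layer (illustrative, not the paper's layer).

    The gate is initialized close to zero, so at the start of Stage 2 the
    output is approximately the pretrained action-expert hidden state.
    """

    def __init__(self, dim, gate_bias=-10.0, seed=0):
        rng = np.random.default_rng(seed)
        # Projection applied to the language feature before it is added.
        self.w_proj = rng.normal(0.0, 0.02, size=(dim, dim))
        # Gate weights start at zero; a strongly negative bias gives gate ~ 0.
        self.w_gate = np.zeros((2 * dim, dim))
        self.b_gate = np.full(dim, gate_bias)

    def __call__(self, action_hidden, lang_feature):
        # action_hidden: (tokens, dim) hidden states of the action expert.
        # lang_feature:  (dim,) pooled language feature from the VLM (assumed).
        lang = np.broadcast_to(lang_feature, action_hidden.shape)
        gate_in = np.concatenate([action_hidden, lang], axis=-1)
        gate = sigmoid(gate_in @ self.w_gate + self.b_gate)
        # Residual form: the visuomotor pathway passes through unchanged
        # wherever the gate is closed.
        return action_hidden + gate * (lang @ self.w_proj)
```

With the gate closed, the layer behaves like an identity on the action expert's features, which is one way a learned prior could be kept intact while the gate is gradually opened by training.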

Community

Paper submitter

We improve out-of-distribution language generalization of continuous-action VLA policies through action expert pretraining. Guided by a Bayesian factorization, we first pretrain the action expert as a language-agnostic Vision-Action (VA) prior, then inject language to form the VLA likelihood.
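The factorization mentioned here can be written with Bayes' rule. The notation below is an assumption for illustration (action $a$, visual observation $v$, language instruction $\ell$); the paper's own symbols and exact form may differ.

```latex
\pi(a \mid v, \ell)
  \;=\; \frac{p(\ell \mid v, a)\, p(a \mid v)}{p(\ell \mid v)}
  \;\propto\;
  \underbrace{p(a \mid v)}_{\text{language-agnostic VA prior}}
  \cdot
  \underbrace{p(\ell \mid v, a)}_{\text{language-conditioned likelihood}}
```

Read this way, the first factor depends only on vision-action pairs and so can be learned without language, and the second factor is where the instruction enters.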
