clap-qwen3vl4b

Stage 3 (CLAP-NTP) checkpoint of CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos.

The model is a Qwen3-VL-4B autoregressive VLA policy pretrained on a mixture of robot teleoperation data (AgiBot, Astribot S1, DROID) and unlabeled human egocentric videos (Ego4D), using a frozen CLAP latent action tokenizer trained in Stages 1–2 (Act-VAE + VD-VAE). It outputs subtask text and discrete CLAP action tokens. The folder also bundles clap.ckpt, the frozen Stage-2 tokenizer used by downstream post-training (CLAP-RF + Knowledge Matching).

This is the pretrained generalist checkpoint, typically used as the initialization for target-domain post-training (e.g. Astribot S1 sync deployment, LIBERO finetuning). The checkpoint itself also performs well on pick-and-place tasks on Astribot S1 robots.

Usage

hf download LinShan/clap-qwen3vl4b --local-dir ./pretrained/clap-s3-l32
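
The same download can be scripted from Python. The sketch below is a minimal example, not part of the CLAP codebase: it uses snapshot_download from the huggingface_hub package with the repo id and local directory from the command above, and then looks for the bundled clap.ckpt. The helper names download_checkpoint and find_tokenizer are hypothetical, and because this card does not state where clap.ckpt sits inside the folder, the helper searches recursively instead of assuming a path.

```python
from pathlib import Path
from typing import Optional

REPO_ID = "LinShan/clap-qwen3vl4b"            # repo id from the hf command above
LOCAL_DIR = Path("./pretrained/clap-s3-l32")  # local dir from the hf command above
TOKENIZER_FILE = "clap.ckpt"                  # frozen Stage-2 tokenizer named in this card


def download_checkpoint(local_dir: Path = LOCAL_DIR) -> Path:
    """Download the full repo snapshot into local_dir and return that folder."""
    # Imported lazily so find_tokenizer() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=REPO_ID, local_dir=str(local_dir)))


def find_tokenizer(local_dir: Path = LOCAL_DIR) -> Optional[Path]:
    """Return the path to clap.ckpt under local_dir, or None if it is missing."""
    local_dir = Path(local_dir)
    if not local_dir.is_dir():
        return None
    # Assumption: the file's location inside the folder is unknown,
    # so search recursively rather than hard-coding a relative path.
    matches = sorted(local_dir.rglob(TOKENIZER_FILE))
    return matches[0] if matches else None
```

Calling download_checkpoint() and then find_tokenizer() on the returned folder gives a quick check that the tokenizer checkpoint arrived before starting post-training; a None result suggests an incomplete download.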

See the README in the GitHub repo for training and post-training recipes (QwenAR for NTP, QwenPIKM for the rectified-flow head with Knowledge Matching).

Citation

@article{zhang2026clap,
  title={CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos},
  author={Zhang, Chubin and Wang, Jianan and Gao, Zifeng and Su, Yue and Dai, Tianru and Zhou, Cai and Lu, Jiwen and Tang, Yansong},
  journal={arXiv preprint arXiv:2601.04061},
  year={2026}
}