clap-qwen3vl4b
Stage 3 (CLAP-NTP) checkpoint of CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos.
A Qwen3-VL-4B autoregressive VLA policy pretrained on a mixture of robot teleoperation data (AgiBot, Astribot S1, DROID) and unlabeled human egocentric videos (Ego4D), using a frozen CLAP latent action tokenizer trained in Stages 1–2 (Act-VAE + VD-VAE). The model outputs subtask text and discrete CLAP action tokens. The folder also bundles clap.ckpt, the frozen Stage-2 tokenizer used by downstream post-training (CLAP-RF + Knowledge Matching).
This is the pretrained generalist checkpoint, typically used as the initialization for target-domain post-training (e.g. Astribot S1 sync deployment, LIBERO finetuning). Without further post-training, the model also performs well on pick-and-place tasks on Astribot S1 robots.
Links
- 💻 Code: https://github.com/LinShan-Bin/OpenCLAP
- 📄 Paper: https://arxiv.org/abs/2601.04061
- 🌐 Project page: https://lin-shan.com/CLAP/
Usage
hf download LinShan/clap-qwen3vl4b --local-dir ./pretrained/clap-s3-l32
See the README in the GitHub repo for training and post-training recipes (QwenAR for NTP, QwenPIKM for the rectified-flow head with Knowledge Matching).
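The download command above can also be scripted from Python. The following is a minimal sketch, assuming the `huggingface_hub` package is installed. The helper names `fetch_checkpoint` and `has_tokenizer` are hypothetical and not part of the OpenCLAP code. The exact location of clap.ckpt inside the folder is not stated here, so the check searches the whole directory tree.

```python
from pathlib import Path


def fetch_checkpoint(local_dir: str = "./pretrained/clap-s3-l32") -> Path:
    """Download the model repo into local_dir (Python counterpart of `hf download`)."""
    # Imported lazily so has_tokenizer() works even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(repo_id="LinShan/clap-qwen3vl4b", local_dir=local_dir)
    return Path(path)


def has_tokenizer(local_dir: str, name: str = "clap.ckpt") -> bool:
    """Return True if a file called `name` exists anywhere under local_dir.

    Assumption: the bundled tokenizer keeps the file name clap.ckpt; its
    position within the folder is unknown, hence the recursive search.
    """
    root = Path(local_dir)
    return root.is_dir() and any(root.rglob(name))


if __name__ == "__main__":
    folder = fetch_checkpoint()
    print(folder, "tokenizer present:", has_tokenizer(str(folder)))
```

Running the script downloads the full repository, so it needs network access and enough disk space for a 4B-parameter checkpoint. Checking for clap.ckpt up front is a cheap way to catch an incomplete download before starting post-training.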
Citation
@article{zhang2026clap,
title={CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos},
author={Zhang, Chubin and Wang, Jianan and Gao, Zifeng and Su, Yue and Dai, Tianru and Zhou, Cai and Lu, Jiwen and Tang, Yansong},
journal={arXiv preprint arXiv:2601.04061},
year={2026}
}