LIBERO post-trained checkpoint of CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos.
This checkpoint is a single generalist policy covering all four LIBERO suites (Spatial / Object / Goal / Long). It was obtained by post-training LinShan/clap-qwen3vl4b with a rectified-flow continuous action head (CLAP-RF) that attends to the KV cache of the frozen NTP backbone. Training is regularized by reverse-KL Knowledge Matching (KM) toward the frozen NTP reference, and ran for 30k steps at batch size 128 on the union of the four LIBERO suites.
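For readers unfamiliar with the two ingredients named above, the following is a generic, dependency-free sketch of rectified flow and of a reverse-KL penalty. It is not the CLAP-RF implementation, which this card does not spell out. Several things here are assumptions made for illustration: noise sits at t=0 and the action at t=1, a closed-form toy velocity field stands in for the learned action head, the KV-cache attention is omitted entirely, and KM is modelled as a KL divergence between two categorical token distributions. All function names are hypothetical.

```python
import math
import random


def interpolate(noise, action, t):
    """Point on the straight line from noise (t=0) to action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def target_velocity(noise, action):
    """Rectified-flow regression target: the constant velocity along that line."""
    return [a - n for n, a in zip(noise, action)]


def euler_sample(velocity_fn, noise, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def reverse_kl(p_policy, p_ref, eps=1e-12):
    """KL(policy || ref) for two categorical distributions over the same support."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(p_policy, p_ref)
        if p > 0.0
    )


# Toy check (assumption: one fixed target action instead of a learned network).
random.seed(0)
action = [0.25, -0.5, 1.0]
noise = [random.gauss(0.0, 1.0) for _ in action]


def ideal_velocity(x, t):
    # For a single target, the exact velocity at (x, t) is (action - x) / (1 - t).
    return [(a - xi) / (1.0 - t) for a, xi in zip(action, x)]


sampled = euler_sample(ideal_velocity, noise, num_steps=10)

# Reverse KL is zero when the policy matches the reference, positive otherwise.
km_penalty = reverse_kl([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
```

In this toy setting the Euler integration lands on `action`, because the velocity field is exact for a single target. A trained head would instead predict the velocity from observations and language, and the reverse-KL term would be evaluated against the frozen reference's outputs.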
| Spatial | Object | Goal | Long | Average |
|---|---|---|---|---|
| 98.6 | 99.2 | 98.0 | 93.0 | 97.2 |
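The Average column appears to be the unweighted mean of the four suite scores (an assumption, since the card does not state how it is computed). A quick arithmetic check:

```python
# Per-suite scores copied from the table above.
scores = {"Spatial": 98.6, "Object": 99.2, "Goal": 98.0, "Long": 93.0}

# Unweighted mean over the four suites, rounded to one decimal place.
average = round(sum(scores.values()) / len(scores), 1)
print(average)
```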
```bash
hf download LinShan/clap-qwen3vl4b-libero \
  --local-dir ./ckpts/Checkpoints/libero_clap_s3_l32_qwen3vl4b_km_l16
```
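The same download can likely be scripted from Python with `snapshot_download` from the `huggingface_hub` package. This is a minimal sketch that assumes the package is installed; the wrapper name `download_checkpoint` is hypothetical, and the repo ID and target directory are copied from the command above.

```python
REPO_ID = "LinShan/clap-qwen3vl4b-libero"
LOCAL_DIR = "./ckpts/Checkpoints/libero_clap_s3_l32_qwen3vl4b_km_l16"


def download_checkpoint(repo_id=REPO_ID, local_dir=LOCAL_DIR):
    """Fetch every file of the model repo into local_dir and return that path."""
    # Imported lazily so this module loads even if huggingface_hub is missing.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_checkpoint()` with no arguments should place the files in the same directory as the CLI command.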
Evaluation uses a client–server split (policy server in the openclap env, simulator client in a separate libero env):
```bash
# Terminal 1 (openclap env)
bash examples/LIBERO/eval_files/run_policy_server.sh

# Terminal 2 (libero env)
bash examples/LIBERO/eval_files/eval_libero.sh
```
```bibtex
@article{zhang2026clap,
  title={CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos},
  author={Zhang, Chubin and Wang, Jianan and Gao, Zifeng and Su, Yue and Dai, Tianru and Zhou, Cai and Lu, Jiwen and Tang, Yansong},
  journal={arXiv preprint arXiv:2601.04061},
  year={2026}
}
```