clap-qwen3vl4b-libero

LIBERO post-trained checkpoint of CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos.

A single generalist policy covering all four LIBERO suites (Spatial, Object, Goal, Long). It is obtained by post-training LinShan/clap-qwen3vl4b with a rectified-flow continuous action head (CLAP-RF) that attends to the KV cache of the frozen next-token-prediction (NTP) backbone. Training is regularized by reverse-KL Knowledge Matching (KM) toward the frozen NTP reference and runs for 30k steps at batch size 128 on the union of the four suites.
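
The two ingredients named above can be illustrated with a toy sketch. It is not the CLAP implementation: the function names, the 3-dimensional action, and the velocity field are all hypothetical. It assumes "reverse KL" means KL(policy || reference), with the expectation taken under the trained policy, and that a rectified-flow head is sampled by integrating a velocity field from noise at t=0 to an action at t=1.

```python
import math


def reverse_kl(policy_probs, reference_probs):
    """KL(policy || reference) for two categorical distributions.

    Assumption: "reverse KL" is read in the usual distillation sense, where
    the expectation is taken under the trained policy, not the frozen
    reference. The reference must be non-zero wherever the policy is.
    """
    return sum(
        p * math.log(p / q)
        for p, q in zip(policy_probs, reference_probs)
        if p > 0.0
    )


def euler_sample(velocity, x0, num_steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical target: a single fixed 3-D action standing in for the data.
TARGET_ACTION = [0.5, -0.2, 0.1]


def toy_velocity(x, t):
    """Straight-line flow towards TARGET_ACTION (a stand-in for a learned head)."""
    return [(a - xi) / (1.0 - t) for a, xi in zip(TARGET_ACTION, x)]


noise = [1.3, -0.7, 0.4]  # stands in for a Gaussian noise sample
action = euler_sample(toy_velocity, noise)

kl_same = reverse_kl([0.7, 0.2, 0.1], [0.7, 0.2, 0.1])
kl_shifted = reverse_kl([0.7, 0.2, 0.1], [0.5, 0.3, 0.2])
```

In this toy setting the flow is exactly straight, so Euler integration lands on the target action, and the reverse KL is zero only when the policy matches the reference. A learned velocity network and the real token distributions would replace `toy_velocity` and the hard-coded probabilities.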

LIBERO performance

Spatial	Object	Goal	Long	Average
98.6	99.2	98.0	93.0	97.2
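
The Average column can be checked as the unweighted mean of the four suite scores. This is a small sanity check, assuming the average is taken over suites rather than over individual tasks or episodes:

```python
# Per-suite scores copied from the table above.
scores = {"Spatial": 98.6, "Object": 99.2, "Goal": 98.0, "Long": 93.0}

# Unweighted mean over the four suites.
average = sum(scores.values()) / len(scores)
print(f"Average: {average:.1f}")  # prints "Average: 97.2"
```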

Usage

hf download LinShan/clap-qwen3vl4b-libero \
  --local-dir ./ckpts/Checkpoints/libero_clap_s3_l32_qwen3vl4b_km_l16
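
The same download can also be scripted from Python. The sketch below assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function; the helper name `download_checkpoint` is hypothetical, and the repository ID and target directory are copied from the command above.

```python
REPO_ID = "LinShan/clap-qwen3vl4b-libero"
LOCAL_DIR = "./ckpts/Checkpoints/libero_clap_s3_l32_qwen3vl4b_km_l16"


def download_checkpoint(repo_id=REPO_ID, local_dir=LOCAL_DIR):
    """Download the full checkpoint repository and return its local path."""
    # Imported lazily so this helper can be defined without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_checkpoint()` once should place the files under the same directory that the CLI command targets.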

Evaluation uses a client–server split (policy server in the openclap env, simulator client in a separate libero env):

# Terminal 1 (openclap env)
bash examples/LIBERO/eval_files/run_policy_server.sh

# Terminal 2 (libero env)
bash examples/LIBERO/eval_files/eval_libero.sh
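
With a client-server split, the simulator client can only connect once the policy server is listening. The helper below is a minimal sketch for waiting on the server before launching the client. It assumes the server accepts TCP connections; the helper name, host, and port are hypothetical, and the real values would come from `run_policy_server.sh`.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=60.0, poll_s=0.5):
    """Return True once a TCP connection to (host, port) succeeds, else False."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=poll_s):
                return True
        except OSError:
            # Not up yet (refused or timed out): wait and retry.
            time.sleep(poll_s)
    return False
```

For example, `wait_for_port("127.0.0.1", port)` could be called in the libero environment before `eval_libero.sh`, with `port` set to whatever port the policy server is configured to use.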

Citation

@article{zhang2026clap,
  title={CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos},
  author={Zhang, Chubin and Wang, Jianan and Gao, Zifeng and Su, Yue and Dai, Tianru and Zhou, Cai and Lu, Jiwen and Tang, Yansong},
  journal={arXiv preprint arXiv:2601.04061},
  year={2026}
}