CARP: Visuomotor Policy Learning
via Coarse-to-Fine Autoregressive Prediction

Zhefei Gong¹, Pengxiang Ding¹², Shangke Lyu¹, Siteng Huang¹², Mingyang Sun¹², Wei Zhao¹,
Zhaoxin Fan³, Donglin Wang¹✉
¹Westlake University, ²Zhejiang University,
³Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing

👀 Overview

TL;DR: We introduce Coarse-to-Fine AutoRegressive Policy (CARP), a novel paradigm for visuomotor policy learning that redefines the autoregressive action generation process as a coarse-to-fine, next-scale approach.

In the overview figure, the left panel shows the final predicted trajectories for each task, with CARP producing smoother and more consistent paths than Diffusion Policy (DP). The right panel visualizes intermediate trajectories during the refinement process for CARP (top-right) and DP (bottom-right). DP displays considerable redundancy across its denoising process (6 of its 100 denoising steps are shown), which results in slower processing and unstable training. In contrast, CARP achieves efficient trajectory refinement across all 4 scales, with each step contributing meaningful updates.
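To make the coarse-to-fine idea concrete, here is a toy sketch in plain Python. It is not the CARP implementation: the function names, the 1-D sine "trajectory", the scale lengths `[1, 2, 4, 16]`, and the chunk-averaging and repeat resampling are all assumptions chosen for illustration. An oracle that can see the target stands in for a learned predictor, so the sketch only shows how a few scale-wise updates can each contribute a meaningful refinement.

```python
import math


def resize_mean(values, length):
    """Downsample by averaging equal-sized chunks (assumes len(values) % length == 0)."""
    chunk = len(values) // length
    return [sum(values[i * chunk:(i + 1) * chunk]) / chunk for i in range(length)]


def resize_repeat(values, length):
    """Upsample by repeating each value (assumes length % len(values) == 0)."""
    repeat = length // len(values)
    return [v for v in values for _ in range(repeat)]


def coarse_to_fine(target, scales):
    """Rebuild `target` by adding one residual update per scale, coarsest first."""
    horizon = len(target)
    recon = [0.0] * horizon
    errors = []
    for s in scales:
        # Oracle stand-in for a learned model: what is still missing at this scale.
        residual = [t - r for t, r in zip(target, recon)]
        coarse = resize_mean(residual, s)        # s-point summary of the residual
        update = resize_repeat(coarse, horizon)  # back to full resolution
        recon = [r + u for r, u in zip(recon, update)]
        errors.append(sum((t - r) ** 2 for t, r in zip(target, recon)))
    return recon, errors


# Hypothetical 16-step, 1-D action trajectory refined over 4 scales.
target = [math.sin(2 * math.pi * i / 16) for i in range(16)]
recon, errors = coarse_to_fine(target, scales=[1, 2, 4, 16])
```

In this toy setup the squared error cannot increase from one scale to the next, and the last scale (one value per time step) recovers the target up to floating-point rounding. A real policy would replace the oracle residual with a prediction conditioned on observations and on the coarser scales.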

🙏 Acknowledgment

We sincerely thank the creators of the excellent repositories, including Visual Autoregressive Model, Diffusion Policy, and Sparse Diffusion Policy, which have provided invaluable inspiration.

🏷️ License

This repository is released under the MIT license. See the LICENSE file for additional details.

📌 Citation

If our findings contribute to your research, please consider citing our paper in your publications.

@misc{gong2024carpvisuomotorpolicylearning,
      title={CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction}, 
      author={Zhefei Gong and Pengxiang Ding and Shangke Lyu and Siteng Huang and Mingyang Sun and Wei Zhao and Zhaoxin Fan and Donglin Wang},
      year={2024},
      eprint={2412.06782},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2412.06782}, 
}