Theia: Distilling Diverse Vision Foundation Models for Robot Learning
Abstract
Vision-based robot policy learning, which maps visual inputs to actions, necessitates a holistic understanding of diverse visual tasks beyond single-task needs like classification or segmentation. Inspired by this, we introduce Theia, a vision foundation model for robot learning that distills multiple off-the-shelf vision foundation models trained on varied vision tasks. Theia's rich visual representations encode diverse visual knowledge, enhancing downstream robot learning. Extensive experiments demonstrate that Theia outperforms its teacher models and prior robot learning models using less training data and smaller model sizes. Additionally, we quantify the quality of pre-trained visual representations and hypothesize that higher entropy in feature norm distributions leads to improved robot learning performance. Code and models are available at https://github.com/bdaiinstitute/theia.
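As a rough illustration of the two ideas in the abstract, the sketch below shows (1) a student encoder trained to match the features of several frozen off-the-shelf vision foundation models through per-teacher translation heads, and (2) an entropy estimate over the distribution of per-token feature norms, the quality proxy the abstract hypothesizes. This is not the released Theia code; the class names, the cosine-plus-MSE matching loss, and the histogram binning are assumptions made for illustration.

```python
# Minimal sketch (not the authors' released implementation) of multi-teacher
# feature distillation and of the feature-norm-entropy quality proxy.
# All names, shapes, and loss choices here are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTeacherDistiller(nn.Module):
    """Student backbone plus one translation head per teacher (illustrative)."""

    def __init__(self, student: nn.Module, student_dim: int, teacher_dims: dict[str, int]):
        super().__init__()
        self.student = student  # e.g. a compact ViT returning per-patch features (B, N, D)
        self.heads = nn.ModuleDict({
            name: nn.Linear(student_dim, dim) for name, dim in teacher_dims.items()
        })

    def forward(self, images: torch.Tensor) -> dict[str, torch.Tensor]:
        feats = self.student(images)               # (B, N, student_dim)
        return {name: head(feats) for name, head in self.heads.items()}


def distillation_loss(predictions: dict[str, torch.Tensor],
                      teacher_feats: dict[str, torch.Tensor]) -> torch.Tensor:
    """Average per-teacher feature-matching loss (cosine + MSE used as a stand-in)."""
    losses = []
    for name, pred in predictions.items():
        target = teacher_feats[name].detach()      # teachers stay frozen
        cos = 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
        mse = F.mse_loss(pred, target)
        losses.append(cos + mse)
    return torch.stack(losses).mean()


def feature_norm_entropy(features: torch.Tensor, num_bins: int = 64) -> float:
    """Entropy of the histogram of per-token L2 feature norms.

    The abstract hypothesizes that higher entropy of this distribution
    correlates with better robot learning; the binning scheme is an assumption.
    """
    norms = features.flatten(0, -2).norm(dim=-1)   # one norm per spatial token
    hist = torch.histc(norms, bins=num_bins,
                       min=float(norms.min()), max=float(norms.max()))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * p.log()).sum())
```

In such a setup, the frozen teachers are only needed during distillation; afterwards the compact student is used on its own as the visual encoder for downstream robot policy learning, which is how the abstract describes Theia being applied.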
Community
Theia is a robot vision foundation model built by distilling existing vision foundation models; it improves downstream robot learning performance while using a smaller model.
Project page: http://theia.theaiinstitute.com/
Related papers recommended by the Librarian Bot (via the Semantic Scholar API):
- Learning Manipulation by Predicting Interaction (2024)
- Pretrained Visual Representations in Reinforcement Learning (2024)
- LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning (2024)
- OpenVLA: An Open-Source Vision-Language-Action Model (2024)
- HRP: Human Affordances for Robotic Pre-Training (2024)
Models citing this paper: 6
Datasets citing this paper: 0
Spaces citing this paper: 0