arxiv:2311.06430

GOAT: GO to Any Thing

Published on Nov 10, 2023
· Featured in Daily Papers on Nov 14, 2023

Abstract

In deployment scenarios such as homes and warehouses, mobile robots are expected to autonomously navigate for extended periods, seamlessly executing tasks articulated in terms that are intuitively understandable by human operators. We present GO To Any Thing (GOAT), a universal navigation system capable of tackling these requirements with three key features: a) Multimodal: it can tackle goals specified via category labels, target images, and language descriptions, b) Lifelong: it benefits from its past experience in the same environment, and c) Platform Agnostic: it can be quickly deployed on robots with different embodiments. GOAT is made possible through a modular system design and a continually augmented instance-aware semantic memory that keeps track of the appearance of objects from different viewpoints in addition to category-level semantics. This enables GOAT to distinguish between different instances of the same category to enable navigation to targets specified by images and language descriptions. In experimental comparisons spanning over 90 hours in 9 different homes consisting of 675 goals selected across 200+ different object instances, we find GOAT achieves an overall success rate of 83%, surpassing previous methods and ablations by 32% (absolute improvement). GOAT improves with experience in the environment, from a 60% success rate at the first goal to a 90% success rate after exploration. In addition, we demonstrate that GOAT can readily be applied to downstream tasks such as pick and place and social navigation.

Community

Awesome paper title 🔥

  • Proposes GOAT (GO to Any Thing): a universal navigation system that is multimodal (goals specified as a category label, a goal image, or a language description), lifelong (it reuses past experience in the same environment through an episodic, instance-aware semantic memory), and platform/robot agnostic. CLIP handles language-image matching and SuperGlue handles image-image matching. Object instances are stored as nodes holding their images and their location in a top-down map of the environment; if the query matches a stored instance, the task reduces to point-goal navigation, otherwise the agent picks an exploration goal (see the memory/goal-resolution sketch after this list).
  • Outperforms CLIP on Wheels (CoW) on success rate and SPL (Success weighted by Path Length: success weighted by the ratio of the shortest-path length to the agent's actual path length; see the formula sketch after this list) in visually diverse real homes (custom evaluation set) using the Spot quadruped and Hello Robot Stretch; GOAT without memory (semantic map reset at every goal episode) still beats CoW, and with memory it improves further.
    • The gains come from the modular system design, geometric verification of image-goal matches via SuperGlue keypoint correspondences, and category filtering (matching only against instances of the goal category).
  • The agent receives RGB-D observations and pose; the perception stack performs instance segmentation, depth completion, geometric projection, and dynamic instance mapping. The semantic map and the object instance memory form the agent state; the global policy takes this state and the goal and produces a long-term goal (a location in the map), and the local policy turns the map and that goal into actions.
    • Mask R-CNN (ResNet backbone, trained on MS-COCO) performs object detection and instance segmentation on RGB; holes in the sensor depth are filled by predicting depth with MiDaS and fitting a scale and offset to the valid depth readings by least squares (see the depth-alignment sketch after this list); the labelled points are voxelized and projected to a bird's-eye view (BEV, see the projection sketch after this list), giving a K×M×M semantic map per timestep, where K covers the object categories plus obstacle, explored-area, current-location, and past-location channels.
    • The object instance memory stores, for each instance, its location in the semantic map and the images in which it was seen (used later for local matching).
  • Global policy for reaching the goal: category goals are looked up directly in the semantic map; language goals are parsed with Mistral-7B and then matched against the object instance memory via CLIP language-image features; image goals are first filtered by category and then matched with SuperGlue; if a match is found in memory, its exact location becomes the point-navigation goal, otherwise the agent falls back to frontier-based exploration (navigating to the closest unexplored region). A sketch of this goal resolution appears after the list.
    • The local policy plans with the Fast Marching Method (a waypoint controller on Spot, a custom controller on Stretch); a minimal FMM planning sketch also follows. The supplementary material covers the instance-matching design choices (CLIP vs. SuperGlue, matching thresholds, category filtering, etc.) with quantitative and qualitative results.
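
For reference, SPL (Success weighted by Path Length) averages, over episodes, the binary success weighted by the ratio of the shortest-path length to the path the agent actually took. A minimal sketch of the computation (variable names are mine, not the paper's):

```python
def spl(successes, optimal_lengths, agent_lengths):
    """Success weighted by Path Length, averaged over episodes.

    successes[i]       -- 1 if episode i reached the goal, else 0
    optimal_lengths[i] -- shortest-path length from start to goal (l_i)
    agent_lengths[i]   -- length of the path the agent actually took (p_i)
    """
    terms = [
        s * l / max(p, l)
        for s, l, p in zip(successes, optimal_lengths, agent_lengths)
    ]
    return sum(terms) / len(terms)
```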
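The depth hole-filling step, as I read the summary above: MiDaS gives depth only up to an unknown scale and offset, so both are fit against the valid sensor readings by least squares and the aligned prediction fills the holes. This is a reconstruction under that assumption, not the authors' code; `sensor_depth` and `midas_depth` are hypothetical inputs.

```python
import numpy as np

def fill_depth_holes(sensor_depth, midas_depth):
    """Fill invalid sensor-depth pixels with a MiDaS prediction
    aligned to the sensor via a least-squares scale + offset."""
    valid = sensor_depth > 0                      # sensor holes are 0 / invalid
    # Solve sensor ≈ a * midas + b over the valid pixels.
    A = np.stack([midas_depth[valid], np.ones(valid.sum())], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, sensor_depth[valid], rcond=None)
    filled = sensor_depth.copy()
    filled[~valid] = a * midas_depth[~valid] + b
    return filled
```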
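A rough sketch of how the labelled depth could be turned into the K×M×M map: back-project each labelled pixel into the world frame and accumulate it into the matching category channel of a top-down grid. For brevity this bins straight to 2D instead of going through an explicit 3D voxel grid, and the channel layout and cell sizes are illustrative, not the paper's.

```python
import numpy as np

def update_semantic_map(semantic_map, depth, labels, K_intr, cam_to_world,
                        map_size_m=24.0, cell_size_m=0.05):
    """Project a labelled depth image into a K x M x M top-down semantic map.

    semantic_map -- (K, M, M) array of per-category occupancy counts
    depth        -- (H, W) metric depth image
    labels       -- (H, W) integer category id per pixel (-1 = background)
    K_intr       -- 3x3 camera intrinsics
    cam_to_world -- 4x4 camera-to-world pose
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    valid = (depth > 0) & (labels >= 0)
    # Back-project labelled pixels to camera-frame 3D points.
    z = depth[valid]
    x = (u[valid] - K_intr[0, 2]) * z / K_intr[0, 0]
    y = (v[valid] - K_intr[1, 2]) * z / K_intr[1, 1]
    pts = cam_to_world @ np.stack([x, y, z, np.ones_like(z)])
    # Bin world x/y into map cells (map centred on the world origin).
    M = semantic_map.shape[1]
    cols = np.clip(((pts[0] + map_size_m / 2) / cell_size_m).astype(int), 0, M - 1)
    rows = np.clip(((pts[1] + map_size_m / 2) / cell_size_m).astype(int), 0, M - 1)
    np.add.at(semantic_map, (labels[valid], rows, cols), 1)
    return semantic_map
```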
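A minimal sketch of the instance memory and goal-resolution control flow described above. The data layout and the matching helper (`superglue_match`) are stand-ins, language parsing with Mistral-7B is assumed to have already produced the goal category and a CLIP text embedding, and the threshold is made up; only the branching mirrors the summary.

```python
from dataclasses import dataclass, field
import numpy as np

def superglue_match(goal_image, stored_view, min_matches=10):
    """Placeholder: the real system counts SuperGlue keypoint correspondences
    and accepts a match if enough survive geometric verification."""
    raise NotImplementedError("plug in a SuperGlue matcher here")

@dataclass
class Instance:
    category: str
    map_location: tuple                                  # (row, col) in the top-down map
    views: list = field(default_factory=list)            # stored images of this instance
    clip_features: list = field(default_factory=list)    # one unit-norm vector per view

@dataclass
class InstanceMemory:
    instances: list = field(default_factory=list)

    def match_category(self, category):
        return [inst for inst in self.instances if inst.category == category]

    def match_language(self, text_feature, category, threshold=0.28):
        """Match a CLIP text embedding against stored CLIP image embeddings
        of instances in the goal category (threshold is illustrative)."""
        best, best_score = None, threshold
        for inst in self.match_category(category):
            for feat in inst.clip_features:
                score = float(np.dot(text_feature, feat))  # cosine sim for unit vectors
                if score > best_score:
                    best, best_score = inst, score
        return best

def resolve_goal(memory, goal, frontier_goal):
    """Return a (row, col) point-goal: a memory match if one exists,
    otherwise the nearest-frontier exploration goal."""
    if goal["type"] == "category":
        hits = memory.match_category(goal["category"])
        return hits[0].map_location if hits else frontier_goal
    if goal["type"] == "language":
        hit = memory.match_language(goal["text_feature"], goal["category"])
        return hit.map_location if hit else frontier_goal
    if goal["type"] == "image":
        # Image goals: filter by category, then verify views with SuperGlue.
        for inst in memory.match_category(goal["category"]):
            if any(superglue_match(goal["image"], view) for view in inst.views):
                return inst.map_location
        return frontier_goal
    raise ValueError(f"unknown goal type: {goal['type']}")
```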
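The local policy is only named as Fast Marching Method above; the sketch below shows one common way to use FMM for waypoint planning, via the scikit-fmm package: compute a geodesic distance field from the goal through free space, then greedily descend it from the agent's cell. This illustrates the technique, not the paper's actual planner.

```python
import numpy as np
import skfmm  # scikit-fmm

def fmm_plan(obstacle_map, goal, start, max_steps=500):
    """Greedy path over an FMM geodesic distance field.

    obstacle_map -- (M, M) bool array, True where blocked
    goal, start  -- (row, col) cells in free space
    """
    # Level-set function that is negative only at the goal; obstacles are
    # masked out so distances propagate through free space only.
    phi = np.ones_like(obstacle_map, dtype=float)
    phi[goal] = -1.0
    dist = skfmm.distance(np.ma.MaskedArray(phi, obstacle_map), dx=1.0)
    dist = dist.filled(np.inf)

    path, cur = [start], start
    for _ in range(max_steps):
        if cur == goal:
            break
        r, c = cur
        # Step to the 8-neighbour with the smallest remaining distance.
        neighbours = [(r + dr, c + dc)
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                      if (dr, dc) != (0, 0)
                      and 0 <= r + dr < dist.shape[0]
                      and 0 <= c + dc < dist.shape[1]]
        cur = min(neighbours, key=lambda rc: dist[rc])
        path.append(cur)
    return path
```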

From UIUC (Saurabh Gupta), CMU, Georgia Tech, UC Berkeley (Jitendra Malik), Meta (Dhruv Batra) and Mistral (DS Chaplot).

Links: arxiv, website, GitHub
