arxiv:2311.07562

GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation

Published on Nov 13, 2023

· Submitted by

akhaliq on Nov 14, 2023

Upvote

Authors:

An Yan ,

Zhengyuan Yang ,

Wanrong Zhu ,

Kevin Lin ,

Linjie Li ,

Jianfeng Wang ,

Jianwei Yang ,

Yiwu Zhong ,

Jianfeng Gao ,

Lijuan Wang

Abstract

We present MM-Navigator, a GPT-4V-based agent for the smartphone graphical user interface (GUI) navigation task. MM-Navigator can interact with a smartphone screen as human users, and determine subsequent actions to fulfill given instructions. Our findings demonstrate that large multimodal models (LMMs), specifically GPT-4V, excel in zero-shot GUI navigation through its advanced screen interpretation, action reasoning, and precise action localization capabilities. We first benchmark MM-Navigator on our collected iOS screen dataset. According to human assessments, the system exhibited a 91\% accuracy rate in generating reasonable action descriptions and a 75\% accuracy rate in executing the correct actions for single-step instructions on iOS. Additionally, we evaluate the model on a subset of an Android screen navigation dataset, where the model outperforms previous GUI navigators in a zero-shot fashion. Our benchmark and detailed analyses aim to lay a robust groundwork for future research into the GUI navigation task. The project page is at https://github.com/zzxslp/MM-Navigator.

View arXiv page View PDF Add to collection

Community

mikelabs

Nov 15, 2023

•

edited Nov 15, 2023

My summary:

Enabling AI to navigate and interact with smartphone UIs is hard, requiring a model that goes beyond mere text processing to handle intricate visual and interactive tasks. A new paper proposes MM-Navigator, an agent based on GPT-4V that can use an iPhone and make purchases on the Amazon app.

The agent can "understand" and interact with smartphone interfaces in a much more human-like manner than previous attempts.

The key innovation lies in GPT-4V's ability to process both text and image inputs. The agent takes a user's text instructions and the current screen image, then outputs a description of the next action, including precise screen locations. The researchers improved interaction accuracy by adding numeric tags to interactive elements on the screen, which GPT-4V references to indicate specific actions.

Testing on iOS and Android datasets showed promising results. GPT-4V's actions were correct 75% of the time for iOS screens, a notable achievement in visual grounding. A standout example was its successful navigation through various apps to purchase a milk frother on Amazon within a set budget.

There are limitations:

False negatives often arose from dataset or annotation issues. These errors often stem from issues with the dataset or the annotation process. In some cases, GPT-4V's predictions are correct but are marked as incorrect due to inaccuracies in Set-of-Mark annotation parsing or because the dataset annotations are imperfect.
True negatives highlighted limitations in the model's zero-shot testing approach. Without examples to guide its understanding of user action patterns, the model tends to prefer clicking over scrolling, leading to decisions that don't align with typical human actions.

If these limitations can be reduced, I could see this being useful for automating QA testing or assisting individuals with disabilities. This research underscores the complexities of developing AI for such sophisticated tasks and emphasizes the importance of accurate data and adaptable testing methods.

TLDR: MM-Navigator is an agent that can navigate a smartphone, combining text and image processing to interact with GUIs. Promising but still has plenty of flaws.

Full summary here