Papers
arxiv:2305.08144

Mobile-Env: An Evaluation Platform and Benchmark for Interactive Agents in LLM Era

Published on May 14, 2023
Authors:
,
,

Abstract

Diverse evaluation benchmarks play a crucial role to assess a wide range of capabilities of large language models (LLM). Although plenty of endeavors have been dedicated to building valuable benchmarks, there is still little work aiming at evaluating the capability of LLM in multistep interactive environments. Noticing that LLM requires a text representation of the environment observations for interaction, we choose to fill such a blank by building a novel benchmark based on the information user interface (InfoUI). InfoUI consists of rich text contents and can be represented in some text formats, thus is suitable for the assessment of interaction ability of LLM. Additionally, the complex structures of InfoUI can further raise a challenge for LLM to understand structured texts rather than plain texts. An interaction platform is always used to evaluate an agent, however, there is still a lack of a satisfactory interaction platform dedicated to InfoUI. Consequently, we propose to build a novel easily-extendable, adaptable, and close-to-reality interaction platform, Mobile-Env, to provide a base for an appropriate benchmark. Based on Mobile-Env, an InfoUI task set WikiHow is then built to establish a benchmark for the multistep interaction capability of LLM in structured text-based environments. Agents based on a series of LLMs are tested on the task set to obtain an insight into the potential and challenge of LLM for InfoUI interaction. It is sincerely welcome that the community contribute new environments and new task sets for Mobile-Env to provide better test benchmarks and facilitate the development of the corresponding domains.

Community

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2305.08144 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2305.08144 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.