Agent-Eval-Refine/Captioner · Apply for community grant: Academic project (gpu)

Agent-Eval-Refine org Apr 4

This is an open-source dense captioner for graphical user interfaces (9B, fine-tuned from Qwen-VL). It will be part of our paper release next week on the autonomous evaluation and refinement of digital agents. We are also releasing the fine-tuning dataset on HF.

In the paper, we show that this captioner combined with Mixtral is competitive with GPT-4V for evaluating web/mobile agents and improving their performance.

Title: Autonomous Evaluation and Refinement of Digital Agents
Abstract:
We show that domain-general automated evaluators can significantly improve the performance of digital agents, without requiring access to any in-domain demonstration data or oracle evaluation metrics.
Our proposed approach autonomously evaluates and improves the performance of such language-conditioned digital agents, which complete user commands by executing a sequence of actions in a digital environment, such as a web browser or mobile device.
While recently proposed benchmarks and environments have pushed forward the development of these agents, the nature of benchmarks limits their realism and domain coverage.
To support development of domain-general digital agents, we explore the design and use of automated evaluator models to both evaluate and autonomously refine the performance of digital agents.
We experiment with multiple evaluator models that trade off between inference cost, modularity of design, and accuracy.
We validate the performance of these models in several popular benchmarks for digital agents, finding between 74.4% and 92.9% agreement with oracle evaluation metrics.
Finally, we experiment with using these automated evaluators to improve the performance of existing digital agents via inference-time guidance and filtered behavioral cloning.
We find that without requiring any demonstration data or domain-specific evaluation, we enhance the state-of-the-art agent's performance by 29% on the popular benchmark WebArena, and achieve a 75% improvement in a different domain transfer scenario.
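
For readers new to the term, here is a rough sketch of the filtering step in filtered behavioral cloning: an evaluator scores each agent rollout, and only rollouts judged successful are kept as fine-tuning data. This is not the paper's implementation. The `Trajectory` type, `filter_for_cloning`, the 0.5 threshold, and `toy_evaluator` are all hypothetical names made up for illustration; in practice the evaluator would be a model (for example a captioner plus a language model), not a keyword check.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    """One agent rollout (hypothetical structure, for illustration only)."""
    instruction: str
    actions: List[str]
    final_state: str


def filter_for_cloning(
    trajectories: List[Trajectory],
    evaluator: Callable[[Trajectory], float],
    threshold: float = 0.5,
) -> List[Trajectory]:
    """Keep only the rollouts whose evaluator score reaches the threshold."""
    return [t for t in trajectories if evaluator(t) >= threshold]


def toy_evaluator(t: Trajectory) -> float:
    """Stand-in evaluator (assumption): success if the final state mentions 'confirmed'."""
    return 1.0 if "confirmed" in t.final_state.lower() else 0.0


rollouts = [
    Trajectory("buy a mug", ["click(cart)", "click(checkout)"], "Order confirmed: mug"),
    Trajectory("buy a mug", ["click(home)"], "Home page"),
]

# The kept rollouts would then serve as supervised fine-tuning examples.
kept = filter_for_cloning(rollouts, toy_evaluator)
```

With the toy data above, only the first rollout passes the filter. No demonstration data is involved: the training set comes entirely from the agent's own attempts, selected by the evaluator.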

hysts

Apr 5

Hi @Jiayi-Pan, we've assigned ZeroGPU to this Space. Please check the compatibility and usage sections of this page so your Space can run on ZeroGPU.

Jiayi-Pan

Agent-Eval-Refine org Apr 5

works like a charm, thank you!

Jiayi-Pan changed discussion status to closed Apr 5