Agent Loop

#6
by Maverick17 - opened

Hey,

how is the model in UI navigation mode supposed to work in a typical agent loop? Should I provide the past steps and the current image as the agent moves along a trajectory? Could you please provide an MVP example for Android?

@Maverick17 Hi, yes!

What you need to do is simply provide the past outputs as the history for the current step.
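For example, roughly like this (just a sketch; the exact history format should follow whatever prompt format the model was trained with):

history = []  # predicted actions so far, newest last

def build_query(task, history):
    # Hypothetical prompt assembly; adjust to the actual training prompt.
    if not history:
        return f"Task: {task}"
    past = "\n".join(f"Step {i + 1}: {a}" for i, a in enumerate(history))
    return f"Task: {task}\nPast actions:\n{past}"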

By the way, what do you mean by "MVP"?

Hello @KevinQHLin ,

MVP stands for "Minimum Viable Product"; in your case it would basically be a runnable code snippet for the agent (for example, mobile navigation on an Android emulator)...

So, in an agent scenario, I would have messages like these, e.g.:

"messages": [
    {
        "role": "user",
        "content": [{"type": "text", "text": step_1}],
    },
    {
        "type": "image_url",
        "image_url": {
            "url": f"data:image/jpeg;base64,{retrieve_screenshot()}"
        },
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": action_step_1}]
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": [past_steps, new_step]}],
    },
    {
        "type": "image_url",
        "image_url": {
            "url": f"data:image/jpeg;base64,{retrieve_screenshot()}"
        },
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": action_step_2 }]
    },
]

stacked up, right?

I just want to be sure that I'm using this model the way it was trained...
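Concretely, my loop would look roughly like this; just a sketch, assuming the model is served behind an OpenAI-compatible endpoint and screenshots come from the emulator via adb (the endpoint URL, model name, and the action execution are placeholders, not your official API):

# Rough agent-loop sketch (assumptions: OpenAI-compatible server, adb-connected emulator).
import base64
import subprocess
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # hypothetical endpoint

def retrieve_screenshot() -> str:
    # Capture a PNG from the emulator via adb and return it base64-encoded.
    png = subprocess.run(["adb", "exec-out", "screencap", "-p"],
                         capture_output=True, check=True).stdout
    return base64.b64encode(png).decode()

task = "Open the Settings app and enable Wi-Fi"
history = []  # text of past predicted actions

for step in range(10):  # safety cap on the number of steps
    query = task if not history else task + "\nPast actions:\n" + "\n".join(history)
    # NOTE: the real prompt/action-space template for navigation mode should follow the ShowUI README.
    response = client.chat.completions.create(
        model="showlab/ShowUI-2B",  # model name is an assumption
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": query},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{retrieve_screenshot()}"}},
            ],
        }],
    )
    action = response.choices[0].message.content
    history.append(action)
    print(step, action)
    # TODO: parse `action` and execute it, e.g. map CLICK positions to `adb shell input tap x y`.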

Thanks in advance!

Show Lab org

@Maverick17 Hi, could you please refer to this GitHub issue :)

https://github.com/showlab/ShowUI/issues/5

@KevinQHLin

Well, this agent simply cannot work well, because of:

  1. Its small size and the fact that it was trained without any reasoning traces (I mean a comprehensive description of what the VLM sees and how it should pursue the target).
  2. You trained on an invalid JSON format (at least from a pure Python perspective, e.g. json.loads(...) fails on it; see the snippet after this list):
{'action': 'CLICK', 'value': None, 'position': [0.49, 0.42]}
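To illustrate: that string is a Python literal rather than JSON, so a standard JSON parser rejects it (single-quoted keys and None are not valid JSON):

import json

raw = "{'action': 'CLICK', 'value': None, 'position': [0.49, 0.42]}"
json.loads(raw)  # raises json.decoder.JSONDecodeError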

Don't get me wrong, it's a nice language action model that can be adapted to other domains very well, but for a "real" agent, something like a planner (e.g. GPT-4o) is definitely needed...

Show Lab org
  1. I guess the reasoning traces you mention are about CoT. Yes, this version does not include them yet, but they will be important.
  2. For this JSON-like string, what we use is something like this:
action = "{'action': 'CLICK', 'value': None, 'position': [0.49, 0.42]}"
eval(action)

Then you will obtain a dict with the structured information.

Lastly, we really appreciate your helpful comments, which will help us to improve our work :D

@KevinQHLin Yes, I meant CoT.

Sure, I am also aware of the eval method (alternatively, you could use functions from the ast library, e.g. ast.literal_eval). However, I think it's a bad design choice because you are limited to the Python ecosystem. What about other programming languages?
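For example, one workaround (my own suggestion, not part of the released code) is to normalize the output once in Python and re-emit it as proper JSON, so consumers in other languages can parse it:

import ast, json

raw = "{'action': 'CLICK', 'value': None, 'position': [0.49, 0.42]}"
action = ast.literal_eval(raw)   # safer than eval() for untrusted literals
print(json.dumps(action))        # {"action": "CLICK", "value": null, "position": [0.49, 0.42]}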

Have you also thought about training a model from scratch? I mean something like the TinyLlava approach?
