Agent Loop

by Maverick17 - opened Dec 4, 2024

Dec 4, 2024

Hey,

How is the model supposed to work in UI navigation mode within a typical agent loop? Should I provide the past steps and the image as the agent moves along a trajectory? Could you please provide an MVP example for Android?

KevinQHLin

Show Lab org Dec 5, 2024

edited Dec 5, 2024

@Maverick17 Hi, yes!

You just need to provide the outputs of the past steps as the history for the current step.
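The idea above can be sketched as a small loop. This is only a minimal sketch, not the ShowUI reference implementation: `run_agent_loop`, `get_screenshot`, `predict_action`, and `execute_action` are hypothetical names, and using `None` as a stop signal is an assumption made for illustration.

```python
from typing import Callable, List, Optional


def run_agent_loop(
    task: str,
    get_screenshot: Callable[[], str],
    predict_action: Callable[[str, str, List[str]], Optional[str]],
    execute_action: Callable[[str], None],
    max_steps: int = 10,
) -> List[str]:
    """Feed each past model output back in as history for the next step.

    All callables are hypothetical placeholders supplied by the caller:
    - get_screenshot returns the current screen (e.g. a base64 string),
    - predict_action wraps the model call and returns None when done,
    - execute_action applies the action on the device or emulator.
    """
    history: List[str] = []
    for _ in range(max_steps):
        screenshot = get_screenshot()
        # The model sees the task, the fresh screenshot, and all past outputs.
        action = predict_action(task, screenshot, history)
        if action is None:  # assumed stop signal, not part of the model's format
            break
        execute_action(action)
        history.append(action)
    return history
```

For Android, `get_screenshot` and `execute_action` would wrap whatever device bridge you use; that part is left out here on purpose.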

By the way, what do you mean by "mvp"?

Maverick17

Dec 5, 2024

Hello @KevinQHLin ,

mvp stands for "Minimum Viable Product", so in your case it is basically a runnable code snippet for the agent (for example, mobile on an Android emulator)...

So, in an agent scenario, I would have these messages, e.g.:

"messages": [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": step_1},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{retrieve_screenshot()}"
                },
            },
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": action_step_1}],
    },
    {
        "role": "user",
        "content": [
            # history and the new step combined into one text string
            {"type": "text", "text": f"{past_steps}\n{new_step}"},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{retrieve_screenshot()}"
                },
            },
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": action_step_2}],
    },
]

stacked up, right?

I just want to be sure that I'm using this model the way it has been trained...

Thanks in advance!

Maverick17 changed discussion status to closed Dec 5, 2024

Maverick17 changed discussion status to open Dec 5, 2024

KevinQHLin

Show Lab org Dec 5, 2024

@Maverick17 Hi, could you please refer to this GitHub issue :)

https://github.com/showlab/ShowUI/issues/5

Maverick17

Dec 5, 2024

@KevinQHLin

Well, but this agent simply cannot work well, because of:

1. Its small size, and the fact that you guys trained without any reasoning traces (I mean a comprehensive description of what the VLM sees and how it should pursue the target).
2. You successfully trained the wrong JSON format (at least from the pure Python perspective, e.g. json.loads(..)):

{'action': 'CLICK', 'value': None, 'position': [0.49, 0.42]}
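To make the point concrete, here is a small check using only the standard library. The helper name `is_valid_json` is made up for this illustration; the string is the model output quoted above. It fails because JSON requires double-quoted strings and `null` rather than `None`.

```python
import json

# Model output exactly as quoted above: a Python dict literal, not JSON.
raw = "{'action': 'CLICK', 'value': None, 'position': [0.49, 0.42]}"


def is_valid_json(text: str) -> bool:
    """Return True only if json.loads accepts the text."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# The JSON equivalent uses double quotes and null.
as_json = '{"action": "CLICK", "value": null, "position": [0.49, 0.42]}'
```

Here `is_valid_json(raw)` is False, while `is_valid_json(as_json)` is True.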

Don't get me wrong, it's a nice language action model that can be adapted to other domains very well, but for a "real" agent, something like a planner (GPT-4o) is definitely needed...

KevinQHLin

Show Lab org Dec 6, 2024

I guess the reasoning traces you mention refer to CoT. Yep, this version does not include it yet, but it will be important.
For this JSON string, this is what we use:

action = "{'action': 'CLICK', 'value': None, 'position': [0.49, 0.42]}"
eval(action)

Then you will obtain a dict with structured information.

Lastly, we really appreciate your helpful comments, which will help us to improve our work :D

Maverick17

Dec 6, 2024

@KevinQHLin Yes, I did mention the CoT.

Sure, I am also aware of the eval method (alternatively, you could call functions from the ast library). However, I think it's a bad design choice because you are limited to the Python ecosystem. What about other programming languages?
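For reference, the ast route mentioned above could look roughly like this. It is a sketch under the assumption that the model always emits a single Python dict literal; `action_to_json` is a hypothetical helper name. `ast.literal_eval` only accepts literals, so unlike `eval` it will not execute arbitrary code, and `json.dumps` then produces a string that non-Python consumers can parse.

```python
import ast
import json


def action_to_json(raw: str) -> str:
    """Parse a Python-literal action string and re-serialize it as JSON."""
    parsed = ast.literal_eval(raw)  # literals only, no code execution
    if not isinstance(parsed, dict):
        raise ValueError("expected a dict literal from the model")
    return json.dumps(parsed)  # None becomes null, quotes become double


raw = "{'action': 'CLICK', 'value': None, 'position': [0.49, 0.42]}"
converted = action_to_json(raw)
# converted == '{"action": "CLICK", "value": null, "position": [0.49, 0.42]}'
```

This still needs a Python step somewhere in the pipeline, so it works around the format issue rather than removing it.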

Have you guys also thought about training a model from scratch? I mean something like the TinyLLaVA approach?
