Agent Loop
Hey,
how is the model in UI navigation mode supposed to work in a typical agent loop? Should I provide the past steps and the image while the agent moves along a trajectory? Could you please provide an MVP example for Android?
@Maverick17 Hi, yes!
All you need to do is provide the outputs of the past steps as the history for the current step.
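A minimal sketch of that idea, assuming nothing about the real inference or device APIs: `predict_action`, `take_screenshot`, `execute`, and `is_done` are hypothetical callables you would supply yourself (for example, wrappers around the model and an Android emulator). It only shows how each past output is appended to the history that is passed to the next step.

```python
from typing import Callable, List


def run_agent_loop(
    task: str,
    predict_action: Callable[[str, List[str], str], str],  # hypothetical model wrapper
    take_screenshot: Callable[[], str],                     # hypothetical, returns base64 image
    execute: Callable[[str], None],                         # hypothetical device driver
    is_done: Callable[[str], bool],                         # hypothetical stop check
    max_steps: int = 10,
) -> List[str]:
    """Run a loop in which every past model output becomes history for the next step."""
    history: List[str] = []
    for _ in range(max_steps):
        screenshot_b64 = take_screenshot()
        # The model sees the task, all previous outputs, and the current screen.
        action = predict_action(task, history, screenshot_b64)
        history.append(action)
        if is_done(action):
            break
        execute(action)
    return history
```

How the history is serialized into the prompt depends on how the model was trained, so treat the argument layout above as a placeholder.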
Btw, what do you mean by `mvp`?
Hello @KevinQHLin ,
MVP stands for "Minimum Viable Product", so in your case it is basically a runnable code snippet for the agent (for example, a mobile agent on an Android emulator)...
So, in an agent scenario, I would have these messages, e.g.:
"messages": [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": step_1},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{retrieve_screenshot()}"
                },
            },
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": action_step_1}],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": [past_steps, new_step]},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{retrieve_screenshot()}"
                },
            },
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": action_step_2}],
    },
]
stacked up, right?
I just want to be sure that I'm using this model the way it has been trained...
Thanks in advance!
@Maverick17 Hi, could you please refer to this GitHub issue :)
Well, but this agent simply cannot work well, because of:
- Its small size and the fact that you trained it without any reasoning traces (I mean a comprehensive description of what the VLM sees and how it should pursue the target)
- You trained it on an invalid JSON format (at least from a pure Python perspective, e.g. json.loads(..) fails on it):
{'action': 'CLICK', 'value': None, 'position': [0.49, 0.42]}
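To make the point concrete, here is a small check using only the standard library. The helper name `is_valid_json` is just for illustration; the first string is the model output quoted above, and the second is a hand-written JSON equivalent for comparison.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Model output as quoted above: single quotes and None are Python syntax, not JSON.
raw = "{'action': 'CLICK', 'value': None, 'position': [0.49, 0.42]}"
# Hand-written JSON equivalent: double quotes and null.
valid = '{"action": "CLICK", "value": null, "position": [0.49, 0.42]}'

print(is_valid_json(raw))    # False
print(is_valid_json(valid))  # True
```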
Don't get me wrong, it's a nice language-action model that can be adapted to other domains very well, but for a "real" agent, something like a planner (GPT-4o) is definitely needed...
- I guess the reasoning traces you mention refer to CoT. Yes, this version does not include it yet, but it will be important.
- For this JSON-like string, this is how we parse it:
action = "{'action': 'CLICK', 'value': None, 'position': [0.49, 0.42]}"
eval(action)
Then you will obtain a dict with the structured information.
Lastly, we really appreciate your helpful comments, which will help us to improve our work :D
@KevinQHLin Yes, I did mention the CoT.
Sure, I am also aware of the eval method (alternatively, you could call functions from the ast lib). However, I think it's a bad design choice because it limits you to the Python ecosystem. What about other programming languages?
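As a possible workaround on the Python side, a thin adapter could parse the output with `ast.literal_eval` (which only accepts literals, unlike `eval`) and re-emit it as real JSON for consumers written in other languages. This is only a sketch; `action_to_json` is a made-up helper name, not part of the model's code.

```python
import ast
import json


def action_to_json(model_output: str) -> str:
    """Parse a Python-literal action string and re-serialize it as JSON."""
    # literal_eval evaluates only literals (dicts, lists, strings, numbers, None, ...),
    # so arbitrary code in the model output is rejected instead of executed.
    parsed = ast.literal_eval(model_output)
    if not isinstance(parsed, dict):
        raise ValueError("expected a dict-like action")
    return json.dumps(parsed)


raw = "{'action': 'CLICK', 'value': None, 'position': [0.49, 0.42]}"
print(action_to_json(raw))
# {"action": "CLICK", "value": null, "position": [0.49, 0.42]}
```

This does not remove the dependency on Python for the first parsing step; it only confines it to one place.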
Have you also thought about training a model from scratch? I mean something like the TinyLLaVA approach?