Detect objects in images and get bounding boxes
A Foundation Action Model for Generalist GUI Agents
Generate text responses using images and text prompts
Generate clickable coordinates on a screenshot