UI Referring Expression Task on Donut
Could you maybe explain your project a little further?
Certainly. The goal is to test how well multi-modal vision-language models can adapt to tasks such as UI Referring Expressions. These tasks have previously been addressed by specialized models such as UIBert, seq2act, and pix2struct. I have some preliminary warmup results that can be tested in this space. Training is taking a while on the basic Colab GPUs.
Here is the working draft of the training Colab notebook:
In this notebook, we'll fine-tune Donut (which is an instance of VisionEncoderDecoderModel) on a UI RefExp dataset (UIBert, Rico), which consists of (UI screenshot, prompt, target bounding box) triplets. This way, the model will learn to look at a screenshot image and answer a prompt referring to a UI component. For example: "select the search icon next to the menu drawer". This could be useful for tasks such as converting natural-language app documentation to executable tests, bug reporting, front-end test automation, and app support chat bots.
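To make the triplet format concrete, here is a minimal sketch of how a (prompt, target bounding box) pair might be serialized into Donut-style decoder sequences. The task token names (`<s_refexp>`, `<s_prompt>`, `<s_target_bounding_box>`) and the coordinate-binning scheme are illustrative assumptions, not the exact format the notebook uses:

```python
# Sketch: serializing a UI RefExp training example into Donut-style
# token sequences. Token names and the binning scheme are assumptions
# for illustration, not the notebook's actual vocabulary.

def make_prompt(refexp: str) -> str:
    """Build the decoder prompt for one referring expression."""
    return f"<s_refexp><s_prompt>{refexp}</s_prompt><s_target_bounding_box>"

def serialize_bbox(xmin: float, ymin: float, xmax: float, ymax: float,
                   bins: int = 100) -> str:
    """Quantize normalized [0, 1] box coordinates into integer bins so
    the box can be emitted as a short sequence of plain decoder tokens."""
    def q(v: float) -> int:
        return min(bins - 1, max(0, int(v * bins)))
    coords = (q(xmin), q(ymin), q(xmax), q(ymax))
    return "".join(f"<{c}>" for c in coords) + "</s_target_bounding_box>"

prompt = make_prompt("select the search icon next to the menu drawer")
target = serialize_bbox(0.12, 0.05, 0.25, 0.12)
print(prompt)
print(target)
```

At inference time the model would decode the bounding-box tokens after the prompt, and the bins can be mapped back to pixel coordinates by rescaling with the screenshot's width and height.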
Hope this helps.