A team from Tsinghua University just released AndroidLab, the first systematic framework for evaluating and training Android mobile agents that supports both text-only and multimodal models.
They show that fine-tuning small open-source models significantly boosts performance, nearly matching much larger closed models like GPT-4o.
The team built:
📊 A reproducible benchmark with 138 tasks across 9 apps to evaluate mobile agents systematically
📱 A framework supporting both text-only (via XML) and visual (via marked screenshots) interfaces, sketched below
✅ An instruction dataset of 10.5k operation traces for training mobile agents
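Here's a rough sketch of what supporting both interface modes could look like in practice. Everything in it (the Observation class, build_prompt, the message format) is my own illustrative guess, not AndroidLab's actual API:

```python
# Rough sketch of the two observation modes -- all names here
# (Observation, build_prompt, the message dicts) are hypothetical
# illustrations, not AndroidLab's actual API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Observation:
    xml: Optional[str] = None           # UI hierarchy dump, for text-only models
    screenshot: Optional[bytes] = None  # marked screenshot, for multimodal models

def build_prompt(obs: Observation, task: str) -> list:
    """Build a chat message from an observation, depending on the mode."""
    if obs.xml is not None:
        # Text-only mode: the model reads an XML view of the current screen.
        return [{"role": "user",
                 "content": f"Task: {task}\nScreen (XML):\n{obs.xml}"}]
    # Multimodal mode: the model sees a screenshot with UI elements marked.
    return [{"role": "user", "content": [
        {"type": "text", "text": f"Task: {task}"},
        {"type": "image", "image": obs.screenshot},
    ]}]
```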
Key insights:
- 📈 Fine-tuning improves performance BY A LOT: open-source Llama-3.1-8B jumps from a 2% to a 24% success rate after training, nearly reaching GPT-4o performance despite being much smaller (see the data sketch after this list).
- ⚖️ Text-only agents match multimodal ones: XML-based agents achieve performance similar to screenshot-based multimodal agents.
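To make the fine-tuning insight concrete, here's a minimal, purely hypothetical sketch of how one operation-trace step might be turned into a supervised training pair. The field names and the to_sft_example helper are assumptions for illustration, not the paper's actual schema:

```python
# A hypothetical single step from an operation trace -- field names are
# illustrative guesses, not the paper's actual data schema.
step = {
    "task": "Create a 9:00 AM meeting in the calendar app",
    "observation_xml": "<node class='Button' text='New event'/>",
    "action": {"type": "tap", "element_id": 42},
}

def to_sft_example(step: dict) -> dict:
    """Turn one trace step into a (prompt, completion) fine-tuning pair."""
    prompt = (
        f"Task: {step['task']}\n"
        f"Screen (XML):\n{step['observation_xml']}\n"
        "Next action?"
    )
    completion = f"tap(element_id={step['action']['element_id']})"
    return {"prompt": prompt, "completion": completion}

print(to_sft_example(step))
```

Training a small model on thousands of such pairs is the kind of supervised setup that could produce the 2%-to-24% jump reported above.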
I just had a masterclass in open-source collaboration with the release of Llama 3.1 🦙🤗
Meta dropped Llama 3.1, and watching the Hugging Face team work to integrate it firsthand was nothing short of impressive. Their swift integration, comprehensive documentation, and innovative tools showcase the power of open-source teamwork.