An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels • Paper • 2406.09415 • Published 28 days ago • 47
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities • Paper • 2406.09406 • Published 28 days ago • 12
VideoGUI: A Benchmark for GUI Automation from Instructional Videos • Paper • 2406.10227 • Published 27 days ago • 8
What If We Recaption Billions of Web Images with LLaMA-3? • Paper • 2406.08478 • Published 29 days ago • 38