LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding • arXiv:2306.17107 • Published Jun 29, 2023
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities • arXiv:2308.12966 • Published Aug 24, 2023
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models • arXiv:2401.15947 • Published Jan 29, 2024
DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding • arXiv:2311.11810 • Published Nov 20, 2023
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding • arXiv:2210.03347 • Published Oct 7, 2022
Reading Order Matters: Information Extraction from Visually-rich Documents by Token Path Prediction • arXiv:2310.11016 • Published Oct 17, 2023
Nougat: Neural Optical Understanding for Academic Documents • arXiv:2308.13418 • Published Aug 25, 2023
MoAI: Mixture of All Intelligence for Large Language and Vision Models • arXiv:2403.07508 • Published Mar 12, 2024
ScreenAI: A Vision-Language Model for UI and Infographics Understanding • arXiv:2402.04615 • Published Feb 7, 2024