TinyLLaVA: A Framework of Small-scale Large Multimodal Models • arXiv:2402.14289 • Published Feb 22, 2024 • 19 upvotes
DocLLM: A layout-aware generative language model for multimodal document understanding • arXiv:2401.00908 • Published Dec 31, 2023 • 178 upvotes
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts • arXiv:2206.02770 • Published Jun 6, 2022 • 3 upvotes
MoAI: Mixture of All Intelligence for Large Language and Vision Models • arXiv:2403.07508 • Published Mar 12, 2024 • 75 upvotes
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality • arXiv:2304.14178 • Published Apr 27, 2023 • 2 upvotes
Gemini: A Family of Highly Capable Multimodal Models • arXiv:2312.11805 • Published Dec 19, 2023 • 45 upvotes
Flamingo: a Visual Language Model for Few-Shot Learning • arXiv:2204.14198 • Published Apr 29, 2022 • 14 upvotes
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling • arXiv:2402.12226 • Published Feb 19, 2024 • 40 upvotes
Unifying Vision, Text, and Layout for Universal Document Processing • arXiv:2212.02623 • Published Dec 5, 2022 • 10 upvotes
Uni-SMART: Universal Science Multimodal Analysis and Research Transformer • arXiv:2403.10301 • Published Mar 15, 2024 • 51 upvotes
VideoAgent: Long-form Video Understanding with Large Language Model as Agent • arXiv:2403.10517 • Published Mar 15, 2024 • 30 upvotes
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images • arXiv:2403.11703 • Published Mar 18, 2024 • 16 upvotes
TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation • arXiv:2403.12906 • Published Mar 19, 2024 • 4 upvotes
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models • arXiv:2403.13447 • Published Mar 20, 2024 • 17 upvotes
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? • arXiv:2403.14624 • Published Mar 21, 2024 • 50 upvotes
A Multimodal Approach to Device-Directed Speech Detection with Large Language Models • arXiv:2403.14438 • Published Mar 21, 2024 • 2 upvotes
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding • arXiv:2403.15377 • Published Mar 22, 2024 • 21 upvotes
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models • arXiv:2403.18814 • Published Mar 27, 2024 • 44 upvotes
FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction • arXiv:2305.02549 • Published May 4, 2023 • 6 upvotes
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities • arXiv:2308.12966 • Published Aug 24, 2023 • 6 upvotes
LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models • arXiv:2404.03118 • Published Apr 3, 2024 • 23 upvotes
No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance • arXiv:2404.04125 • Published Apr 4, 2024 • 27 upvotes
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs • arXiv:2404.05719 • Published Apr 8, 2024 • 62 upvotes
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models • arXiv:2404.07973 • Published Apr 11, 2024 • 30 upvotes
BLINK: Multimodal Large Language Models Can See but Not Perceive • arXiv:2404.12390 • Published Apr 18, 2024 • 24 upvotes
Data curation via joint example selection further accelerates multimodal learning • arXiv:2406.17711 • Published Jun 25, 2024 • 3 upvotes
The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective • arXiv:2407.08583 • Published Jul 11, 2024 • 10 upvotes
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model • arXiv:2407.07053 • Published Jul 9, 2024 • 41 upvotes
VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models • arXiv:2407.11691 • Published Jul 16, 2024 • 13 upvotes
Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling • arXiv:2408.03695 • Published Aug 7, 2024 • 11 upvotes
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models • arXiv:2408.04840 • Published Aug 9, 2024 • 31 upvotes
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models • arXiv:2408.08872 • Published Aug 16, 2024 • 96 upvotes