- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites — arXiv:2404.16821 (Apr 2024)
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone — arXiv:2404.14219 (Apr 2024)
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD — arXiv:2404.06512 (Apr 9, 2024)
- Are We on the Right Way for Evaluating Large Vision-Language Models? — arXiv:2403.20330 (Mar 29, 2024)
- InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model — arXiv:2401.16420 (Jan 29, 2024)
- FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition — arXiv:2312.07536 (Dec 12, 2023)
- Alpha-CLIP: A CLIP Model Focusing on Wherever You Want — arXiv:2312.03818 (Dec 6, 2023)
- GPT4Point: A Unified Framework for Point-Language Understanding and Generation — arXiv:2312.02980 (Dec 5, 2023)
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions — arXiv:2311.12793 (Nov 21, 2023)