Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models Paper • 2402.03749 • Published Feb 6 • 9
ScreenAI: A Vision-Language Model for UI and Infographics Understanding Paper • 2402.04615 • Published Feb 7 • 31
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss Paper • 2402.05008 • Published Feb 7 • 18
WebLINX: Real-World Website Navigation with Multi-Turn Dialogue Paper • 2402.05930 • Published Feb 8 • 35
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models Paper • 2402.05935 • Published Feb 8 • 11
ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling Paper • 2402.06118 • Published Feb 9 • 12
OS-Copilot: Towards Generalist Computer Agents with Self-Improvement Paper • 2402.07456 • Published Feb 12 • 38
PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs Paper • 2402.07872 • Published Feb 12 • 14
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models Paper • 2402.07865 • Published Feb 12 • 11
World Model on Million-Length Video And Language With RingAttention Paper • 2402.08268 • Published Feb 13 • 33
PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter Paper • 2402.10896 • Published Feb 16 • 13
FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models Paper • 2402.10986 • Published Feb 16 • 73
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling Paper • 2402.12226 • Published Feb 19 • 37
Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning Paper • 2402.11690 • Published Feb 18 • 6
VideoPrism: A Foundational Visual Encoder for Video Understanding Paper • 2402.13217 • Published Feb 20 • 18
A Touch, Vision, and Language Dataset for Multimodal Alignment Paper • 2402.13232 • Published Feb 20 • 11
How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts Paper • 2402.13220 • Published Feb 20 • 12
BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models Paper • 2402.13577 • Published Feb 21 • 5
TinyLLaVA: A Framework of Small-scale Large Multimodal Models Paper • 2402.14289 • Published Feb 22 • 16
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models Paper • 2402.17177 • Published Feb 27 • 87
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers Paper • 2402.19479 • Published Feb 29 • 30
MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies Paper • 2403.01422 • Published Mar 3 • 24
InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding Paper • 2403.01487 • Published Mar 3 • 14
Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters Paper • 2403.02677 • Published Mar 5 • 16
Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use Paper • 2403.02626 • Published Mar 5 • 9
MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets Paper • 2403.03194 • Published Mar 5 • 11
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models Paper • 2403.03003 • Published Mar 5 • 8
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training Paper • 2403.09611 • Published Mar 14 • 119
MoAI: Mixture of All Intelligence for Large Language and Vision Models Paper • 2403.07508 • Published Mar 12 • 69
Synth^2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings Paper • 2403.07750 • Published Mar 12 • 19
DragAnything: Motion Control for Anything using Entity Representation Paper • 2403.07420 • Published Mar 12 • 11
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models Paper • 2403.06764 • Published Mar 11 • 24
VideoMamba: State Space Model for Efficient Video Understanding Paper • 2403.06977 • Published Mar 11 • 21
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment Paper • 2403.05135 • Published Mar 8 • 39
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context Paper • 2403.05530 • Published Mar 8 • 49
DeepSeek-VL: Towards Real-World Vision-Language Understanding Paper • 2403.05525 • Published Mar 8 • 36
VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models Paper • 2403.05438 • Published Mar 8 • 14
Uni-SMART: Universal Science Multimodal Analysis and Research Transformer Paper • 2403.10301 • Published Mar 15 • 50
VideoAgent: Long-form Video Understanding with Large Language Model as Agent Paper • 2403.10517 • Published Mar 15 • 28
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images Paper • 2403.11703 • Published Mar 18 • 13
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding Paper • 2403.11481 • Published Mar 18 • 10
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding Paper • 2403.12895 • Published Mar 19 • 27
Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs Paper • 2403.12596 • Published Mar 19 • 9
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? Paper • 2403.14624 • Published Mar 21 • 49
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding Paper • 2403.15377 • Published Mar 22 • 16
SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series Paper • 2403.15360 • Published Mar 22 • 11
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models Paper • 2403.18814 • Published Mar 27 • 37
TextCraftor: Your Text Encoder Can be Image Quality Controller Paper • 2403.18978 • Published Mar 27 • 12
Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models Paper • 2403.20331 • Published Mar 29 • 14
Getting it Right: Improving Spatial Consistency in Text-to-Image Models Paper • 2404.01197 • Published Apr 1 • 29
Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward Paper • 2404.01258 • Published Apr 1 • 10
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens Paper • 2404.03413 • Published Apr 4 • 21
LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models Paper • 2404.03118 • Published Apr 3 • 17
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching Paper • 2404.03653 • Published Apr 4 • 28
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs Paper • 2404.05719 • Published Apr 8 • 56
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding Paper • 2404.05726 • Published Apr 8 • 18
MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation Paper • 2404.05674 • Published Apr 8 • 9
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD Paper • 2404.06512 • Published Apr 9 • 29
BRAVE: Broadening the visual encoding of vision-language models Paper • 2404.07204 • Published Apr 10 • 14
Transferable and Principled Efficiency for Open-Vocabulary Segmentation Paper • 2404.07448 • Published Apr 10 • 8
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models Paper • 2404.07973 • Published Apr 11 • 28
HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing Paper • 2404.09990 • Published Apr 15 • 11
TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models Paper • 2404.09204 • Published Apr 14 • 10
On Speculative Decoding for Multimodal Large Language Models Paper • 2404.08856 • Published Apr 13 • 9
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models Paper • 2404.12387 • Published Apr 18 • 34
BLINK: Multimodal Large Language Models Can See but Not Perceive Paper • 2404.12390 • Published Apr 18 • 23
MultiBooth: Towards Generating All Your Concepts in an Image from Text Paper • 2404.14239 • Published Apr 22 • 7
TextSquare: Scaling up Text-Centric Visual Instruction Tuning Paper • 2404.12803 • Published Apr 19 • 27
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models Paper • 2404.13013 • Published Apr 19 • 26
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data Paper • 2404.15653 • Published Apr 24 • 24
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension Paper • 2404.16790 • Published Apr 25 • 7
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites Paper • 2404.16821 • Published Apr 25 • 47
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs Paper • 2404.16375 • Published Apr 25 • 14
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning Paper • 2404.16994 • Published Apr 25 • 29
HaLo-NeRF: Learning Geometry-Guided Semantics for Exploring Unconstrained Photo Collections Paper • 2404.16845 • Published Feb 14 • 5
BlenderAlchemy: Editing 3D Graphics with Vision-Language Models Paper • 2404.17672 • Published Apr 26 • 17
Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations Paper • 2404.17521 • Published Apr 26 • 12