Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models Paper • 2402.03749 • Published Feb 6 • 9
ScreenAI: A Vision-Language Model for UI and Infographics Understanding Paper • 2402.04615 • Published Feb 7 • 33
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss Paper • 2402.05008 • Published Feb 7 • 19
WebLINX: Real-World Website Navigation with Multi-Turn Dialogue Paper • 2402.05930 • Published Feb 8 • 35
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models Paper • 2402.05935 • Published Feb 8 • 12
ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling Paper • 2402.06118 • Published Feb 9 • 13
OS-Copilot: Towards Generalist Computer Agents with Self-Improvement Paper • 2402.07456 • Published Feb 12 • 39
PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs Paper • 2402.07872 • Published Feb 12 • 14
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models Paper • 2402.07865 • Published Feb 12 • 11
World Model on Million-Length Video And Language With RingAttention Paper • 2402.08268 • Published Feb 13 • 35
PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter Paper • 2402.10896 • Published Feb 16 • 14
FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models Paper • 2402.10986 • Published Feb 16 • 74
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling Paper • 2402.12226 • Published Feb 19 • 37
Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning Paper • 2402.11690 • Published Feb 18 • 6
VideoPrism: A Foundational Visual Encoder for Video Understanding Paper • 2402.13217 • Published Feb 20 • 19
A Touch, Vision, and Language Dataset for Multimodal Alignment Paper • 2402.13232 • Published Feb 20 • 11
How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts Paper • 2402.13220 • Published Feb 20 • 12
BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models Paper • 2402.13577 • Published Feb 21 • 5
TinyLLaVA: A Framework of Small-scale Large Multimodal Models Paper • 2402.14289 • Published Feb 22 • 17
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models Paper • 2402.17177 • Published Feb 27 • 88
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers Paper • 2402.19479 • Published Feb 29 • 30
MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies Paper • 2403.01422 • Published Mar 3 • 24
InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding Paper • 2403.01487 • Published Mar 3 • 14
Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters Paper • 2403.02677 • Published Mar 5 • 16
Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use Paper • 2403.02626 • Published Mar 5 • 9
MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets Paper • 2403.03194 • Published Mar 5 • 11
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models Paper • 2403.03003 • Published Mar 5 • 8
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training Paper • 2403.09611 • Published Mar 14 • 123
MoAI: Mixture of All Intelligence for Large Language and Vision Models Paper • 2403.07508 • Published Mar 12 • 73
Synth^2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings Paper • 2403.07750 • Published Mar 12 • 19
DragAnything: Motion Control for Anything using Entity Representation Paper • 2403.07420 • Published Mar 12 • 12
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models Paper • 2403.06764 • Published Mar 11 • 24
VideoMamba: State Space Model for Efficient Video Understanding Paper • 2403.06977 • Published Mar 11 • 24
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment Paper • 2403.05135 • Published Mar 8 • 40
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context Paper • 2403.05530 • Published Mar 8 • 51
DeepSeek-VL: Towards Real-World Vision-Language Understanding Paper • 2403.05525 • Published Mar 8 • 39
VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models Paper • 2403.05438 • Published Mar 8 • 15
Uni-SMART: Universal Science Multimodal Analysis and Research Transformer Paper • 2403.10301 • Published Mar 15 • 50
VideoAgent: Long-form Video Understanding with Large Language Model as Agent Paper • 2403.10517 • Published Mar 15 • 28
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images Paper • 2403.11703 • Published Mar 18 • 13
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding Paper • 2403.11481 • Published Mar 18 • 10
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding Paper • 2403.12895 • Published Mar 19 • 28
Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs Paper • 2403.12596 • Published Mar 19 • 9
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? Paper • 2403.14624 • Published Mar 21 • 50
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding Paper • 2403.15377 • Published Mar 22 • 17
SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series Paper • 2403.15360 • Published Mar 22 • 11
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models Paper • 2403.18814 • Published Mar 27 • 42
TextCraftor: Your Text Encoder Can be Image Quality Controller Paper • 2403.18978 • Published Mar 27 • 12
Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models Paper • 2403.20331 • Published Mar 29 • 14
Getting it Right: Improving Spatial Consistency in Text-to-Image Models Paper • 2404.01197 • Published Apr 1 • 29
Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward Paper • 2404.01258 • Published Apr 1 • 10
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens Paper • 2404.03413 • Published Apr 4 • 22
LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models Paper • 2404.03118 • Published Apr 3 • 19
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching Paper • 2404.03653 • Published Apr 4 • 29
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs Paper • 2404.05719 • Published Apr 8 • 58
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding Paper • 2404.05726 • Published Apr 8 • 18
MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation Paper • 2404.05674 • Published Apr 8 • 12
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD Paper • 2404.06512 • Published Apr 9 • 29
BRAVE: Broadening the visual encoding of vision-language models Paper • 2404.07204 • Published Apr 10 • 14
Transferable and Principled Efficiency for Open-Vocabulary Segmentation Paper • 2404.07448 • Published Apr 11 • 10
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models Paper • 2404.07973 • Published Apr 11 • 28
HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing Paper • 2404.09990 • Published Apr 15 • 11
TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models Paper • 2404.09204 • Published Apr 14 • 10
On Speculative Decoding for Multimodal Large Language Models Paper • 2404.08856 • Published Apr 13 • 12
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models Paper • 2404.12387 • Published Apr 18 • 36
BLINK: Multimodal Large Language Models Can See but Not Perceive Paper • 2404.12390 • Published Apr 18 • 23
MultiBooth: Towards Generating All Your Concepts in an Image from Text Paper • 2404.14239 • Published Apr 22 • 8
TextSquare: Scaling up Text-Centric Visual Instruction Tuning Paper • 2404.12803 • Published Apr 19 • 28
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models Paper • 2404.13013 • Published Apr 19 • 27
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data Paper • 2404.15653 • Published Apr 24 • 25
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension Paper • 2404.16790 • Published Apr 25 • 7
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites Paper • 2404.16821 • Published Apr 25 • 49
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs Paper • 2404.16375 • Published Apr 25 • 16
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning Paper • 2404.16994 • Published Apr 25 • 33
HaLo-NeRF: Learning Geometry-Guided Semantics for Exploring Unconstrained Photo Collections Paper • 2404.16845 • Published Feb 14 • 6
BlenderAlchemy: Editing 3D Graphics with Vision-Language Models Paper • 2404.17672 • Published Apr 26 • 18
Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations Paper • 2404.17521 • Published Apr 26 • 12
Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots Paper • 2405.07990 • Published May 13 • 15
No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding Paper • 2405.08344 • Published May 14 • 10
Understanding the performance gap between online and offline alignment algorithms Paper • 2405.08448 • Published May 14 • 11
SpeechVerse: A Large-scale Generalizable Audio Language Model Paper • 2405.08295 • Published May 14 • 11
SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models Paper • 2405.08317 • Published May 14 • 8
Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model Paper • 2405.09215 • Published May 15 • 14
Many-Shot In-Context Learning in Multimodal Foundation Models Paper • 2405.09798 • Published May 16 • 25
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection Paper • 2405.10300 • Published May 16 • 25
Imp: Highly Capable Large Multimodal Models for Mobile Devices Paper • 2405.12107 • Published May 20 • 23
Diffusion for World Modeling: Visual Details Matter in Atari Paper • 2405.12399 • Published May 20 • 25
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability Paper • 2405.14129 • Published May 23 • 9
CamViG: Camera Aware Image-to-Video Generation with Multimodal Transformers Paper • 2405.13195 • Published May 21 • 8
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models Paper • 2405.15574 • Published May 24 • 52
Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition Paper • 2405.15216 • Published May 24 • 11
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models Paper • 2405.17428 • Published May 27 • 14
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models Paper • 2405.15738 • Published May 24 • 43
Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation Paper • 2405.14598 • Published May 23 • 11
Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities Paper • 2405.18669 • Published May 29 • 11
MotionLLM: Understanding Human Behaviors from Human Motions and Videos Paper • 2405.20340 • Published May 30 • 19
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis Paper • 2405.21075 • Published May 31 • 15
Show, Don't Tell: Aligning Language Models with Demonstrated Feedback Paper • 2406.00888 • Published Jun 2 • 29
PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM Paper • 2406.02884 • Published Jun 4 • 13
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions Paper • 2406.04325 • Published Jun 6 • 69
AgentGym: Evolving Large Language Model-based Agents across Diverse Environments Paper • 2406.04151 • Published Jun 5 • 14
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration Paper • 2406.01014 • Published Jun 2 • 29
An Image is Worth 32 Tokens for Reconstruction and Generation Paper • 2406.07550 • Published Jun 11 • 53
AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising Paper • 2406.06911 • Published Jun 10 • 10
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs Paper • 2406.07476 • Published Jun 10 • 30
What If We Recaption Billions of Web Images with LLaMA-3? Paper • 2406.08478 • Published Jun 12 • 38
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos Paper • 2406.08407 • Published Jun 11 • 24
mDPO: Conditional Preference Optimization for Multimodal Large Language Models Paper • 2406.11839 • Published Jun 17 • 36
VideoLLM-online: Online Video Large Language Model for Streaming Video Paper • 2406.11816 • Published Jun 17 • 20
TroL: Traversal of Layers for Large Language and Vision Models Paper • 2406.12246 • Published Jun 17 • 33
VoCo-LLaMA: Towards Vision Compression with Large Language Models Paper • 2406.12275 • Published Jun 17 • 28
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning Paper • 2406.12742 • Published Jun 17 • 14
Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models Paper • 2406.11230 • Published Jun 16 • 34
Probabilistic Conceptual Explainers: Trustworthy Conceptual Explanations for Vision Foundation Models Paper • 2406.12649 • Published Jun 17 • 15
Understanding Hallucinations in Diffusion Models through Mode Interpolation Paper • 2406.09358 • Published Jun 13 • 4
CMC-Bench: Towards a New Paradigm of Visual Signal Compression Paper • 2406.09356 • Published Jun 12 • 4
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities Paper • 2406.09406 • Published Jun 13 • 12
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models Paper • 2406.09403 • Published Jun 13 • 17
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding Paper • 2406.09411 • Published Jun 13 • 17
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus Paper • 2406.08707 • Published Jun 12 • 14
EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts Paper • 2406.09162 • Published Jun 12 • 13
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text Paper • 2406.08418 • Published Jun 11 • 28
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices Paper • 2406.08451 • Published Jun 12 • 23
NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing Paper • 2406.06523 • Published Jun 10 • 48
Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models Paper • 2406.08487 • Published Jun 12 • 10
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels Paper • 2406.09415 • Published Jun 13 • 47
DiTFastAttn: Attention Compression for Diffusion Transformer Models Paper • 2406.08552 • Published Jun 12 • 20
Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion Paper • 2406.04338 • Published Jun 6 • 32
Hibou: A Family of Foundational Vision Transformers for Pathology Paper • 2406.05074 • Published Jun 6 • 6
Make It Count: Text-to-Image Generation with an Accurate Number of Objects Paper • 2406.10210 • Published Jun 14 • 74
XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning Paper • 2406.08973 • Published Jun 12 • 85
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs Paper • 2406.11833 • Published Jun 17 • 60
Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models Paper • 2406.11831 • Published Jun 17 • 19
From Pixels to Prose: A Large Dataset of Dense Image Captions Paper • 2406.10328 • Published Jun 14 • 16
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs Paper • 2406.14544 • Published Jun 20 • 33
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences Paper • 2406.11069 • Published Jun 16 • 11
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens Paper • 2406.11271 • Published Jun 16 • 10
Unifying Multimodal Retrieval via Document Screenshot Embedding Paper • 2406.11251 • Published Jun 16 • 6
The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing Paper • 2406.10601 • Published Jun 14 • 65
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding Paper • 2406.14515 • Published Jun 19 • 27
Two Giraffes in a Dirt Field: Using Game Play to Investigate Situation Modelling in Large Multimodal Models Paper • 2406.14035 • Published Jun 19 • 10
ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights Paper • 2406.14596 • Published Jun 20 • 4
Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report Paper • 2406.11403 • Published Jun 16 • 4
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models Paper • 2406.16338 • Published Jun 23 • 22
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs Paper • 2406.16860 • Published Jun 24 • 48
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning Paper • 2406.17770 • Published Jun 25 • 18
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models Paper • 2406.15704 • Published Jun 21 • 5
Octo-planner: On-device Language Model for Planner-Action Agents Paper • 2406.18082 • Published Jun 25 • 45
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs Paper • 2406.18521 • Published Jun 26 • 25
Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning Paper • 2406.15334 • Published Jun 21 • 8
Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models Paper • 2406.17294 • Published Jun 24 • 8
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding Paper • 2406.19389 • Published Jun 27 • 49
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs Paper • 2406.18629 • Published Jun 26 • 36
MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data Paper • 2406.18790 • Published Jun 26 • 31
Simulating Classroom Education with LLM-Empowered Agents Paper • 2406.19226 • Published Jun 26 • 27
AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models Paper • 2406.10900 • Published Jun 15 • 11
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy Paper • 2406.20095 • Published Jun 28 • 16
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model Paper • 2406.20076 • Published Jun 27 • 6
Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity Paper • 2406.17720 • Published Jun 24 • 6
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? Paper • 2407.01284 • Published Jul 1 • 60