zzfive's Collections
multimodal
iVideoGPT: Interactive VideoGPTs are Scalable World Models
Paper
•
2405.15223
•
Published
•
12
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Paper
•
2405.15574
•
Published
•
53
An Introduction to Vision-Language Modeling
Paper
•
2405.17247
•
Published
•
85
Matryoshka Multimodal Models
Paper
•
2405.17430
•
Published
•
31
Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities
Paper
•
2405.18669
•
Published
•
11
MotionLLM: Understanding Human Behaviors from Human Motions and Videos
Paper
•
2405.20340
•
Published
•
19
Parrot: Multilingual Visual Instruction Tuning
Paper
•
2406.02539
•
Published
•
35
PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM
Paper
•
2406.02884
•
Published
•
15
What If We Recaption Billions of Web Images with LLaMA-3?
Paper
•
2406.08478
•
Published
•
39
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Paper
•
2406.07476
•
Published
•
32
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
Paper
•
2406.08407
•
Published
•
24
AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation
Paper
•
2406.07686
•
Published
•
14
OpenVLA: An Open-Source Vision-Language-Action Model
Paper
•
2406.09246
•
Published
•
36
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
Paper
•
2406.09403
•
Published
•
19
Explore the Limits of Omni-modal Pretraining at Scale
Paper
•
2406.09412
•
Published
•
10
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
Paper
•
2406.09406
•
Published
•
13
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
Paper
•
2406.09961
•
Published
•
54
Needle In A Multimodal Haystack
Paper
•
2406.07230
•
Published
•
52
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Paper
•
2406.08418
•
Published
•
28
mDPO: Conditional Preference Optimization for Multimodal Large Language Models
Paper
•
2406.11839
•
Published
•
37
LLaNA: Large Language and NeRF Assistant
Paper
•
2406.11840
•
Published
•
17
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
Paper
•
2406.14544
•
Published
•
34
PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents
Paper
•
2406.13923
•
Published
•
21
Improving Visual Commonsense in Language Models via Multiple Image Generation
Paper
•
2406.13621
•
Published
•
13
Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report
Paper
•
2406.11403
•
Published
•
4
Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters
Paper
•
2406.16758
•
Published
•
19
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Paper
•
2406.16860
•
Published
•
57
Long Context Transfer from Language to Vision
Paper
•
2406.16852
•
Published
•
32
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
Paper
•
2406.15704
•
Published
•
5
HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale
Paper
•
2406.19280
•
Published
•
60
Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity
Paper
•
2406.17720
•
Published
•
7
OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
Paper
•
2407.00114
•
Published
•
12
Understanding Alignment in Multimodal LLMs: A Comprehensive Study
Paper
•
2407.02477
•
Published
•
21
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
Paper
•
2407.03320
•
Published
•
92
TokenPacker: Efficient Visual Projector for Multimodal LLM
Paper
•
2407.02392
•
Published
•
21
Unveiling Encoder-Free Vision-Language Models
Paper
•
2406.11832
•
Published
•
49
RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models
Paper
•
2407.05131
•
Published
•
24
ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild
Paper
•
2407.04172
•
Published
•
22
Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge
Paper
•
2407.03958
•
Published
•
18
HEMM: Holistic Evaluation of Multimodal Foundation Models
Paper
•
2407.03418
•
Published
•
8
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation
Paper
•
2407.06135
•
Published
•
20
Vision language models are blind
Paper
•
2407.06581
•
Published
•
82
Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision
Paper
•
2407.06189
•
Published
•
24
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Paper
•
2407.07895
•
Published
•
40
PaliGemma: A versatile 3B VLM for transfer
Paper
•
2407.07726
•
Published
•
67
FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models
Paper
•
2407.11522
•
Published
•
8
VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models
Paper
•
2407.11691
•
Published
•
13
OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces
Paper
•
2407.11895
•
Published
•
7
Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model Co-development
Paper
•
2407.11784
•
Published
•
4
E5-V: Universal Embeddings with Multimodal Large Language Models
Paper
•
2407.12580
•
Published
•
39
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
Paper
•
2407.12772
•
Published
•
33
Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
Paper
•
2407.12679
•
Published
•
7
EVLM: An Efficient Vision-Language Model for Visual Understanding
Paper
•
2407.14177
•
Published
•
42
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Paper
•
2407.15841
•
Published
•
39
VideoGameBunny: Towards vision assistants for video games
Paper
•
2407.15295
•
Published
•
21
MIBench: Evaluating Multimodal Large Language Models over Multiple Images
Paper
•
2407.15272
•
Published
•
10
Visual Haystacks: Answering Harder Questions About Sets of Images
Paper
•
2407.13766
•
Published
•
2
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model
Paper
•
2407.16198
•
Published
•
13
VILA^2: VILA Augmented VILA
Paper
•
2407.17453
•
Published
•
39
Efficient Inference of Vision Instruction-Following Models with Elastic Cache
Paper
•
2407.18121
•
Published
•
16
Wolf: Captioning Everything with a World Summarization Framework
Paper
•
2407.18908
•
Published
•
31
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
Paper
•
2407.21770
•
Published
•
22
OmniParser for Pure Vision Based GUI Agent
Paper
•
2408.00203
•
Published
•
23
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Paper
•
2408.01800
•
Published
•
78
Language Model Can Listen While Speaking
Paper
•
2408.02622
•
Published
•
37
ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning
Paper
•
2408.02210
•
Published
•
7
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
Paper
•
2408.02718
•
Published
•
60
VITA: Towards Open-Source Interactive Omni Multimodal LLM
Paper
•
2408.05211
•
Published
•
46
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
Paper
•
2408.06327
•
Published
•
15
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Paper
•
2408.08872
•
Published
•
97
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Paper
•
2408.10188
•
Published
•
51
Segment Anything with Multiple Modalities
Paper
•
2408.09085
•
Published
•
21
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Paper
•
2408.11039
•
Published
•
56
GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models
Paper
•
2408.11817
•
Published
•
8
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Paper
•
2408.12528
•
Published
•
50
SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for Large-scale Vision-Language Models
Paper
•
2408.12114
•
Published
•
12
SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs
Paper
•
2408.11813
•
Published
•
11
Building and better understanding vision-language models: insights and future directions
Paper
•
2408.12637
•
Published
•
118
CogVLM2: Visual Language Models for Image and Video Understanding
Paper
•
2408.16500
•
Published
•
56
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
Paper
•
2408.15881
•
Published
•
20
Law of Vision Representation in MLLMs
Paper
•
2408.16357
•
Published
•
92
UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios
Paper
•
2408.17267
•
Published
•
23
VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images
Paper
•
2408.16176
•
Published
•
7
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
Paper
•
2408.16725
•
Published
•
52
VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges
Paper
•
2409.01071
•
Published
•
26
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Paper
•
2409.02889
•
Published
•
54
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
Paper
•
2409.02813
•
Published
•
28
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding
Paper
•
2409.03420
•
Published
•
25
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct
Paper
•
2409.05840
•
Published
•
45
POINTS: Improving Your Vision-language Model with Affordable Strategies
Paper
•
2409.04828
•
Published
•
22
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Paper
•
2409.06666
•
Published
•
55
Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types
Paper
•
2409.09269
•
Published
•
7
NVLM: Open Frontier-Class Multimodal LLMs
Paper
•
2409.11402
•
Published
•
72
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Paper
•
2409.12191
•
Published
•
74
Putting Data at the Centre of Offline Multi-Agent Reinforcement Learning
Paper
•
2409.12001
•
Published
•
3
MonoFormer: One Transformer for Both Diffusion and Autoregression
Paper
•
2409.16280
•
Published
•
17
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Paper
•
2409.17146
•
Published
•
103
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
Paper
•
2409.18042
•
Published
•
36
Emu3: Next-Token Prediction is All You Need
Paper
•
2409.18869
•
Published
•
91
MIO: A Foundation Model on Multimodal Tokens
Paper
•
2409.17692
•
Published
•
50
UniMuMo: Unified Text, Music and Motion Generation
Paper
•
2410.04534
•
Published
•
18
NL-Eye: Abductive NLI for Images
Paper
•
2410.02613
•
Published
•
22
Paper
•
2410.07073
•
Published
•
60
Personalized Visual Instruction Tuning
Paper
•
2410.07113
•
Published
•
69
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate
Paper
•
2410.07167
•
Published
•
37
Aria: An Open Multimodal Native Mixture-of-Experts Model
Paper
•
2410.05993
•
Published
•
107
Multimodal Situational Safety
Paper
•
2410.06172
•
Published
•
8
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models
Paper
•
2410.02740
•
Published
•
52
Video Instruction Tuning With Synthetic Data
Paper
•
2410.02713
•
Published
•
37
LLaVA-Critic: Learning to Evaluate Multimodal Models
Paper
•
2410.02712
•
Published
•
34
Distilling an End-to-End Voice Assistant Without Instruction Training Data
Paper
•
2410.02678
•
Published
•
22
MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents
Paper
•
2410.03450
•
Published
•
36
Baichuan-Omni Technical Report
Paper
•
2410.08565
•
Published
•
84
From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning
Paper
•
2410.06456
•
Published
•
35
LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models
Paper
•
2410.09732
•
Published
•
54
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
Paper
•
2410.10139
•
Published
•
50
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
Paper
•
2410.10563
•
Published
•
37
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
Paper
•
2410.10594
•
Published
•
22
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
Paper
•
2410.10818
•
Published
•
14
TVBench: Redesigning Video-Language Evaluation
Paper
•
2410.07752
•
Published
•
5
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
Paper
•
2410.12787
•
Published
•
30
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models
Paper
•
2410.13085
•
Published
•
20
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Paper
•
2410.13848
•
Published
•
27
MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures
Paper
•
2410.13754
•
Published
•
74
WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines
Paper
•
2410.12705
•
Published
•
29
Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant
Paper
•
2410.13360
•
Published
•
8
γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models
Paper
•
2410.13859
•
Published
•
7
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
Paper
•
2410.14669
•
Published
•
35
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
Paper
•
2410.11190
•
Published
•
20
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
Paper
•
2410.13861
•
Published
•
53
Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages
Paper
•
2410.16153
•
Published
•
42
Improve Vision Language Model Chain-of-thought Reasoning
Paper
•
2410.16198
•
Published
•
17
Mitigating Object Hallucination via Concentric Causal Attention
Paper
•
2410.15926
•
Published
•
14
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
Paper
•
2410.17637
•
Published
•
34
Can Knowledge Editing Really Correct Hallucinations?
Paper
•
2410.16251
•
Published
•
54
Unbounded: A Generative Infinite Game of Character Life Simulation
Paper
•
2410.18975
•
Published
•
34
WAFFLE: Multi-Modal Model for Automated Front-End Development
Paper
•
2410.18362
•
Published
•
11
ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting
Paper
•
2410.17856
•
Published
•
49
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data
Paper
•
2410.18558
•
Published
•
18
Paper
•
2410.21276
•
Published
•
79
Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines
Paper
•
2410.21220
•
Published
•
8
VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks
Paper
•
2410.19100
•
Published
•
6
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Paper
•
2410.23218
•
Published
•
46
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
Paper
•
2410.23266
•
Published
•
19
DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models
Paper
•
2411.00836
•
Published
•
15
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
Paper
•
2411.02327
•
Published
•
11
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
Paper
•
2411.04996
•
Published
•
48
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
Paper
•
2411.04923
•
Published
•
20
Analyzing The Language of Visual Tokens
Paper
•
2411.05001
•
Published
•
20
M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework
Paper
•
2411.06176
•
Published
•
44
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Paper
•
2411.10440
•
Published
•
89
Generative World Explorer
Paper
•
2411.11844
•
Published
•
55
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices
Paper
•
2411.10640
•
Published
•
37
Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts
Paper
•
2411.10669
•
Published
•
9
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
Paper
•
2411.13281
•
Published
•
14