btjhjeon's Collections: Multimodal LLM (updated)
DocLLM: A layout-aware generative language model for multimodal document understanding
Paper • 2401.00908 • Published • 178
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
Paper • 2401.00849 • Published • 14
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Paper • 2311.05437 • Published • 42
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
Paper • 2311.00571 • Published • 40
LLaVA-φ: Efficient Multi-Modal Assistant with Small Language Model
Paper • 2401.02330 • Published • 14
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
Paper • 2312.17172 • Published • 26
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
Paper • 2206.08916 • Published
ImageBind: One Embedding Space To Bind Them All
Paper • 2305.05665 • Published • 3
Distilling Vision-Language Models on Millions of Videos
Paper • 2401.06129 • Published • 14
LEGO: Language Enhanced Multi-modal Grounding Model
Paper • 2401.06071 • Published • 10
Improving fine-grained understanding in image-text pre-training
Paper • 2401.09865 • Published • 15
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Paper • 2402.05935 • Published • 15
ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling
Paper • 2402.06118 • Published • 13
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models
Paper • 2402.07865 • Published • 12
BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models
Paper • 2402.13577 • Published • 7
A Touch, Vision, and Language Dataset for Multimodal Alignment
Paper • 2402.13232 • Published • 13
TinyLLaVA: A Framework of Small-scale Large Multimodal Models
Paper • 2402.14289 • Published • 19
Enhancing Vision-Language Pre-training with Rich Supervisions
Paper • 2403.03346 • Published • 14
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Paper • 2403.11703 • Published • 16
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models
Paper • 2403.13447 • Published • 17
When Do We Not Need Larger Vision Models?
Paper • 2403.13043 • Published • 25
Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference
Paper • 2403.14520 • Published • 32
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
Paper • 2403.14624 • Published • 50
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Paper • 2403.07508 • Published • 75
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Paper • 2403.18814 • Published • 44
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
Paper • 2404.06512 • Published • 29
OmniFusion Technical Report
Paper • 2404.06212 • Published • 73
BLINK: Multimodal Large Language Models Can See but Not Perceive
Paper • 2404.12390 • Published • 24
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
Paper • 2404.12387 • Published • 38
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
Paper • 2404.14396 • Published • 18
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Paper • 2404.16821 • Published • 53
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
Paper • 2404.16790 • Published • 7
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Paper • 2404.16994 • Published • 35
What matters when building vision-language models?
Paper • 2405.02246 • Published • 97
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 84
Matryoshka Multimodal Models
Paper • 2405.17430 • Published • 30
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models
Paper • 2405.15738 • Published • 43
Needle In A Multimodal Haystack
Paper • 2406.07230 • Published • 52
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
Paper • 2406.09961 • Published • 54
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Paper • 2406.08418 • Published • 28
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
Paper • 2406.09403 • Published • 19
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Paper • 2406.08707 • Published • 15
CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark
Paper • 2406.05967 • Published • 5
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
Paper • 2406.11833 • Published • 61
mDPO: Conditional Preference Optimization for Multimodal Large Language Models
Paper • 2406.11839 • Published • 36
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens
Paper • 2406.11271 • Published • 18
TokenPacker: Efficient Visual Projector for Multimodal LLM
Paper • 2407.02392 • Published • 21
Understanding Alignment in Multimodal LLMs: A Comprehensive Study
Paper • 2407.02477 • Published • 21
Vision language models are blind
Paper • 2407.06581 • Published • 80
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Paper • 2406.16860 • Published • 54
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation
Paper • 2407.06135 • Published • 20
MAVIS: Mathematical Visual Instruction Tuning
Paper • 2407.08739 • Published • 30
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model
Paper • 2407.07053 • Published • 41
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Paper • 2407.07895 • Published • 40
SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers
Paper • 2407.09413 • Published • 9
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model
Paper • 2407.16198 • Published • 13
VILA^2: VILA Augmented VILA
Paper • 2407.17453 • Published • 38
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Paper • 2408.01800 • Published • 74
VITA: Towards Open-Source Interactive Omni Multimodal LLM
Paper • 2408.05211 • Published • 46
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
Paper • 2408.04840 • Published • 30
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Paper • 2408.08872 • Published • 96
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Paper • 2408.10188 • Published • 51
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Paper • 2408.12528 • Published • 50
Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications
Paper • 2408.11878 • Published • 48
Building and better understanding vision-language models: insights and future directions
Paper • 2408.12637 • Published • 109
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Paper • 2408.15998 • Published • 80
CogVLM2: Visual Language Models for Image and Video Understanding
Paper • 2408.16500 • Published • 55
Law of Vision Representation in MLLMs
Paper • 2408.16357 • Published • 91
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
Paper • 2408.15881 • Published • 19
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Paper • 2409.02889 • Published • 53
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding
Paper • 2409.03420 • Published • 23