btjhjeon's Collections
Multimodal LLM
DocLLM: A layout-aware generative language model for multimodal document understanding • arXiv:2401.00908 • 173 upvotes
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training • arXiv:2401.00849 • 14 upvotes
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents • arXiv:2311.05437 • 40 upvotes
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing • arXiv:2311.00571 • 39 upvotes
LLaVA-φ: Efficient Multi-Modal Assistant with Small Language Model • arXiv:2401.02330 • 11 upvotes
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action • arXiv:2312.17172 • 24 upvotes
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks • arXiv:2206.08916
ImageBind: One Embedding Space To Bind Them All • arXiv:2305.05665 • 3 upvotes
Distilling Vision-Language Models on Millions of Videos • arXiv:2401.06129 • 13 upvotes
LEGO: Language Enhanced Multi-modal Grounding Model • arXiv:2401.06071 • 10 upvotes
Improving fine-grained understanding in image-text pre-training • arXiv:2401.09865 • 12 upvotes
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models • arXiv:2402.05935 • 12 upvotes
ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling • arXiv:2402.06118 • 12 upvotes
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models • arXiv:2402.07865 • 11 upvotes
BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models • arXiv:2402.13577 • 5 upvotes
A Touch, Vision, and Language Dataset for Multimodal Alignment • arXiv:2402.13232 • 11 upvotes
TinyLLaVA: A Framework of Small-scale Large Multimodal Models • arXiv:2402.14289 • 16 upvotes
Enhancing Vision-Language Pre-training with Rich Supervisions • arXiv:2403.03346 • 12 upvotes
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images • arXiv:2403.11703 • 13 upvotes
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models • arXiv:2403.13447 • 16 upvotes
When Do We Not Need Larger Vision Models? • arXiv:2403.13043 • 23 upvotes
Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference • arXiv:2403.14520 • 31 upvotes
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? • arXiv:2403.14624 • 50 upvotes
MoAI: Mixture of All Intelligence for Large Language and Vision Models • arXiv:2403.07508 • 69 upvotes
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models • arXiv:2403.18814 • 37 upvotes
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD • arXiv:2404.06512 • 29 upvotes
OmniFusion Technical Report • arXiv:2404.06212 • 73 upvotes
BLINK: Multimodal Large Language Models Can See but Not Perceive • arXiv:2404.12390 • 23 upvotes
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models • arXiv:2404.12387 • 34 upvotes
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation • arXiv:2404.14396 • 16 upvotes
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites • arXiv:2404.16821 • 48 upvotes
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension • arXiv:2404.16790 • 7 upvotes
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning • arXiv:2404.16994 • 30 upvotes