matlok's Collections
Papers - Multimodal
TinyLLaVA: A Framework of Small-scale Large Multimodal Models
Paper • 2402.14289 • Published • 16
ImageBind: One Embedding Space To Bind Them All
Paper • 2305.05665 • Published • 3
DocLLM: A layout-aware generative language model for multimodal document understanding
Paper • 2401.00908 • Published • 173
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
Paper • 2206.02770 • Published • 3
Paper • 2104.03964 • Published • 2
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Paper • 2403.07508 • Published • 69
Veagle: Advancements in Multimodal Representation Learning
Paper • 2403.08773 • Published • 7
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Paper • 2304.14178 • Published • 2
Gemini: A Family of Highly Capable Multimodal Models
Paper • 2312.11805 • Published • 44
Flamingo: a Visual Language Model for Few-Shot Learning
Paper • 2204.14198 • Published • 13
Training Compute-Optimal Large Language Models
Paper • 2203.15556 • Published • 7
Paper • 2309.16609 • Published • 30
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Paper • 2402.12226 • Published • 37
Unifying Vision, Text, and Layout for Universal Document Processing
Paper • 2212.02623 • Published • 10
Uni-SMART: Universal Science Multimodal Analysis and Research Transformer
Paper • 2403.10301 • Published • 50
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Paper • 2403.10517 • Published • 28
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Paper • 2403.11703 • Published • 13
TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation
Paper • 2403.12906 • Published • 4
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models
Paper • 2403.13447 • Published • 16
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
Paper • 2403.14624 • Published • 49
A Multimodal Approach to Device-Directed Speech Detection with Large Language Models
Paper • 2403.14438 • Published • 2
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
Paper • 2403.15377 • Published • 16
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Paper • 2403.18814 • Published • 37
FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction
Paper • 2305.02549 • Published • 5
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
Paper • 2308.12966 • Published • 6
LVLM-Intrepret: An Interpretability Tool for Large Vision-Language Models
Paper • 2404.03118 • Published • 17
No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
Paper • 2404.04125 • Published • 25
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Paper • 2404.05719 • Published • 56
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
Paper • 2404.07973 • Published • 28
BLINK: Multimodal Large Language Models Can See but Not Perceive
Paper • 2404.12390 • Published • 23