plmsmile's Collections
image llm
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation
Paper • 2404.19752 • 24 upvotes
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Paper • 2404.16821 • 56 upvotes
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Paper • 2403.07508 • 75 upvotes
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Paper • 2403.09611 • 126 upvotes
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models
Paper • 2403.13447 • 18 upvotes
MyVLM: Personalizing VLMs for User-Specific Queries
Paper • 2403.14599 • 16 upvotes
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Paper • 2403.18814 • 46 upvotes
LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model
Paper • 2404.01331 • 25 upvotes
OmniFusion Technical Report
Paper • 2404.06212 • 75 upvotes
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
Paper • 2404.06512 • 30 upvotes
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Paper • 2404.12803 • 30 upvotes
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
Paper • 2404.13013 • 31 upvotes
What matters when building vision-language models?
Paper • 2405.02246 • 102 upvotes
Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
Paper • 2405.09215 • 20 upvotes
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • 87 upvotes
Parrot: Multilingual Visual Instruction Tuning
Paper • 2406.02539 • 36 upvotes
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
Paper • 2406.09403 • 20 upvotes
What If We Recaption Billions of Web Images with LLaMA-3?
Paper • 2406.08478 • 40 upvotes
mDPO: Conditional Preference Optimization for Multimodal Large Language Models
Paper • 2406.11839 • 38 upvotes
VoCo-LLaMA: Towards Vision Compression with Large Language Models
Paper • 2406.12275 • 30 upvotes
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
Paper • 2406.17770 • 19 upvotes
Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models
Paper • 2406.17294 • 11 upvotes
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Paper • 2406.19389 • 53 upvotes
HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale
Paper • 2406.19280 • 62 upvotes
Understanding Alignment in Multimodal LLMs: A Comprehensive Study
Paper • 2407.02477 • 22 upvotes
Unveiling Encoder-Free Vision-Language Models
Paper • 2406.11832 • 51 upvotes
RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models
Paper • 2407.05131 • 24 upvotes