Multi-Modal Model - a wanghaofan Collection

Multi-Modal Model

updated Sep 24, 2024

What matters when building vision-language models?

Paper • 2405.02246 • Published May 3, 2024 • 102

MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data

Paper • 2406.18790 • Published Jun 26, 2024 • 34

Building and better understanding vision-language models: insights and future directions

Paper • 2408.12637 • Published Aug 22, 2024 • 125

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Paper • 2408.12528 • Published Aug 22, 2024 • 51

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

Paper • 2408.11039 • Published Aug 20, 2024 • 59

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Paper • 2408.08872 • Published Aug 16, 2024 • 98

Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

Paper • 2407.01392 • Published Jul 1, 2024 • 40

IPAdapter-Instruct: Resolving Ambiguity in Image-based Conditioning using Instruct Prompts

Paper • 2408.03209 • Published Aug 6, 2024 • 22

Qwen/Qwen2-VL-7B-Instruct

Image-Text-to-Text • Updated 2 days ago • 1.75M • 1.11k

THUDM/cogvlm2-llama3-chat-19B

Text Generation • Updated Sep 3, 2024 • 7.64k • 208

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

Paper • 2408.15998 • Published Aug 28, 2024 • 86

OmniGen: Unified Image Generation

Paper • 2409.11340 • Published Sep 17, 2024 • 111

PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions

Paper • 2409.15278 • Published Sep 23, 2024 • 24