- MM-LLMs: Recent Advances in MultiModal Large Language Models
  Paper • 2401.13601 • Published • 42
- A Touch, Vision, and Language Dataset for Multimodal Alignment
  Paper • 2402.13232 • Published • 12
- Neural Network Diffusion
  Paper • 2402.13144 • Published • 94
- FlashTex: Fast Relightable Mesh Texturing with LightControlNet
  Paper • 2402.13251 • Published • 13
Collections
Collections including paper arxiv:2405.20204
- LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
  Paper • 2306.17107 • Published • 12
- On the Hidden Mystery of OCR in Large Multimodal Models
  Paper • 2305.07895 • Published
- Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
  Paper • 2308.12966 • Published • 6
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
  Paper • 2401.15947 • Published • 48

- Contrastive Decoding Improves Reasoning in Large Language Models
  Paper • 2309.09117 • Published • 37
- RMT: Retentive Networks Meet Vision Transformers
  Paper • 2309.11523 • Published • 32
- Guiding a Diffusion Model with a Bad Version of Itself
  Paper • 2406.02507 • Published • 15
- Jina CLIP: Your CLIP Model Is Also Your Text Retriever
  Paper • 2405.20204 • Published • 28