-
EVLM: An Efficient Vision-Language Model for Visual Understanding
Paper • 2407.14177 • Published • 42 -
ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild
Paper • 2407.04172 • Published • 22 -
facebook/chameleon-7b
Image-Text-to-Text • Updated • 20.5k • 166 -
vidore/colpali
Updated • 30.7k • 380
Collections
Discover the best community collections!
Collections including paper arxiv:2407.04172
-
Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models
Paper • 2406.17294 • Published • 10 -
TokenPacker: Efficient Visual Projector for Multimodal LLM
Paper • 2407.02392 • Published • 21 -
Understanding Alignment in Multimodal LLMs: A Comprehensive Study
Paper • 2407.02477 • Published • 21 -
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
Paper • 2407.03320 • Published • 92
-
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
Paper • 2406.18521 • Published • 28 -
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
Paper • 2407.01284 • Published • 75 -
ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild
Paper • 2407.04172 • Published • 22
-
iVideoGPT: Interactive VideoGPTs are Scalable World Models
Paper • 2405.15223 • Published • 12 -
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Paper • 2405.15574 • Published • 53 -
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 85 -
Matryoshka Multimodal Models
Paper • 2405.17430 • Published • 31
-
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Paper • 2405.09818 • Published • 126 -
Matryoshka Multimodal Models
Paper • 2405.17430 • Published • 31 -
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
Paper • 2406.02430 • Published • 30 -
An Image is Worth 32 Tokens for Reconstruction and Generation
Paper • 2406.07550 • Published • 55
-
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
Paper • 2306.17107 • Published • 11 -
On the Hidden Mystery of OCR in Large Multimodal Models
Paper • 2305.07895 • Published -
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
Paper • 2308.12966 • Published • 7 -
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Paper • 2401.15947 • Published • 49
-
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Paper • 1701.06538 • Published • 5 -
Sparse Networks from Scratch: Faster Training without Losing Performance
Paper • 1907.04840 • Published • 3 -
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Paper • 1910.02054 • Published • 4 -
A Mixture of h-1 Heads is Better than h Heads
Paper • 2005.06537 • Published • 2