Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages Paper • 2410.16153 • Published Oct 21 • 43
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio Paper • 2410.12787 • Published Oct 16 • 30
LEOPARD : A Vision Language Model For Text-Rich Multi-Image Tasks Paper • 2410.01744 • Published Oct 2 • 26
UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models Paper • 2410.14059 • Published Oct 17 • 54
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models Paper • 2409.17146 • Published Sep 25 • 104
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination Paper • 2411.03823 • Published Nov 6 • 43
SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration Paper • 2410.02367 • Published Oct 3 • 47
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks Paper • 2410.05160 • Published Oct 7 • 4
LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models Paper • 2410.09732 • Published Oct 13 • 54
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction Paper • 2410.17247 • Published Oct 22 • 45
SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration Paper • 2411.10958 • Published Nov 17 • 50
Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts Paper • 2411.10669 • Published Nov 16 • 10
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation Paper • 2411.04997 • Published Nov 7 • 37
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models Paper • 2411.04996 • Published Nov 7 • 49
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling Paper • 2412.05271 • Published 16 days ago • 116
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval Paper • 2412.14475 • Published 3 days ago • 47
POINTS1.5: Building a Vision-Language Model towards Real World Applications Paper • 2412.08443 • Published 11 days ago • 38