Inference Optimal VLMs Need Only One Visual Token but Larger Models Paper • 2411.03312 • Published Nov 2024 • 6
FactAlign: Long-form Factuality Alignment of Large Language Models Paper • 2410.01691 • Published Oct 2, 2024 • 8
Attention Prompting on Image for Large Vision-Language Models Paper • 2409.17143 • Published Sep 25, 2024 • 7
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution Paper • 2409.12191 • Published Sep 18, 2024 • 74
InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning Paper • 2409.12568 • Published Sep 19, 2024 • 47
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark Paper • 2409.02813 • Published Sep 4, 2024 • 28
Building and better understanding vision-language models: insights and future directions Paper • 2408.12637 • Published Aug 22, 2024 • 118
Vision-Language Modeling Collection • Our datasets and models for vision-language modeling • 5 items • Updated Jul 26, 2024 • 6
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model Paper • 2407.07053 • Published Jul 9, 2024 • 41
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text Paper • 2406.08418 • Published Jun 12, 2024 • 28
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction Paper • 2404.02905 • Published Apr 3, 2024 • 64
Linear Transformers with Learnable Kernel Functions are Better In-Context Models Paper • 2402.10644 • Published Feb 16, 2024 • 79