The Differences Between Direct Alignment Algorithms are a Blur • Paper • 2502.01237 • Published 4 days ago • 108
Qwen2.5-VL • Collection • Vision-language model series based on Qwen2.5 • 3 items • Updated 11 days ago • 324
3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding • Paper • 2412.18450 • Published Dec 24, 2024 • 33
PaliGemma 2: A Family of Versatile VLMs for Transfer • Paper • 2412.03555 • Published Dec 4, 2024 • 126
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding • Paper • 2411.18363 • Published Nov 27, 2024 • 10
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance • Paper • 2411.02327 • Published Nov 4, 2024 • 11
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning • Paper • 2411.02337 • Published Nov 4, 2024 • 35
Inference Optimal VLMs Need Only One Visual Token but Larger Models • Paper • 2411.03312 • Published Nov 5, 2024 • 6
Building and better understanding vision-language models: insights and future directions • Paper • 2408.12637 • Published Aug 22, 2024 • 125
Wavelets Are All You Need for Autoregressive Image Generation • Paper • 2406.19997 • Published Jun 28, 2024 • 30
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models • Paper • 2408.08872 • Published Aug 16, 2024 • 98
Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents • Paper • 2408.07060 • Published Aug 13, 2024 • 42
Medical SAM 2: Segment medical images as video via Segment Anything Model 2 • Paper • 2408.00874 • Published Aug 1, 2024 • 48
Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference • Paper • 2403.14520 • Published Mar 21, 2024 • 34