ChatRex: Taming Multimodal LLM for Joint Perception and Understanding Paper • 2411.18363 • Published Nov 27 • 5
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance Paper • 2411.02327 • Published Nov 4 • 11
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning Paper • 2411.02337 • Published Nov 4 • 36
Inference Optimal VLMs Need Only One Visual Token but Larger Models Paper • 2411.03312 • Published Nov 5 • 6
Building and better understanding vision-language models: insights and future directions Paper • 2408.12637 • Published Aug 22 • 121
Wavelets Are All You Need for Autoregressive Image Generation Paper • 2406.19997 • Published Jun 28 • 29
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models Paper • 2408.08872 • Published Aug 16 • 97
Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents Paper • 2408.07060 • Published Aug 13 • 40
Medical SAM 2: Segment medical images as video via Segment Anything Model 2 Paper • 2408.00874 • Published Aug 1 • 43
Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference Paper • 2403.14520 • Published Mar 21 • 33
🍃 MINT-1T Collection • Data for "MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens" • 13 items • Updated Jul 24 • 54
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model Paper • 2407.16198 • Published Jul 23 • 13
POGEMA: A Benchmark Platform for Cooperative Multi-Agent Navigation Paper • 2407.14931 • Published Jul 20 • 20
TabReD: A Benchmark of Tabular Machine Learning in-the-Wild Paper • 2406.19380 • Published Jun 27 • 47