HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs Paper • 2412.18925 • Published 9 days ago
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer Paper • 2412.13871 • Published 16 days ago
Multimodal Latent Language Modeling with Next-Token Diffusion Paper • 2412.08635 • Published 23 days ago
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? Paper • 2412.02611 • Published Dec 3, 2024
Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model Paper • 2411.19108 • Published Nov 28, 2024
VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI Paper • 2410.11623 • Published Oct 15, 2024
Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts Paper • 2410.10626 • Published Oct 14, 2024
meta-llama/Llama-3.2-11B-Vision-Instruct Image-Text-to-Text • Updated about 1 month ago
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture Paper • 2409.02889 • Published Sep 4, 2024