MRGen: Diffusion-based Controllable Data Engine for MRI Segmentation towards Unannotated Modalities Paper • 2412.04106 • Published Dec 4, 2024 • 5
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs Paper • 2411.15296 • Published Nov 22, 2024 • 19
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation Paper • 2411.13281 • Published Nov 20, 2024 • 17
RetrieveGPT: Merging Prompts and Mathematical Models for Enhanced Code-Mixed Information Retrieval Paper • 2411.04752 • Published Nov 7, 2024 • 16 • 3
SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs Paper • 2408.11813 • Published Aug 21, 2024 • 11
EVLM: An Efficient Vision-Language Model for Visual Understanding Paper • 2407.14177 • Published Jul 19, 2024 • 43
InternVL2.0 Collection Expanding Performance Boundaries of Open-Source MLLM • 15 items • 88
What Matters in Detecting AI-Generated Videos like Sora? Paper • 2406.19568 • Published Jun 27, 2024 • 13
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning Paper • 2406.17770 • Published Jun 25, 2024 • 19
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models Paper • 2405.15738 • Published May 24, 2024 • 43
Chameleon: Mixed-Modal Early-Fusion Foundation Models Paper • 2405.09818 • Published May 16, 2024 • 127
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning Paper • 2404.16994 • Published Apr 25, 2024 • 35
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites Paper • 2404.16821 • Published Apr 25, 2024 • 55
LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models Paper • 2404.03118 • Published Apr 3, 2024 • 23
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching Paper • 2404.03653 • Published Apr 4, 2024 • 33