M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework Paper • 2411.06176 • Published 4 days ago • 40
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation Paper • 2411.04997 • Published 6 days ago • 24
M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models Paper • 2411.04075 • Published 7 days ago • 13
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos Paper • 2411.04923 • Published 6 days ago • 20
EMMA: End-to-End Multimodal Model for Autonomous Driving Paper • 2410.23262 • Published 14 days ago • 2
DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models Paper • 2411.00836 • Published 15 days ago • 15
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models Paper • 2410.23266 • Published 14 days ago • 19
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents Paper • 2410.23218 • Published 14 days ago • 45
BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays Paper • 2410.21969 • Published 15 days ago • 8
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark Paper • 2410.19168 • Published 20 days ago • 19
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data Paper • 2410.18558 • Published 20 days ago • 17
CLEAR: Character Unlearning in Textual and Visual Modalities Paper • 2410.18057 • Published 21 days ago • 198
VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks Paper • 2410.19100 • Published 20 days ago • 6
Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction Paper • 2410.21169 • Published 16 days ago • 29
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples Paper • 2410.14669 • Published 26 days ago • 35
TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts Paper • 2410.18071 • Published 21 days ago • 6
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding Paper • 2410.17434 • Published 22 days ago • 24
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models Paper • 2410.17637 • Published 21 days ago • 34