MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI Paper • 2311.16502 • Published Nov 27, 2023
BLINK: Multimodal Large Language Models Can See but Not Perceive Paper • 2404.12390 • Published Apr 18, 2024
RULER: What's the Real Context Size of Your Long-Context Language Models? Paper • 2404.06654 • Published Apr 9, 2024
CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues Paper • 2404.03820 • Published Apr 4, 2024
CodeEditorBench: Evaluating Code Editing Capability of Large Language Models Paper • 2404.03543 • Published Apr 4, 2024
Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings Paper • 2404.16820 • Published Apr 25, 2024
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension Paper • 2404.16790 • Published Apr 25, 2024
On the Planning Abilities of Large Language Models -- A Critical Investigation Paper • 2305.15771 • Published May 25, 2023
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis Paper • 2405.21075 • Published May 31, 2024
Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning Paper • 2406.09170 • Published Jun 13, 2024
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding Paper • 2406.09411 • Published Jun 13, 2024
CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery Paper • 2406.08587 • Published Jun 12, 2024
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs Paper • 2406.11833 • Published Jun 17, 2024
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation Paper • 2406.09961 • Published Jun 14, 2024
BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack Paper • 2406.10149 • Published Jun 14, 2024