BLINK: Multimodal Large Language Models Can See but Not Perceive Paper • 2404.12390 • Published Apr 18 • 23
RULER: What's the Real Context Size of Your Long-Context Language Models? Paper • 2404.06654 • Published Apr 9 • 32
CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues Paper • 2404.03820 • Published Apr 4 • 21
CodeEditorBench: Evaluating Code Editing Capability of Large Language Models Paper • 2404.03543 • Published Apr 4 • 15
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI Paper • 2311.16502 • Published Nov 27, 2023 • 33
Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings Paper • 2404.16820 • Published Apr 25 • 15
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension Paper • 2404.16790 • Published Apr 25 • 7
On the Planning Abilities of Large Language Models -- A Critical Investigation Paper • 2305.15771 • Published May 25, 2023 • 1