On Memorization of Large Language Models in Logical Reasoning Paper • 2410.23123 • Published Oct 30, 2024 • 18
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks Paper • 2410.10563 • Published Oct 14, 2024 • 38
Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies Paper • 2308.03188 • Published Aug 6, 2023 • 2
Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction Paper • 2305.13903 • Published May 23, 2023
Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings Paper • 2305.02317 • Published May 3, 2023
WikiWhy: Answering and Explaining Cause-and-Effect Questions Paper • 2210.12152 • Published Oct 21, 2022 • 1
Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis Paper • 2210.05035 • Published Oct 10, 2022
TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation Paper • 2406.08656 • Published Jun 12, 2024 • 7
Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts Paper • 2406.16851 • Published Jun 24, 2024
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs Paper • 2406.18495 • Published Jun 26, 2024 • 12
MantisScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation Paper • 2406.15252 • Published Jun 21, 2024 • 14