GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation Paper • 2411.18499 • Published Nov 27, 2024 • 18
VLSBench: Unveiling Visual Leakage in Multimodal Safety Paper • 2411.19939 • Published Nov 29, 2024 • 10
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? Paper • 2412.02611 • Published Dec 3, 2024 • 23
U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs Paper • 2412.03205 • Published Dec 4, 2024 • 16
ProcessBench: Identifying Process Errors in Mathematical Reasoning Paper • 2412.06559 • Published Dec 9, 2024 • 80
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations Paper • 2412.07626 • Published Dec 10, 2024 • 22
VisionArena: 230K Real World User-VLM Conversations with Preference Labels Paper • 2412.08687 • Published Dec 11, 2024 • 13
SCBench: A KV Cache-Centric Analysis of Long-Context Methods Paper • 2412.10319 • Published Dec 13, 2024 • 10
Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models Paper • 2412.12606 • Published Dec 17, 2024 • 41
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks Paper • 2412.14161 • Published Dec 18, 2024 • 51
RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment Paper • 2412.13746 • Published Dec 18, 2024 • 9
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings Paper • 2501.01257 • Published Jan 2, 2025 • 48
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models Paper • 2501.02955 • Published Jan 6, 2025 • 40
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? Paper • 2501.05510 • Published Jan 9, 2025 • 39
HALoGEN: Fantastic LLM Hallucinations and Where to Find Them Paper • 2501.08292 • Published Jan 14, 2025 • 17
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents Paper • 2501.08828 • Published Jan 15, 2025 • 30
ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario Paper • 2501.10132 • Published Jan 17, 2025 • 17
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding Paper • 2501.12380 • Published Jan 21, 2025 • 82
EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents Paper • 2501.11858 • Published Jan 21, 2025 • 5
MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models Paper • 2502.00698 • Published Feb 2, 2025 • 22
BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models Paper • 2502.07346 • Published Feb 11, 2025 • 39
SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation Paper • 2502.08168 • Published Feb 12, 2025 • 10