leonardlin's Collections
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution • arXiv:2401.03065 • 11 upvotes
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation • arXiv:2305.01210 • 4 upvotes
AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models • arXiv:2309.06495 • 1 upvote
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI • arXiv:2311.16502 • 35 upvotes
GAIA: A Benchmark for General AI Assistants • arXiv:2311.12983 • 185 upvotes
GPQA: A Graduate-Level Google-Proof Q&A Benchmark • arXiv:2311.12022 • 25 upvotes
PromptBench: A Unified Library for Evaluation of Large Language Models • arXiv:2312.07910 • 15 upvotes
Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I Learned to Start Worrying about Prompt Formatting • arXiv:2310.11324 • 1 upvote
TrustLLM: Trustworthiness in Large Language Models • arXiv:2401.05561 • 66 upvotes
Benchmarking LLMs via Uncertainty Quantification • arXiv:2401.12794 • 1 upvote
When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards • arXiv:2402.01781 • 1 upvote
VBench: Comprehensive Benchmark Suite for Video Generative Models • arXiv:2311.17982 • 7 upvotes
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal • arXiv:2402.04249 • 4 upvotes
OpenToM: A Comprehensive Benchmark for Evaluating Theory-of-Mind Reasoning Capabilities of Large Language Models • arXiv:2402.06044 • 1 upvote
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment • arXiv:2303.16634 • 3 upvotes
LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models • arXiv:2402.10524 • 22 upvotes
Mind Your Format: Towards Consistent Evaluation of In-Context Learning Improvements • arXiv:2401.06766 • 2 upvotes
Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models • arXiv:2402.13887 • 1 upvote
tinyBenchmarks: Evaluating LLMs with Fewer Examples • arXiv:2402.14992 • 11 upvotes
Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap • arXiv:2402.19450 • 3 upvotes
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference • arXiv:2403.04132 • 38 upvotes
LiveCodeBench: Holistic and Contamination-Free Evaluation of Large Language Models for Code • arXiv:2403.07974 • 1 upvote
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models • arXiv:2404.18796 • 68 upvotes
WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild • arXiv:2406.04770 • 27 upvotes