WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild. Paper • arXiv:2406.04770 • Published Jun 7, 2024
Large Language Model Confidence Estimation via Black-Box Access. Paper • arXiv:2406.04370 • Published Jun 1, 2024