MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning Paper • 2503.07459 • Published 6 days ago • 14 • 3
ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks Paper • 2503.06885 • Published 7 days ago • 3 • 3
MinorBench: A hand-built benchmark for content-based risks for children Paper • 2503.10242 • Published 3 days ago • 4 • 3
FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation Paper • 2503.06680 • Published 7 days ago • 17 • 7
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning Paper • 2503.10291 • Published 3 days ago • 29 • 3
Charting and Navigating Hugging Face's Model Atlas Paper • 2503.10633 • Published 3 days ago • 52 • 5
Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models Paper • 2503.09573 • Published 4 days ago • 50 • 3
VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering Paper • 2503.06492 • Published 7 days ago • 9 • 4
Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models Paper • 2503.01763 • Published 13 days ago • 4 • 2
FLAME: A Federated Learning Benchmark for Robotic Manipulation Paper • 2503.01729 • Published 13 days ago • 4 • 2
Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection Paper • 2503.01449 • Published 13 days ago • 4 • 2
CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time Cognitive Task Solving and Reasoning in UAVs Paper • 2503.01378 • Published 13 days ago • 3 • 2
SwiLTra-Bench: The Swiss Legal Translation Benchmark Paper • 2503.01372 • Published 13 days ago • 3 • 2
LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention Paper • 2502.14866 • Published 24 days ago • 12 • 2