Discover amazing AI apps made by the community!
VLMEvalKit Eval Results on video understanding benchmarks
JudgerBench Leaderboard
CompassJudger Subjective Evaluation Leaderboard
Leaderboard to Evaluate Arabic Multimodal Models
A realistic benchmark with real CRM tasks for LLM agents.
Comparing GPT-4o, o1-mini, o1-preview, and Claude 3.5 Sonnet
A Benchmark of Large Language Models in the Clinic
Leaderboard showcasing Turkish MMLU dataset results.