- Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models
  Paper • 2310.17567 • Published • 1
- This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models
  Paper • 2310.15941 • Published • 6
- Holistic Evaluation of Language Models
  Paper • 2211.09110 • Published • 1
- INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models
  Paper • 2306.04757 • Published • 4
Collections including paper arxiv:2306.05685
- JudgeLM: Fine-tuned Large Language Models are Scalable Judges
  Paper • 2310.17631 • Published • 31
- Prometheus: Inducing Fine-grained Evaluation Capability in Language Models
  Paper • 2310.08491 • Published • 50
- Generative Judge for Evaluating Alignment
  Paper • 2310.05470 • Published • 1
- Calibrating LLM-Based Evaluator
  Paper • 2309.13308 • Published • 10

- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena
  Paper • 2306.05685 • Published • 24
- prometheus-eval/Feedback-Collection
  Viewer • Updated • 100k • 266 • 94
- prometheus-eval/prometheus-13b-v1.0
  Text2Text Generation • Updated • 6.09k • 116
- HuggingFaceH4/ultrafeedback_binarized
  Viewer • Updated • 187k • 37.5k • 193