arxiv:2310.08491

Prometheus: Inducing Fine-grained Evaluation Capability in Language Models

Published on Oct 12, 2023
· Featured in Daily Papers on Oct 13, 2023

Abstract

Recently, using a powerful proprietary Large Language Model (LLM) (e.g., GPT-4) as an evaluator for long-form responses has become the de facto standard. However, for practitioners with large-scale evaluation tasks and custom criteria in consideration (e.g., child-readability), using proprietary LLMs as an evaluator is unreliable due to the closed-source nature, uncontrolled versioning, and prohibitive costs. In this work, we propose Prometheus, a fully open-source LLM that is on par with GPT-4's evaluation capabilities when the appropriate reference materials (reference answer, score rubric) are provided. We first construct the Feedback Collection, a new dataset that consists of 1K fine-grained score rubrics, 20K instructions, and 100K responses and language feedback generated by GPT-4. Using the Feedback Collection, we train Prometheus, a 13B evaluator LLM that can assess any given long-form text based on a customized score rubric provided by the user. Experimental results show that Prometheus scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics, which is on par with GPT-4 (0.882) and greatly outperforms ChatGPT (0.392). Furthermore, measuring correlation with GPT-4 with 1222 customized score rubrics across four benchmarks (MT Bench, Vicuna Bench, Feedback Bench, Flask Eval) shows similar trends, bolstering Prometheus's capability as an evaluator LLM. Lastly, Prometheus achieves the highest accuracy on two human preference benchmarks (HHH Alignment & MT Bench Human Judgment) compared to open-sourced reward models explicitly trained on human preference datasets, highlighting its potential as a universal reward model. We open-source our code, dataset, and model at https://github.com/kaistAI/Prometheus.

Community

A review of 'Prometheus🔥: Inducing Fine-grained Evaluation Capability in Language Models'! 🤗

Abstract

  • Prometheus, a fully open-source LLM that is on par with GPT-4 in evaluation capability, is proposed in the paper.
  • The Feedback Collection, the dataset used to train Prometheus, is also proposed. It consists of 1K fine-grained score rubrics, 20K instructions, and 100K responses and language feedback generated by GPT-4.
  • Prometheus shows high correlation with GPT-4. In addition, Prometheus achieves the highest accuracy on two human preference benchmarks (HHH Alignment & MT-Bench Human Judgment).

The Feedback Collection Dataset

The Feedback Collection is a new dataset built for the sole purpose of fine-tuning Prometheus. Its four design considerations are as follows:

  1. Including as many reference materials (i.e., reference answer & scoring rubric) as possible.
  2. Maintaining a uniform length among the reference answers for each score (1 to 5) to prevent undesired length bias.
  3. Maintaining a uniform score distribution to prevent undesired decision bias.
  4. Limiting the scope of instructions and responses to realistic situations where a user is interacting with an LLM.

The individual components of Feedback Collection are as follows (a prompt-assembly sketch follows the list):

The components of Feedback Collection

  • Input

    1. Instruction: An instruction that a user would prompt to an arbitrary LLM.
    2. Response to Evaluate: A response to the instruction that the evaluator LM has to evaluate.
    3. Customized Score Rubric: A specification of novel criteria decided by the user.
    4. Reference Answer: A reference answer that would receive a score of 5. It enables the evaluator to use the mutual information between the reference answer and the response to make a scoring decision.
  • Output

    1. Feedback: A rationale for why the provided response would receive a particular score.
    2. Score: An integer score for the provided response that ranges from 1 to 5.
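
Putting the four input components together: below is a minimal sketch of how an Absolute Grading prompt could be assembled from the instruction, response, score rubric, and reference answer. The exact template wording used by the authors lives in the linked repository; the section headers and example strings here are illustrative assumptions.

```python
# Minimal, illustrative sketch of assembling an Absolute Grading prompt.
# The exact template used by the authors is in the Prometheus repository;
# the wording below is an assumption for illustration only.

def build_absolute_grading_prompt(
    instruction: str,
    response: str,
    rubric: str,
    reference_answer: str,
) -> str:
    """Combine instruction, response to evaluate, score rubric, and reference answer."""
    return (
        "###Task Description:\n"
        "Evaluate the response to the instruction using the score rubric and the "
        "reference answer. Write feedback, then give an integer score from 1 to 5.\n\n"
        f"###Instruction:\n{instruction}\n\n"
        f"###Response to evaluate:\n{response}\n\n"
        f"###Reference Answer (Score 5):\n{reference_answer}\n\n"
        f"###Score Rubric:\n{rubric}\n\n"
        "###Feedback:"
    )


if __name__ == "__main__":
    prompt = build_absolute_grading_prompt(
        instruction="Explain photosynthesis to a 7-year-old.",
        response="Plants eat sunlight and make sugar and the air we breathe.",
        rubric="Child-readability: is the explanation simple, accurate, and engaging?",
        reference_answer="Plants use sunlight, water, and air to make their own food...",
    )
    print(prompt)
```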

Fine-tuning an Evaluator LM

Using the Feedback Collection dataset, Llama-2-Chat (7B & 13B) was fine-tuned to obtain Prometheus🔥. A rough sketch of this step is shown below.
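
As a rough idea of what this fine-tuning step could look like, here is a supervised fine-tuning sketch using TRL. The Hub dataset ID, column names, and hyperparameters are assumptions for illustration, not the authors' exact recipe (their training code is in the linked repository), and keyword arguments may differ slightly across TRL versions.

```python
# Sketch of supervised fine-tuning on the Feedback Collection.
# Dataset ID, column names, and hyperparameters below are assumptions.
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Assumed Hub ID; the official release may use a different name.
dataset = load_dataset("kaist-ai/Feedback-Collection", split="train")

def to_text(example):
    # Target = GPT-4 feedback followed by the score; column names are assumed.
    return {"text": example["instruction"] + "\n" + example["output"]}

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model="meta-llama/Llama-2-13b-chat-hf",  # base model that becomes Prometheus-13B
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=4096,
    args=TrainingArguments(
        output_dir="prometheus-13b-sft",
        num_train_epochs=3,            # illustrative, not the paper's setting
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,
        bf16=True,
    ),
)
trainer.train()
```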

Experiment Setting

In the paper, human evaluation and GPT-4 evaluation are used as the standard, and the authors measure how closely Prometheus and the baselines can simulate them. Two schemes are used for this: Absolute Grading and Ranking Grading.

  • Absolute Grading: Given an instruction, a response to evaluate, and reference materials, the evaluator LM must generate feedback and a score within the range of 1 to 5. The following three experiments were conducted using Absolute Grading (see the correlation sketch after this list):

    1. Measuring the correlation with human evaluators
    2. Comparing the quality of the feedback using human evaluation
    3. Measuring the correlation with GPT-4 evaluation
    • Used Benchmarks: Feedback Bench, Vicuna Bench, MT-Bench, FLASK Eval
  • Ranking Grading: Given two responses, the evaluator determines which response is better by scoring each of them.

    • Used Benchmarks: MT-Bench Human Judgment, HHH Alignment
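
For reference, the Absolute Grading correlation experiments boil down to comparing the evaluator's 1-5 scores against the reference judge's scores (human or GPT-4) on the same items. A minimal sketch, using dummy score lists rather than numbers from the paper:

```python
# Compare evaluator-LM scores against reference-judge scores with Pearson correlation.
# The score lists below are dummy values, not results from the paper.
from scipy.stats import pearsonr

evaluator_scores = [5, 3, 4, 2, 5, 1, 4, 3]   # scores produced by the evaluator LM
reference_scores = [5, 3, 5, 2, 4, 1, 4, 2]   # human / GPT-4 scores on the same items

r, p_value = pearsonr(evaluator_scores, reference_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
```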

Experimental Results

1. Can Prometheus Closely Simulate Human Evaluation?

  • Correlation with Human Scoring: Prometheus shows high correlation on all of Feedback Bench, MT-Bench, and Vicuna Bench, which is on par with or slightly better than GPT-4.
  • Pairwise Comparison of the Feedback with Human Evaluation: In terms of human preference, Prometheus's feedback tends to be preferred over that of the other models.
  • Analysis of Why Prometheus's Feedback was Preferred: GPT-4 tends to be more neutral and abstract, while Prometheus shows a clear trend of expressing its opinion on whether the given response is good or not.

2. Can Prometheus Closely Simulate GPT-4 Evaluation?

  • Correlation with GPT-4 Scoring: Prometheus shows higher correlation with GPT-4 than baselines such as Llama-2-Chat-70B or ChatGPT. In addition, the results suggest that training directly on an evaluation dataset may be the best option for acquiring a task-specific evaluator LLM.

3. Can Prometheus Function as a Reward Model?

Training on an absolute grading scheme also improves performance on a ranking grading scheme, even without directly training on ranking evaluation instances. Moreover, this shows the possibility of using Prometheus as a reward model for RLHF! A minimal sketch of ranking via absolute scores is shown below.
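
One way to picture this: an absolute-grading evaluator can act as a pairwise judge (and hence a reward signal) by scoring each candidate independently and preferring the higher score. The helper below is a hypothetical sketch; `score_response` stands in for a call to Prometheus, and the `[RESULT]` marker is an assumption about the generation format.

```python
# Hypothetical sketch: turn an absolute-grading evaluator into a pairwise judge.
# `score_response` is a stand-in callable, not the authors' API; the "[RESULT]"
# output marker is an assumption about the feedback format.
import re

def parse_score(feedback: str) -> int:
    """Extract the final 1-5 integer score from the generated feedback text."""
    match = re.search(r"\[RESULT\]\s*([1-5])", feedback)
    return int(match.group(1)) if match else 1

def pick_preferred(score_response, instruction: str, resp_a: str, resp_b: str) -> str:
    """Return 'A' or 'B' depending on which response receives the higher score."""
    score_a = parse_score(score_response(instruction, resp_a))
    score_b = parse_score(score_response(instruction, resp_b))
    return "A" if score_a >= score_b else "B"
```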

It was truly amazing that an open-source model could be fine-tuned on high-quality feedback data and show evaluation performance on par with a proprietary model! Thank you so much for conducting this wonderful research!

※ This was only an abstractive summary, so a lot of information is missing. If you want more detail, I hope you can take the time to read the paper carefully.

