Using LLM-as-a-judge 🧑‍⚖️ for automated and versatile evaluation
Author: Aymeric Roucher
Evaluating Large Language Models (LLMs) is often a difficult endeavour: given their broad capabilities, the tasks we give them usually have to be judged against requirements that are very broad and loosely defined. For instance, an assistant's answer to a question could be:
- not grounded in context
- repetitive, repetitive, repetitive
- grammatically incorrect
- excessively lengthy and characterized by an overabundance of words, leading to a situation where the discourse or written content becomes overly detailed and protracted
- incoherent
- …
The list of criteria goes on and on. And even if we had a limited list, each criterion would be hard to measure: "devising a rule-based program to assess the outputs is extremely challenging. Traditional evaluation metrics based on the similarity between outputs and reference answers (e.g., ROUGE, BLEU) are also ineffective for these problems."
✅ A powerful solution to assess outputs in a human way, without requiring costly human time, is LLM-as-a-judge. This method was introduced in "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" - which I encourage you to read.
💡 The idea is simple: ask an LLM to do the grading for you. 🤖✓ But, as we will see, it does not work well out of the box: you need to set it up carefully to get good results.
!pip install huggingface_hub datasets pandas tqdm -q
import re
import pandas as pd
from tqdm.auto import tqdm
from datasets import load_dataset
from huggingface_hub import InferenceClient, notebook_login
tqdm.pandas() # load tqdm's pandas support
pd.set_option("display.max_colwidth", None)
notebook_login()
repo_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
llm_client = InferenceClient(
model=repo_id,
timeout=120,
)
# Test your LLM client
llm_client.text_generation(prompt="How are you today?", max_new_tokens=20)
1. Prepare the creation and evaluation of our LLM judge
Let's say you want to give an LLM a specific task, like answering open-ended questions.
The difficulty is that, as we discussed above, measuring the answer's quality is hard: for instance, an exact string match would flag many correct but differently worded answers as false.
You could get human labellers to judge the outputs, but this costs them a lot of time, and if you want to update the model or the questions, you have to do it all over again.
✅ In this case, you can set up an LLM-as-a-judge.
But to use an LLM-as-a-judge, you first need to evaluate how reliably it rates your model outputs.
➡️ So the first step will be... to create a human evaluation dataset. You only need human annotations for a few examples - around 30 should be enough to get a good idea of the performance.
And you will be able to re-use this dataset every time you want to test your LLM-as-a-judge.
In our case, we will use feedbackQA, which contains 2 human evaluations and scores for each question/answer couple: a sample of 30 examples is representative of what your small evaluation dataset could look like.
ratings = load_dataset("McGill-NLP/feedbackQA")["train"]
ratings = pd.DataFrame(ratings)
ratings["review_1"] = ratings["feedback"].apply(lambda x: x["rating"][0])
ratings["explanation_1"] = ratings["feedback"].apply(lambda x: x["explanation"][0])
ratings["review_2"] = ratings["feedback"].apply(lambda x: x["rating"][1])
ratings["explanation_2"] = ratings["feedback"].apply(lambda x: x["explanation"][1])
ratings = ratings.drop(columns=["feedback"])
# Map scores to numeric values
conversion_dict = {"Excellent": 4, "Acceptable": 3, "Could be Improved": 2, "Bad": 1}
ratings["score_1"] = ratings["review_1"].map(conversion_dict)
ratings["score_2"] = ratings["review_2"].map(conversion_dict)
It is always a good idea to compute a baseline for performance: here it can be, for instance, the agreement between the two human raters, as measured by the Pearson correlation of the scores they give.
>>> print("Correlation between 2 human raters:")
>>> print(f"{ratings['score_1'].corr(ratings['score_2'], method='pearson'):.3f}")
Correlation between 2 human raters: 0.563
The correlation between the two human raters is not that good. If your human ratings are really bad, it probably means the rating criteria are not clear enough.
This means that our "ground truth" contains some noise: we therefore cannot expect any algorithmic evaluation to come very close to it.
However, we can reduce this noise:
- By taking the average score as our ground truth instead of any single score, we should iron out some of the irregularities.
- By only selecting the samples where the human reviewers agree.
Here, we will choose the last option and only keep examples where the two human reviewers agree.
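For reference, the first option (averaging) would be a one-liner; here is a minimal sketch, not used in the rest of this notebook (the ratings_with_avg name is just illustrative):
# Alternative, not used below: average the two human scores to smooth out some of the noise
ratings_with_avg = ratings.copy()
ratings_with_avg["human_score"] = ratings_with_avg[["score_1", "score_2"]].mean(axis=1)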
# Sample examples
ratings_where_raters_agree = ratings.loc[ratings["score_1"] == ratings["score_2"]]
examples = ratings_where_raters_agree.groupby("score_1").sample(7, random_state=1214)
examples["human_score"] = examples["score_1"]
# Visualize 1 sample for each score
display(examples.groupby("human_score").first())
2. Create our LLM judge
We build our LLM judge with a basic prompt, containing these elements:
- a task description
- a scale description: minimum, maximum, value type (float here)
- an explanation of the output format
- a beginning of an answer, to take the LLM by the hand as far as we can
JUDGE_PROMPT = """
You will be given a user_question and system_answer couple.
Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question.
Give your answer as a float on a scale of 0 to 10, where 0 means that the system_answer is not helpful at all, and 10 means that the answer completely and helpfully addresses the question.
Provide your feedback as follows:
Feedback:::
Total rating: (your rating, as a float between 0 and 10)
Now here are the question and answer.
Question: {question}
Answer: {answer}
Feedback:::
Total rating: """
examples["llm_judge"] = examples.progress_apply(
lambda x: llm_client.text_generation(
prompt=JUDGE_PROMPT.format(question=x["question"], answer=x["answer"]),
max_new_tokens=1000,
),
axis=1,
)
def extract_judge_score(answer: str, split_str: str = "Total rating:") -> float:
try:
if split_str in answer:
rating = answer.split(split_str)[1]
else:
rating = answer
digit_groups = [el.strip() for el in re.findall(r"\d+(?:\.\d+)?", rating)]
return float(digit_groups[0])
except Exception as e:
print(e)
return None
examples["llm_judge_score"] = examples["llm_judge"].apply(extract_judge_score)
# Rescale the LLM score to compare it with the human score
# (any affine rescaling with a positive slope leaves the Pearson correlation unchanged)
examples["llm_judge_score"] = (examples["llm_judge_score"] / 10) + 1
>>> print("Correlation between LLM-as-a-judge and the human raters:")
>>> print(f"{examples['llm_judge_score'].corr(examples['human_score'], method='pearson'):.3f}")
Correlation between LLM-as-a-judge and the human raters: 0.567
This is not bad, given that the Pearson correlation between two random, independent variables would be 0!
But we can easily do better. 🔝
3. Improve the LLM judge
As Aparna Dhinakaran has shown, LLMs are not great at evaluating outputs on continuous ranges. Her article gives us a few best practices to build a better prompt:
- ⏳ Leave more time for thought by adding an Evaluation field before the final answer.
- 🔢 Use a small integer scale like 1-4 or 1-5 instead of the large float scale we used before.
- 👩‍🏫 Provide an indicative scale to guide the grading.
- We even add a carrot to motivate the LLM (a little extra incentive, like you would offer a person a reward)!
IMPROVED_JUDGE_PROMPT = """
You will be given a user_question and system_answer couple.
Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question.
Give your answer on a scale of 1 to 4, where 1 means that the system_answer is not helpful at all, and 4 means that the system_answer completely and helpfully addresses the user_question.
Here is the scale you should use to build your answer:
1: The system_answer is terrible: completely irrelevant to the question asked, or very partial
2: The system_answer is mostly not helpful: misses some key aspects of the question
3: The system_answer is mostly helpful: provides support, but still could be improved
4: The system_answer is excellent: relevant, direct, detailed, and addresses all the concerns raised in the question
Provide your feedback as follows:
Feedback:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 4)
You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.
Now here are the question and answer.
Question: {question}
Answer: {answer}
Provide your feedback. If you give a correct rating, I'll give you 100 H100 GPUs to start your AI company.
Feedback:::
Evaluation: """
examples["llm_judge_improved"] = examples.progress_apply(
lambda x: llm_client.text_generation(
prompt=IMPROVED_JUDGE_PROMPT.format(question=x["question"], answer=x["answer"]),
max_new_tokens=500,
),
axis=1,
)
examples["llm_judge_improved_score"] = examples["llm_judge_improved"].apply(extract_judge_score)
>>> print("Correlation between LLM-as-a-judge and the human raters:")
>>> print(f"{examples['llm_judge_improved_score'].corr(examples['human_score'], method='pearson'):.3f}")
Correlation between LLM-as-a-judge and the human raters: 0.843
With a few tweaks to the prompt, the correlation improved by nearly 30% (a few of those points are due to my shamelessly tipping the LLM, a tip which I hereby declare to be non-legally-binding).
Quite impressive! 👏
Let's display a few errors of our LLM judge to analyse them:
errors = pd.concat(
[
examples.loc[examples["llm_judge_improved_score"] > examples["human_score"]].head(1),
examples.loc[examples["llm_judge_improved_score"] < examples["human_score"]].head(2),
]
)
display(
errors[
[
"question",
"answer",
"human_score",
"explanation_1",
"llm_judge_improved_score",
"llm_judge_improved",
]
]
)
The disagreements of our LLM judge are minor: overall, our system seems to have reached a good level of performance!
4. How do we take our LLM judge even further?
🎯 You will never reach 100%: first of all, our human ground truth certainly contains some noise, so agreement/correlation will never go up to 100% even with a perfect LLM judge.
🧭 Provide a reference: if you have access to a reference answer for each question, you should definitely give it to the judge LLM in its prompt to get better results! (See the sketch after the additive-scale example below.)
▶️ Provide few-shot examples: adding a few examples of questions and ground-truth evaluations in the prompt can improve the results. (I tried it here; it did not improve results in this case, so I skipped it, but it could work for your dataset!)
➕ Additive scale: when the judgement can be split into atomic criteria, using an additive scale can further improve results, as shown below 👇
ADDITIVE_PROMPT = """
(...)
- Award 1 point if the answer is related to the question.
- Give 1 additional point if the answer is clear and precise.
- Provide 1 further point if the answer is true.
- One final point should be awarded if the answer provides additional resources to support the user.
...
"""
Conclusion
That's all for today, congrats for following along! 🥳
I have to leave you now: some weird people are banging on my door, claiming they have come on behalf of Mixtral to collect the H100s. 🤔