# LLM Judge
| [Paper](https://arxiv.org/abs/2306.05685) | [Demo](https://huggingface.co/spaces/lmsys/mt-bench) | [Leaderboard](https://chat.lmsys.org/?leaderboard) | [Human Annotation Dataset](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments) |
In this package, you can use MT-bench questions and prompts to evaluate your models with LLM-as-a-judge.
MT-bench is a set of challenging multi-turn open-ended questions for evaluating chat assistants.
To automate the evaluation process, we prompt strong LLMs like GPT-4 to act as judges and assess the quality of the models' responses.
## Contents
- [Install](#install)
- [Review Pre-Generated Model Answers and Judgments](#review-pre-generated-model-answers-and-judgments)
- [MT-Bench](#mt-bench)
- [Agreement Computation](#agreement-computation)
- [Release Plan](#release-plan)
- [Citation](#citation)
## Install
```
git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip install -e .
pip install openai anthropic ray
```
## Review Pre-Generated Model Answers and Judgments
The model answers and LLM judgments used in the paper are available on Google Drive.
You can download them and open a gradio demo to review them.
- Download the data:
```
cd fastchat/llm_judge
pip3 install gdown
gdown --fuzzy https://drive.google.com/file/d/1LNOc7NAc7BXM1LMhRlorsrMu38G9yoHT/view?usp=sharing
tar xzf llm_judge_repo_data.tar.gz
```
- Open a gradio [demo](https://huggingface.co/spaces/lmsys/mt-bench) for browsing the questions, answers, and judgments.
```
python qa_browser.py --share
```
A screenshot:
<img src="../../assets/qa_browser.png" width="90%">
## MT-Bench
### How to evaluate a model on MT-bench?
#### Step 1. Generate model answers to MT-bench questions
```
python gen_model_answer.py --model-path [MODEL-PATH] --model-id [MODEL-ID]
```
Note: `[MODEL-PATH]` is the path to the weights, which can be a local folder or a Hugging Face repo ID.
e.g.,
```
python gen_model_answer.py --model-path lmsys/fastchat-t5-3b-v1.0 --model-id fastchat-t5-3b-v1.0
```
The answers will be saved to `data/mt_bench/model_answer/[MODEL-ID].jsonl`.
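Each line of that file is a standalone JSON record. As a quick sanity check, a minimal sketch like the following prints the generated answers; the field names (`question_id`, `model_id`, and `choices` with per-turn `turns`) are assumptions about the record layout, so verify them against your own output file:
```
import json

# Minimal inspection sketch. Assumes each JSONL record carries
# "question_id", "model_id", and "choices" -> [{"turns": [...]}];
# verify these field names against your own answer file.
with open("data/mt_bench/model_answer/fastchat-t5-3b-v1.0.jsonl") as f:
    for line in f:
        record = json.loads(line)
        turns = record["choices"][0]["turns"]  # one answer string per turn
        print(record["question_id"], record["model_id"])
        for i, answer in enumerate(turns, start=1):
            print(f"  turn {i}: {answer[:80]}")
```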
You can also specify `--num-gpus-per-model` for model parallelism (needed for large models such as 65B) and `--num-gpus-total` to parallelize answer generation across multiple GPUs.
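For example, an invocation like the following (the GPU counts are illustrative) shards each model copy across 2 GPUs and uses 8 GPUs in total:
```
python gen_model_answer.py --model-path [MODEL-PATH] --model-id [MODEL-ID] --num-gpus-per-model 2 --num-gpus-total 8
```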
#### Step 2. Run GPT-4 judge with pairwise comparison against a baseline (default: gpt-3.5-turbo)
```
python gen_judgment.py --model-list [LIST-OF-MODEL-ID] --parallel [num-concurrent-api-call]
```
e.g.,
```
> python gen_judgment.py --model-list vicuna-13b-v1.2 alpaca-13b gpt-3.5-turbo --parallel 2
Stats:
{
"bench": "mt_bench",
"mode": "pairwise-baseline",
"judge": "gpt-4",
"baseline": "gpt-3.5-turbo",
"model_list": [
"vicuna-13b-v1.2",
"alpaca-13b",
"gpt-3.5-turbo",
],
"total_num_questions": 80,
"total_num_matches": 320,
"output_path": "data/mt_bench/model_judgment/gpt-4_pair.jsonl"
}
Press Enter to confirm...
```
The judgments will be saved to `data/mt_bench/model_judgment/gpt-4_pair.jsonl`.
#### Step 3. Show win-rate
```
> python show_result.py
Input file: data/mt_bench/model_judgment/gpt-4_pair.jsonl
win loss tie win_rate loss_rate
model
gpt-4 107 9 44 0.66875 0.05625
claude-v1 64 23 73 0.40000 0.14375
vicuna-13b-v1.2 21 72 67 0.13125 0.45000
alpaca-13b 5 129 26 0.03125 0.80625
llama-13b 1 139 20 0.00625 0.86875
```
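Each model plays 160 matches against the baseline (80 questions × 2 turns), and the rates are computed over all of them: for GPT-4, 107 / 160 = 0.66875 wins and 9 / 160 = 0.05625 losses, with the remaining 44 matches judged as ties.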
### Other grading options
Besides pairwise comparison against a fixed baseline model, we also support two additional grading options:
- `single`: do single-answer grading without pairwise comparison.
- `pairwise-all`: run pairwise comparisons between all model pairs on all questions.
#### Option 2: Single-answer grading
This option asks GPT-4 to grade and score a single answer without comparing it to another, so it is also scalable: its cost grows linearly with the number of models.
For each turn, GPT-4 gives the answer a score on a scale of 1 to 10. We then compute the average score over all turns.
- Generate GPT-4 judgments
```
python gen_judgment.py --mode single --model-list [LIST-OF-MODEL-ID] --parallel [num-concurrent-api-call]
Stats:
{
"bench": "mt_bench",
"mode": "single",
"judge": "gpt-4",
"baseline": null,
"model_list": [
"vicuna-13b-v1.2",
"llama-13b",
"alpaca-13b",
"gpt-3.5-turbo",
"gpt-4",
"claude-v1"
],
"total_num_questions": 80,
"total_num_matches": 960,
"output_path": "data/mt_bench/model_judgment/gpt-4_single.jsonl"
}
```
The judgments will be saved to `data/mt_bench/model_judgment/gpt-4_single.jsonl`.
- Show the MT-bench score
```
> python show_result.py --mode single
score
model
gpt-4 8.937500
gpt-3.5-turbo 7.925000
claude-v1 7.503125
vicuna-13b-v1.2 6.156250
alpaca-13b 4.918750
llama-13b 3.190625
```
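As a rough illustration of what this table reports, a sketch like the one below averages the per-turn scores in the judgment file; the `model` and `score` field names are assumptions about the record layout, so confirm them against `gpt-4_single.jsonl` before relying on it:
```
import json
from collections import defaultdict

# Rough sketch reproducing the per-model averages above.
# The "model" and "score" field names are assumed; confirm them
# against the records in gpt-4_single.jsonl.
scores = defaultdict(list)
with open("data/mt_bench/model_judgment/gpt-4_single.jsonl") as f:
    for line in f:
        record = json.loads(line)
        scores[record["model"]].append(record["score"])

for model, vals in sorted(scores.items(), key=lambda kv: -sum(kv[1]) / len(kv[1])):
    print(f"{model:20s} {sum(vals) / len(vals):.6f}")
```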
#### Option 3: Run GPT-4 judge with all-pair comparisons
Another option is to run pairwise comparisons on all possible model pairs.
This becomes more expensive as the number of models grows, but it gives more comprehensive information.
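Concretely, with `N` models this mode runs `N*(N-1)/2` pairs × 80 questions × 2 turns matches: the six models below produce 15 × 160 = 2,400 judgments, versus 160 per model in the pairwise-baseline mode.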
```
> python gen_judgment.py --mode pairwise-all --model-list [LIST-OF-MODEL-ID] --parallel [num-concurrent-api-call]
```
```
> python show_result.py --mode pairwise-all
Input file: data/mt_bench/model_judgment/gpt-4_pair.jsonl
win loss tie win_rate loss_rate
model
gpt-4 617 45 138 0.77125 0.05625
claude-v1 445 115 240 0.55625 0.14375
gpt-3.5-turbo 372 198 230 0.46500 0.24750
vicuna-13b-v1.2 242 310 248 0.30250 0.38750
alpaca-13b 104 515 181 0.13000 0.64375
llama-13b 20 617 163 0.02500 0.77125
```
### How to get GPT-3.5/GPT-4/Claude's answers?
- Run `python gen_api_answer.py --model [MODEL-NAME]` to generate answers from GPT-3.5/GPT-4 and Claude through their APIs.
## Agreement Computation
We released 3.3K human annotations for model responses generated by 6 models in response to 80 MT-bench questions. The dataset is available at [lmsys/mt_bench_human_judgments](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments).
You can use this data to compute the agreement between human and GPT-4.
### Download data
```
wget https://huggingface.co/datasets/lmsys/mt_bench_human_judgments/resolve/main/human_judgments.json
wget https://huggingface.co/datasets/lmsys/mt_bench_human_judgments/resolve/main/gpt4_pair_judgments.json
```
### Compute the agreement between human and GPT-4
```
python compute_agreement.py --judges gpt4-pair human --votefiles human_judgments.json gpt4_pair_judgments.json
```
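For intuition about what `compute_agreement.py` measures, here is a rough sketch of raw agreement between the two vote files; the field names (`question_id`, `model_a`, `model_b`, `turn`, `winner`) are assumptions about the JSON schema, so inspect the downloaded files to confirm them:
```
import json

# Rough raw-agreement sketch (no tie handling or per-annotator breakdown).
# All field names below are assumptions about the JSON schema; inspect
# human_judgments.json and gpt4_pair_judgments.json to confirm them.
def load_votes(path):
    with open(path) as f:
        votes = json.load(f)
    # Key each vote by the match it refers to; if a match has several
    # votes (e.g., multiple human annotators), the last one seen wins.
    return {(v["question_id"], v["model_a"], v["model_b"], v["turn"]): v["winner"]
            for v in votes}

human = load_votes("human_judgments.json")
gpt4 = load_votes("gpt4_pair_judgments.json")

common = sorted(set(human) & set(gpt4))
agree = sum(human[k] == gpt4[k] for k in common)
print(f"matches compared: {len(common)}, raw agreement: {agree / len(common):.3f}")
```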
## Release Plan
Our current release contains:
- The MT-bench questions in [data/mt_bench/question.jsonl](data/mt_bench/question.jsonl).
- The model answers and GPT-4 judgments available on Google Drive.
- The judge prompts in [data/judge_prompts.jsonl](data/judge_prompts.jsonl).
- The 3.3K expert-level human annotations at [lmsys/mt_bench_human_judgments](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments).
The next release will include:
- All data
- 30K arena conversations with human votes
- Other code
## Citation
If you find this repository helpful for your research, please consider citing the [paper](https://arxiv.org/abs/2306.05685) "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena":
```
@misc{zheng2023judging,
title={Judging LLM-as-a-judge with MT-Bench and Chatbot Arena},
author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric P. Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica},
year={2023},
eprint={2306.05685},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```