LLM Judge
| Paper | Demo | Leaderboard | Human Annotation Dataset |
In this package, you can use MT-bench questions and prompts to evaluate your models with LLM-as-a-judge. MT-bench is a set of challenging multi-turn open-ended questions for evaluating chat assistants. To automate the evaluation process, we prompt strong LLMs like GPT-4 to act as judges and assess the quality of the models' responses.
Contents
- Install
- Review Pre-Generated Model Answers and Judgments
- MT-Bench
- Agreement Computation
- Release Plan
- Citation
Install
git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip install -e .
pip install openai anthropic ray
Review Pre-Generated Model Answers and Judgments
The model answers and LLM judgments used in the paper are available on Google Drive. You can download them and open a gradio demo to review them.
- Download the data:
cd fastchat/llm_judge
pip3 install gdown
gdown --fuzzy https://drive.google.com/file/d/1LNOc7NAc7BXM1LMhRlorsrMu38G9yoHT/view?usp=sharing
tar xzf llm_judge_repo_data.tar.gz
- Open a gradio demo for browsing the questions, answers, and judgments.
python qa_browser.py --share
MT-Bench
How to evaluate a model on MT-bench?
Step 1. Generate model answers to MT-bench questions
python gen_model_answer.py --model-path [MODEL-PATH] --model-id [MODEL-ID]
Note: [MODEL-PATH] is the path to the weights, which can be a local folder or a Hugging Face repo ID.
e.g.,
python gen_model_answer.py --model-path lmsys/fastchat-t5-3b-v1.0 --model-id fastchat-t5-3b-v1.0
The answers will be saved to data/mt_bench/model_answer/[MODEL-ID].jsonl.
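If you want a quick sanity check on the generated file, it is in JSON Lines format (one JSON object per line). A minimal sketch, reusing the fastchat-t5-3b-v1.0 example above and not assuming any particular field names:

```python
# Count the generated answers and inspect their schema.
# Only assumption: the .jsonl file holds one JSON object per line.
import json

path = "data/mt_bench/model_answer/fastchat-t5-3b-v1.0.jsonl"  # adjust to your [MODEL-ID]
with open(path) as f:
    answers = [json.loads(line) for line in f]

print(len(answers))               # MT-bench has 80 questions, so expect 80 records
print(sorted(answers[0].keys()))  # print the schema instead of guessing field names
```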
You can also specify --num-gpus-per-model for model parallelism (needed for large 65B models) and --num-gpus-total to parallelize answer generation with multiple GPUs.
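For example, the following (illustrative values; adjust to your hardware and model size) shards each model replica across 2 GPUs while using 8 GPUs in total:
python gen_model_answer.py --model-path [MODEL-PATH] --model-id [MODEL-ID] --num-gpus-per-model 2 --num-gpus-total 8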
Step 2. Run GPT-4 judge with pairwise comparison against a baseline (default: gpt-3.5-turbo)
python gen_judgment.py --model-list [LIST-OF-MODEL-ID] --parallel [num-concurrent-api-call]
e.g.,
> python gen_judgment.py --model-list vicuna-13b-v1.2 alpaca-13b gpt-3.5-turbo --parallel 2
Stats:
{
  "bench": "mt_bench",
  "mode": "pairwise-baseline",
  "judge": "gpt-4",
  "baseline": "gpt-3.5-turbo",
  "model_list": [
    "vicuna-13b-v1.2",
    "alpaca-13b",
    "gpt-3.5-turbo"
  ],
  "total_num_questions": 80,
  "total_num_matches": 320,
  "output_path": "data/mt_bench/model_judgment/gpt-4_pair.jsonl"
}
Press Enter to confirm...
The judgments will be saved to data/mt_bench/model_judgment/gpt-4_pair.jsonl. (The 320 matches come from judging the two non-baseline models against the baseline on 80 questions × 2 turns each.)
Step 3. Show win-rate
> python show_result.py
Input file: data/mt_bench/model_judgment/gpt-4_pair.jsonl
win loss tie win_rate loss_rate
model
gpt-4 107 9 44 0.66875 0.05625
claude-v1 64 23 73 0.40000 0.14375
vicuna-13b-v1.2 21 72 67 0.13125 0.45000
alpaca-13b 5 129 26 0.03125 0.80625
llama-13b 1 139 20 0.00625 0.86875
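As a sanity check on these columns: each row sums to 160 matches (80 questions × 2 turns against the baseline), and win_rate/loss_rate are simply the win/loss counts divided by that total. Reproducing the gpt-4 row:

```python
# Recompute win_rate / loss_rate for the gpt-4 row in the table above.
wins, losses, ties = 107, 9, 44
total = wins + losses + ties         # 160 = 80 questions x 2 turns vs. the baseline
print(wins / total, losses / total)  # 0.66875 0.05625, matching the table
```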
Other grading options
Besides pairwise comparison against a fixed baseline model, we also support two additional grading options:
- single: do single-answer grading without pairwise comparison.
- pairwise-all: run pairwise comparisons between all model pairs on all questions.
Option 2: Single-answer grading
This option asks GPT-4 to grade each answer on its own, without any pairwise comparison, so it also scales well as the number of models grows. For each turn, GPT-4 gives a score on a scale of 1 to 10. We then compute the average score over all turns.
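A toy illustration of that aggregation (the numbers are made up, not real judgments): every (question, turn) pair yields one score, and the reported MT-bench score is the plain mean over all of them.

```python
# Hypothetical per-turn GPT-4 scores for one model (illustrative only).
turn_scores = [9, 8, 10, 7, 6, 8]
mt_bench_score = sum(turn_scores) / len(turn_scores)
print(mt_bench_score)  # 8.0 -- in practice the mean runs over 80 questions x 2 turns
```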
- Generate GPT-4 judgments
python gen_judgment.py --mode single --model-list [LIST-OF-MODEL-ID] --parallel [num-concurrent-api-call]
Stats:
{
  "bench": "mt_bench",
  "mode": "single",
  "judge": "gpt-4",
  "baseline": null,
  "model_list": [
    "vicuna-13b-v1.2",
    "llama-13b",
    "alpaca-13b",
    "gpt-3.5-turbo",
    "gpt-4",
    "claude-v1"
  ],
  "total_num_questions": 80,
  "total_num_matches": 960,
  "output_path": "data/mt_bench/model_judgment/gpt-4_single.jsonl"
}
The judgments will be saved to data/mt_bench/model_judgment/gpt-4_single.jsonl.
- Show the MT-bench score
> python show_result.py --mode single
score
model
gpt-4 8.937500
gpt-3.5-turbo 7.925000
claude-v1 7.503125
vicuna-13b-v1.2 6.156250
alpaca-13b 4.918750
llama-13b 3.190625
Option 3: Run GPT-4 judge with all pair comparisons
Another option is to run pairwise comparisons on all possible model pairs. This gets more expensive as the number of models increases, but it gives more comprehensive information.
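To get a feel for the cost, the number of GPT-4 judgments in this mode is (number of model pairs) × 80 questions × 2 turns, judging from the match counts elsewhere in this README. A rough estimate for the six models shown below:

```python
# Rough cost estimate for pairwise-all mode (counts only, no API calls made).
from math import comb

n_models = 6
matches = comb(n_models, 2) * 80 * 2  # every unordered pair, 80 questions, 2 turns
print(matches)                        # 2400 judgments for the 6-model table below
```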
> python gen_judgment.py --mode pairwise-all --model-list [LIST-OF-MODEL-ID] --parallel [num-concurrent-api-call]
> python show_result.py --mode pairwise-all
Input file: data/mt_bench/model_judgment/gpt-4_pair.jsonl
win loss tie win_rate loss_rate
model
gpt-4 617 45 138 0.77125 0.05625
claude-v1 445 115 240 0.55625 0.14375
gpt-3.5-turbo 372 198 230 0.46500 0.24750
vicuna-13b-v1.2 242 310 248 0.30250 0.38750
alpaca-13b 104 515 181 0.13000 0.64375
llama-13b 20 617 163 0.02500 0.77125
How to get GPT-3.5/GPT-4/Claude's answer?
To generate GPT-3.5/4 and Claude's answers, run:
python gen_api_answer.py --model [MODEL-NAME]
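For example (this assumes the corresponding API key, e.g. OPENAI_API_KEY or ANTHROPIC_API_KEY, is set in your environment):
python gen_api_answer.py --model gpt-3.5-turbo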
Agreement Computation
We released 3.3K human annotations for model responses generated by 6 models in response to 80 MT-bench questions. The dataset is available at lmsys/mt_bench_human_judgments. You can use this data to compute the agreement between human and GPT-4.
Download data
wget https://huggingface.co/datasets/lmsys/mt_bench_human_judgments/resolve/main/human_judgments.json
wget https://huggingface.co/datasets/lmsys/mt_bench_human_judgments/resolve/main/gpt4_pair_judgments.json
Compute the agreement between human and GPT-4
python compute_agreement.py --judges gpt4-pair human --votefiles human_judgments.json gpt4_pair_judgments.json
Release Plan
Our current release contains:
- The MT-bench questions in data/mt_bench/question.jsonl.
- The model answers and GPT-4 judgments available on Google Drive.
- The judge prompts in data/judge_prompts.jsonl.
- The 3.3K expert-level human annotations at lmsys/mt_bench_human_judgments.
The next release will include:
- All data
- 30K arena conversations with human votes
- Other code
Citation
If you find the repository helpful for your study, please consider citing the following paper: "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena":
@misc{zheng2023judging,
  title={Judging LLM-as-a-judge with MT-Bench and Chatbot Arena},
  author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric P. Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica},
  year={2023},
  eprint={2306.05685},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}