---
license: mit
datasets:
- openai/summarize_from_feedback
- openai/webgpt_comparisons
- Dahoas/instruct-synthetic-prompt-responses
- Anthropic/hh-rlhf
- lmsys/chatbot_arena_conversations
- openbmb/UltraFeedback
metrics:
- accuracy
tags:
- reward_model
- reward-model
- RLHF
- evaluation
- llm
- instruction
- reranking
language:
- en
pipeline_tag: text-generation
---

# Pairwise Reward Model for LLMs (PairRM) from LLM-Blender

- Github: [https://github.com/yuchenlin/LLM-Blender](https://github.com/yuchenlin/LLM-Blender)
- Paper: [https://arxiv.org/abs/2306.02561](https://arxiv.org/abs/2306.02561)
- Space Demo: [https://huggingface.co/spaces/llm-blender/LLM-Blender](https://huggingface.co/spaces/llm-blender/LLM-Blender)

## Introduction

Pairwise Reward Model (PairRM) takes an instruction and a **pair** of output candidates as input, and outputs a score for each candidate to measure their **relative** quality.

Unlike other RMs that encode and score each candidate separately, PairRM takes a pair of candidates and compares them side by side to identify the subtle differences between them.

PairRM can be used to (re-)rank a list of candidate outputs, and can therefore serve as an LLM evaluator to efficiently assess the quality of LLMs in a local environment. PairRM can also enhance decoding through `best-of-n sampling` (i.e., reranking N sampled outputs). Apart from that, PairRM can also be used for RLHF (see Use case 3 below).

## Installation

Since PairRanker contains some custom layers and tokens, we recommend using PairRM with our llm-blender code API.

- First install `llm-blender`
```bash
pip install git+https://github.com/yuchenlin/LLM-Blender.git
```

- Then load PairRM with the following code:
```python
import llm_blender
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM") # load PairRM
```

## Usage

### Use case 1: Compare responses (Quality Evaluator)

- Rank candidate responses with the following function
```python
inputs = ["input1", "input2"]
candidates_texts = [["candidate1 for input1", "candidate2 for input1"],
                    ["candidate1 for input2", "candidate2 for input2"]]
ranks = blender.rank(inputs, candidates_texts, return_scores=False, batch_size=2)
# ranks is a list of ranks where ranks[i][j] represents the rank of candidate-j for input-i
```

- Directly compare two candidate responses
```python
# Split each candidate list into its first and second response
candidates_A = [cands[0] for cands in candidates_texts]
candidates_B = [cands[1] for cands in candidates_texts]
comparison_results = blender.compare(inputs, candidates_A, candidates_B)
# comparison_results is a list of bool, where element[i] denotes whether candidates_A[i] is better than candidates_B[i] for inputs[i]
```

- Directly compare two multi-turn conversations, given that the user's queries in each turn are fixed and the responses are different.
```python
conv1 = [
    {
        "content": "hello",
        "role": "USER"
    },
    {
        "content": "",
        "role": "ASSISTANT"
    },
    ...
]
conv2 = [
    {
        "content": "hello",
        "role": "USER"
    },
    {
        "content": "",
        "role": "ASSISTANT"
    },
    ...
]
comparison_results = blender.compare_conversations([conv1], [conv2])
# comparison_results is a list of bool, where each element denotes whether all the responses in conv1 together are better than those in conv2
```

### Use case 2: Best-of-n sampling (Decoding Enhancing)

**Best-of-n sampling**, also known as rejection sampling, is a strategy to enhance response quality by selecting the response that the reward model ranks highest (learn more at [OpenAI WebGPT section 3.2](https://arxiv.org/pdf/2112.09332.pdf) and [OpenAI Blog](https://openai.com/research/measuring-goodharts-law)).

Best-of-n sampling is an easy way to improve your LLM's outputs with just a few lines of code. An example of applying it to zephyr is as follows.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# `blender` is the Blender instance loaded in the Installation section
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta", device_map="auto")

inputs = [...] # your list of inputs
system_message = {
    "role": "system",
    "content": "You are a friendly chatbot who always responds in the style of a pirate",
}
# Build one chat (system + user message) per input
messages = [
    [
        system_message,
        {"role": "user", "content": _input},
    ]
    for _input in inputs
]
prompts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages]
# Sample n candidates per prompt and keep the one PairRM ranks highest
outputs = blender.best_of_n_generate(model, tokenizer, prompts, n=10)
print("### Prompt:")
print(prompts[0])
print("### best-of-n generations:")
print(outputs[0])
```

### Use case 3: RLHF

PairRM has been trained on various high-quality, large-scale datasets with human preference annotations and shows strong correlation with human preferences at an extremely small model size (0.4B), approaching the performance of GPT-4. We believe PairRM will power the alignment of LLMs in an efficient and effective way. With the `blender.compare()` function, you can easily apply PairRM to popular RLHF toolkits like [trl](https://huggingface.co/docs/trl/index).
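
As a rough illustration of the RLHF use case, the sketch below turns pairwise verdicts into `prompt`/`chosen`/`rejected` records, a layout commonly used for preference-tuning data. The helper `build_preference_pairs` and the toy comparator `prefer_longer` are hypothetical names introduced here and are not part of llm-blender. The sketch only assumes that the comparator returns one boolean per input, as described for `blender.compare` above.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical type alias: any callable shaped like the documented
# blender.compare(inputs, candidates_A, candidates_B) -> list of bool
CompareFn = Callable[[Sequence[str], Sequence[str], Sequence[str]], Sequence[bool]]


def build_preference_pairs(
    inputs: Sequence[str],
    candidates_A: Sequence[str],
    candidates_B: Sequence[str],
    compare_fn: CompareFn,
) -> List[Dict[str, str]]:
    """Convert pairwise verdicts into prompt/chosen/rejected records."""
    verdicts = compare_fn(inputs, candidates_A, candidates_B)
    records = []
    for prompt, a, b, a_is_better in zip(inputs, candidates_A, candidates_B, verdicts):
        # The preferred response becomes "chosen", the other "rejected"
        chosen, rejected = (a, b) if a_is_better else (b, a)
        records.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return records


def prefer_longer(inputs, candidates_A, candidates_B):
    """Toy stand-in for a reward model: prefers the longer response."""
    return [len(a) > len(b) for a, b in zip(candidates_A, candidates_B)]


inputs = ["What is 2 + 2?", "Name a primary color."]
candidates_A = ["4", "Red is a primary color."]
candidates_B = ["2 + 2 equals 4.", "Red"]

pairs = build_preference_pairs(inputs, candidates_A, candidates_B, prefer_longer)
for record in pairs:
    print(record)
```

In practice one would pass `blender.compare` in place of `prefer_longer`. Keeping the comparator as a parameter lets the pairing logic be checked without loading the model.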

**🔥 For more details, see our example Jupyter notebook: [`blender_usage.ipynb`](https://github.com/yuchenlin/LLM-Blender/blob/main/blender_usage.ipynb)**

Learn more in our LLM-Blender Github [README.md](https://github.com/yuchenlin/LLM-Blender#rank-and-fusion)

## Statistics

### Context length

| PairRanker type | Source max length | Candidate max length | Total max length |
|:-----------------:|:-----------------:|----------------------|------------------|
| [pair-ranker](https://huggingface.co/llm-blender/pair-ranker) | 128 | 128 | 384 |
| [PairRM](https://huggingface.co/llm-blender/pair-reward-model/) (This model) | 1224 | 412 | 2048 |

### Performance

PairRM has been trained on various high-quality, large-scale datasets with human preference annotations and shows strong correlation with human preferences at an extremely small model size (0.4B), approaching the performance of GPT-4.

We test pairwise comparison on:
- [Auto-J pairwise testdata](https://github.com/GAIR-NLP/auto-j#pairwise-response-comparison)
- [HHH-alignment](https://huggingface.co/datasets/HuggingFaceH4/hhh_alignment)
- [MT-bench-human-judgements](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments)

#### Auto-J Pairwise test data performance

| Model | Summ | Exam | Code | Rewriting | Crea W | Func W | Comm | NLP | Overall |
|:---------------------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:-----:|:--------:|:---------:|
| Closed-source Models | | | | | | | | | |
| ChatGPT | 33.3 | 40.3 | 36.6 | 31.6 | 48.2 | 40.4 | 47.6 | 45.8 | 42.7 |
| Claude-2 | 30.6 | 36.1 | 41.7 | 34.2 | 48.1 | 42.5 | 40.6 | 48.5 | 42.4 |
| GPT-4 | 59.7 | 51.4 | 69.2 | 58.3 | 66.7 | 60.4 | 58.3 | 65.2 | 61.9 |
| Open-source Models | | | | | | | | | |
| SteamSHP | 33.3 | 29.2 | 26.7 | 33.3 | 40.7 | 31.3 | 51.4 | 51.9 | 40.6 |
| PandaLM | 29.2 | 33.3 | 31.7 | 23.3 | 43.5 | 32.9 | 44.8 | 48.9 | 38.9 |
| LLaMA-2-Chat-13B | 20.8 | 27.8 | 19.2 | 20 | 31.5 | 27.5 | 35.8 | 31.8 | 29 |
| Vicuna-13B-v1.5 | 30.6 | 23.6 | 35 | 28.3 | 36.1 | 37.5 | 45.5 | 39.8 | 37.3 |
| WizardLM-13B-v1.2 | 22.2 | 20.8 | 32.5 | 19.2 | 28.7 | 25.4 | 29.2 | 33 | 27.8 |
| LLAMA-2-chat-70B | 34.7 | 33.3 | 36.7 | 35.8 | 51.4 | 54.2 | 47.2 | 47.7 | 45.9 |
| AUTO-J (13b) | 45.8 | 38.9 | 59.2 | 47.5 | 54.6 | 57.1 | **58** | 57.6 | 54.8 |
| **PairRM (0.4b)** | **56.94** | **52.78** | **58.33** | **55.83** | **61.57** | **59.17** | 57.64 | **62.5** | **59.05** |

#### HHH-Alignment and MT-bench human judgements

| Evaluator LM | HHH ALIGNMENT | | | | | MT BENCH HUMAN JUDG. |
|:-------------------------:|:-------------:|:---------:|:---------:|:--------:|:-----------:|:---------------------:|
| | Help. | Harm. | Hon. | Other | Total Avg. | Human Preference |
| RANDOM | 50 | 50 | 50 | 50 | 50 | 34.26 |
| STANFORDNLP REWARD MODEL | 69.49 | 60.34 | 52.46 | 51.16 | 58.82 | 44.79 |
| ALMOST REWARD MODEL | 74.58 | 67.24 | 78.69 | 86.05 | 76.02 | 49.9 |
| LLAMA2-CHAT 7B | 66.1 | 81.03 | 70.49 | 74.42 | 72.85 | 51.78 |
| LLAMA2-CHAT 13B | 74.58 | 87.93 | 55.74 | 79.07 | 73.76 | 52.34 |
| LLAMA2-CHAT 70B | 66.1 | **89.66** | 67.21 | 74.42 | 74.21 | 53.67 |
| LLAMA2-CHAT 13B+COARSE. | 68.74 | 68.97 | 65.57 | 67.44 | 67.42 | 46.89 |
| GPT-3.5-TURBO-0613 | 76.27 | 87.93 | 67.21 | 86.05 | 78.73 | 57.12 |
| PROMETHEUS 7B | 69.49 | 84.48 | 78.69 | 90.7 | 80.09 | 55.14 |
| PROMETHEUS 13B | 81.36 | 82.76 | 75.41 | 76.74 | 79.19 | 57.72 |
| **PairRM (0.4b)** | **84.75** | 84.48 | **80.33** | **90.7** | **84.62** | **59** |
| GPT-4-0613 | 91.53 | 93.1 | 85.25 | 83.72 | 88.69 | 63.87 |

**While PairRM is an extremely small model (0.4B) based on DeBERTa, its pairwise comparison agreement approaches GPT-4's performance!**

We attribute this to two factors:
- PairRM's model architecture is specifically designed for pairwise comparison through bidirectional attention (see the LLM-Blender paper for more details)
- The high-quality, large-scale human preference annotation data it was trained on (see the training dataset list on this Hugging Face page)

## Citation & Credits

If you are using PairRM in your research, please cite LLM-Blender.

```bibtex
@inproceedings{llm-blender-2023,
    title = "LLM-Blender: Ensembling Large Language Models with Pairwise Comparison and Generative Fusion",
    author = "Jiang, Dongfu and Ren, Xiang and Lin, Bill Yuchen",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)",
    year = "2023"
}
```