---
license: mit
datasets:
- openai/summarize_from_feedback
- openai/webgpt_comparisons
- Dahoas/instruct-synthetic-prompt-responses
- Anthropic/hh-rlhf
- lmsys/chatbot_arena_conversations
- openbmb/UltraFeedback
metrics:
- accuracy
tags:
- reward_model
- reward-model
- RLHF
- evaluation
- llm
- instruction
- reranking
language:
- en
pipeline_tag: text-generation
---
**This is the Hugging Face-compatible version of [llm-blender/PairRM](https://huggingface.co/llm-blender/PairRM)**,
which can be loaded directly with [`DebertaV2PairRM`](https://github.com/yuchenlin/LLM-Blender/blob/main/llm_blender/pair_ranker/pairrm.py):
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
from llm_blender.pair_ranker.pairrm import DebertaV2PairRM
from transformers import AutoTokenizer
from typing import List
pairrm = DebertaV2PairRM.from_pretrained("llm-blender/PairRM-hf", device_map="cuda:0").eval()
tokenizer = AutoTokenizer.from_pretrained('llm-blender/PairRM-hf')
source_prefix = "<|source|>"
cand1_prefix = "<|candidate1|>"
cand2_prefix = "<|candidate2|>"
inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]
def tokenize_pair(sources: List[str], candidate1s: List[str], candidate2s: List[str], source_max_length=1224, candidate_max_length=412):
    ids = []
    assert len(sources) == len(candidate1s) == len(candidate2s)
    max_length = source_max_length + 2 * candidate_max_length
    for i in range(len(sources)):
        source_ids = tokenizer.encode(source_prefix + sources[i], max_length=source_max_length, truncation=True)
        # split the token budget left over by the source evenly between the two candidates
        candidate_max_length = (max_length - len(source_ids)) // 2
        candidate1_ids = tokenizer.encode(cand1_prefix + candidate1s[i], max_length=candidate_max_length, truncation=True)
        candidate2_ids = tokenizer.encode(cand2_prefix + candidate2s[i], max_length=candidate_max_length, truncation=True)
        ids.append(source_ids + candidate1_ids + candidate2_ids)
    encodings = tokenizer.pad({"input_ids": ids}, return_tensors="pt", padding="max_length", max_length=max_length)
    return encodings
encodings = tokenize_pair(inputs, candidates_A, candidates_B)
encodings = {k:v.to(pairrm.device) for k,v in encodings.items()}
outputs = pairrm(**encodings)
logits = outputs.logits.tolist()
comparison_results = outputs.logits > 0
print(logits)
# [1.9003021717071533, -1.2547134160995483]
print(comparison_results)
# tensor([ True, False], device='cuda:0'), which means whether candidate A is better than candidate B for each input
```
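The length handling in `tokenize_pair` can be illustrated without loading a tokenizer. The sketch below is a simplification that works on plain token counts: `split_budget` is a hypothetical helper (not part of `llm-blender`) that mirrors the arithmetic above to show how the two candidates share whatever budget the source leaves unused.

```python
def split_budget(source_len, source_max_length=1224, candidate_max_length=412):
    """Return (source tokens kept, per-candidate token budget).

    Hypothetical helper that mirrors the arithmetic in tokenize_pair;
    it only works on token counts and never calls a tokenizer.
    """
    max_length = source_max_length + 2 * candidate_max_length
    kept = min(source_len, source_max_length)   # the source is truncated first
    per_candidate = (max_length - kept) // 2    # the remainder is split evenly
    return kept, per_candidate


print(split_budget(24))    # short source: each candidate gets more room
print(split_budget(5000))  # long source: each candidate gets the default budget
```

With the default limits, a 24-token source leaves 1012 tokens per candidate, while a source that hits the 1224-token cap leaves 412 tokens per candidate.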
You can also copy the simple definition of [`DebertaV2PairRM`](https://github.com/yuchenlin/LLM-Blender/blob/main/llm_blender/pair_ranker/pairrm.py) into a local file
instead of importing it from the `llm-blender` package.
The above code produces exactly the same results as the following code, which uses the original LLM-Blender wrapper:
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import llm_blender
blender = llm_blender.Blender()
# Load Ranker
blender.loadranker("llm-blender/PairRM") # load ranker checkpoint
inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]
logits = blender.compare(inputs, candidates_A, candidates_B, return_logits=True, mode="[A,B]")
comparison_results = logits > 0
print(logits)
# [ 1.9 -1.255]
print(comparison_results)
# tensor([ True, False], device='cuda:0'), which means whether candidate A is better than candidate B for each input
```
**We still recommend using PairRM through the llm-blender wrapper, as it implements many useful application functions for various scenarios, such as ranking, conversation comparison, and best-of-n sampling.**
You can also easily compare two conversations as follows:
```python
def tokenize_conv_pair(convAs: List[List[dict]], convBs: List[List[dict]]):
    """Compare two conversations by taking USER turns as inputs and ASSISTANT turns as candidates.
    Multi-turn conversation comparison is also supported.
    A conversation has the following format:
        [
            {
                "content": "hello",
                "role": "USER"
            },
            {
                "content": "hi",
                "role": "ASSISTANT"
            },
            ...
        ]
    Args:
        convAs (List[List[dict]]): List of conversations
        convBs (List[List[dict]]): List of conversations
    """
    for c in convAs + convBs:
        assert len(c) % 2 == 0, "Each conversation must have even number of turns"
        assert all([c[i]['role'] == 'USER' for i in range(0, len(c), 2)]), "Each even turn must be USER"
        assert all([c[i]['role'] == 'ASSISTANT' for i in range(1, len(c), 2)]), "Each odd turn must be ASSISTANT"
    # check conversations correctness
    assert len(convAs) == len(convBs), "Number of conversations must be the same"
    for c_a, c_b in zip(convAs, convBs):
        assert len(c_a) == len(c_b), "Number of turns in each conversation must be the same"
        assert all([c_a[i]['content'] == c_b[i]['content'] for i in range(0, len(c_a), 2)]), "USER turns must be the same"
    instructions = ["Finish the following coversation in each i-th turn by filling in <Response i> with your response."] * len(convAs)
    # USER turns become the source text, with a placeholder for each ASSISTANT turn
    inputs = [
        "\n".join([
            "USER: " + x[i]['content'] +
            f"\nAssistant: <Response {i//2+1}>" for i in range(0, len(x), 2)
        ]) for x in convAs
    ]
    # ASSISTANT turns of each conversation become the two candidate texts
    cand1_texts = [
        "\n".join([
            f"<Response {i//2+1}>: " + x[i]['content'] for i in range(1, len(x), 2)
        ]) for x in convAs
    ]
    cand2_texts = [
        "\n".join([
            f"<Response {i//2+1}>: " + x[i]['content'] for i in range(1, len(x), 2)
        ]) for x in convBs
    ]
    inputs = [inst + inp for inst, inp in zip(instructions, inputs)]
    encodings = tokenize_pair(inputs, cand1_texts, cand2_texts)
    return encodings
```
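To make the conversation format more concrete, the following sketch applies the same string formatting as `tokenize_conv_pair` to a single toy two-turn conversation. It leaves out the instruction prefix and the tokenizer, and the conversation content is made up for illustration.

```python
conv = [
    {"content": "hello", "role": "USER"},
    {"content": "hi", "role": "ASSISTANT"},
    {"content": "how are you?", "role": "USER"},
    {"content": "fine, thanks", "role": "ASSISTANT"},
]

# USER turns form the source text, with a numbered placeholder per response.
source = "\n".join(
    "USER: " + conv[i]["content"] + f"\nAssistant: <Response {i // 2 + 1}>"
    for i in range(0, len(conv), 2)
)
# ASSISTANT turns form the candidate text, labelled with the same numbers.
candidate = "\n".join(
    f"<Response {i // 2 + 1}>: " + conv[i]["content"]
    for i in range(1, len(conv), 2)
)

print(source)
print(candidate)
```

The source text contains one `<Response i>` placeholder per USER turn, and the candidate text fills in each placeholder with the matching ASSISTANT turn.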
# Pairwise Reward Model for LLMs (PairRM) from LLM-Blender
- GitHub: [https://github.com/yuchenlin/LLM-Blender](https://github.com/yuchenlin/LLM-Blender)
- Paper: [https://arxiv.org/abs/2306.02561](https://arxiv.org/abs/2306.02561)
- Space Demo: [https://huggingface.co/spaces/llm-blender/LLM-Blender](https://huggingface.co/spaces/llm-blender/LLM-Blender)
## Introduction
Pairwise Reward Model (PairRM) takes an instruction and a **pair** of output candidates as input,
and outputs a score for each candidate to measure their **relative** quality.
PairRM can be used to (re-)rank a list of candidate outputs, and thus can serve as an LLM evaluator that efficiently assesses the quality of LLMs in a local environment.
PairRM can also be used to enhance decoding through `best-of-n sampling` (i.e., reranking N sampled outputs).
In addition, PairRM can be used to further align instruction-tuned LLMs with RLHF methods.
Unlike other RMs that encode and score each candidate independently,
PairRM takes a pair of candidates and compares them side by side to identify the subtle differences between them.
PairRM is also based on [`microsoft/deberta-v3-large`](https://huggingface.co/microsoft/deberta-v3-large), so it is very efficient, with only **0.4B** parameters.
We trained PairRM on a diverse collection of six human-preference datasets (see more [here](https://huggingface.co/llm-blender/PairRM#training-datasets)).
PairRM is part of the LLM-Blender project (ACL 2023). Please see our [paper](https://arxiv.org/abs/2306.02561) above to learn more.
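As a rough illustration of how pairwise comparisons can yield a ranking, the sketch below counts pairwise wins for each candidate. This is only one simple aggregation scheme and is not necessarily the one used inside `llm-blender`. Both `rank_by_wins` and `toy_prefers` are hypothetical names, and the toy comparator stands in for a real reward model call.

```python
from itertools import combinations


def rank_by_wins(candidates, prefers):
    """Rank candidates by their number of pairwise wins; rank 1 is best.

    `prefers(a, b)` is a hypothetical stand-in for a pairwise model call
    that returns True when `a` is judged better than `b`.
    """
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        if prefers(candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Sort candidate indices by descending win count, then assign ranks.
    order = sorted(range(len(candidates)), key=lambda k: -wins[k])
    ranks = [0] * len(candidates)
    for position, k in enumerate(order, start=1):
        ranks[k] = position
    return ranks


def toy_prefers(a, b):
    """Toy rule for illustration only: the longer answer wins."""
    return len(a) > len(b)


print(rank_by_wins(["get out!", "hi! I am fine, thanks!", "bye!"], toy_prefers))
# [2, 1, 3]
```

Comparing every pair costs N(N-1)/2 comparisons for N candidates, which is why a small and fast pairwise model matters for this kind of ranking.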
## Installation
- First, install `llm-blender`:
```bash
pip install git+https://github.com/yuchenlin/LLM-Blender.git
```
- Then load PairRM:
```python
import llm_blender
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM") # load PairRM
```
## Usage
### Use Case 1: Comparing/Ranking output candidates given an instruction
- Ranking a list of candidate responses
```python
inputs = ["hello, how are you!", "I love you!"]
candidates_texts = [["get out!", "hi! I am fine, thanks!", "bye!"],
                    ["I love you too!", "I hate you!", "Thanks! You're a good guy!"]]
ranks = blender.rank(inputs, candidates_texts, return_scores=False, batch_size=1)
# ranks is a list of ranks
# ranks[i][j] represents the rank of candidate-j for input-i
"""
ranks -->
array([[3, 1, 2], # it means "hi! I am fine, thanks!" ranks 1st, "bye!" ranks 2nd, and "get out!" ranks 3rd.
       [1, 3, 2]], # it means "I love you too!" ranks 1st, and "I hate you!" ranks 3rd.
      dtype=int32)
"""
```
- Directly comparing two candidate responses
```python
inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]
comparison_results = blender.compare(inputs, candidates_A, candidates_B)
# comparison_results is a list of bool, where comparison_results[i] denotes
# whether candidates_A[i] is better than candidates_B[i] for inputs[i]
# Example: comparison_results[0]--> True
```
<details><summary> Comparing two multi-turn conversations. </summary>
```python
conv1 = [
    {
        "content": "hello",
        "role": "USER"
    },
    {
        "content": "[assistant1's response 1]",
        "role": "ASSISTANT"
    },
    ...
]
conv2 = [
    {
        "content": "hello",
        "role": "USER"
    },
    {
        "content": "[assistant2's response 1]",
        "role": "ASSISTANT"
    },
    ...
]
comparison_results = blender.compare_conversations([conv1], [conv2])
# comparison_results is a list of bool, where each element denotes whether all the responses in conv1 together are better than those in conv2
```
</details>
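Once ranks are available, picking the top candidate for each input takes only a line of plain Python. The sketch below reuses the candidate texts and the rank values shown in the ranking example of this use case. The ranks are hard-coded here rather than computed, so the snippet runs without loading the model.

```python
candidates_texts = [["get out!", "hi! I am fine, thanks!", "bye!"],
                    ["I love you too!", "I hate you!", "Thanks! You're a good guy!"]]
ranks = [[3, 1, 2], [1, 3, 2]]  # copied from the ranking example output

# Rank 1 is best, so pick the candidate whose rank is the smallest.
best = [texts[row.index(min(row))] for texts, row in zip(candidates_texts, ranks)]
print(best)
# ['hi! I am fine, thanks!', 'I love you too!']
```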
### Use Case 2: Best-of-n Sampling (Decoding Enhancement)
**Best-of-n sampling**, also known as rejection sampling, is a strategy to enhance response quality by selecting the response that the reward model ranks highest
(see more in [OpenAI WebGPT section 3.2](https://arxiv.org/pdf/2112.09332.pdf) and [OpenAI Blog](https://openai.com/research/measuring-goodharts-law)).
Best-of-n sampling with PairRM is a very easy way to improve your LLMs with only a few changes to your inference code:
```python
# loading models
import llm_blender
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta", device_map="auto")
system_message = {"role": "system", "content": "You are a friendly chatbot."}
# formatting your inputs
inputs = ["can you tell me a joke about OpenAI?"]
messages = [[system_message, {"role": "user", "content": _input}] for _input in inputs]
prompts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages]
# Conventional generation method
input_ids = tokenizer(prompts[0], return_tensors="pt").input_ids.to(model.device)
sampled_outputs = model.generate(input_ids, do_sample=True, top_k=50, top_p=0.95, num_return_sequences=1)
print(tokenizer.decode(sampled_outputs[0][len(input_ids[0]):], skip_special_tokens=False))
# --> The output could be a bad case such as a very short one, e.g., `Sure`
# PairRM for best-of-n sampling
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM") # load ranker checkpoint
outputs = blender.best_of_n_generate(model, tokenizer, prompts, n=10)
print("### Prompt:\n", prompts[0])
print("### best-of-n generations:\n", outputs[0])
# --> The output will be much more stable and consistently better than single sampling, for example:
"""
Sure, here's a joke about OpenAI:
Why did OpenAI decide to hire a mime as their new AI researcher?
Because they wanted someone who could communicate complex ideas without making a sound!
(Note: This is a joke, not a reflection of OpenAI's actual hiring practices.)
"""
```
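The selection step of best-of-n sampling can be sketched independently of any model. The example below keeps the winner of a single-pass pairwise tournament over already-sampled outputs. It does not claim to reproduce the internals of `blender.best_of_n_generate`. `best_of_n` and `toy_prefers` are hypothetical names, and the toy comparator stands in for a pairwise reward model.

```python
def best_of_n(candidates, prefers):
    """Return the candidate that survives a single-pass pairwise tournament.

    `prefers(a, b)` is a hypothetical stand-in for a pairwise reward model
    call that returns True when `a` is judged better than `b`.
    """
    best = candidates[0]
    for challenger in candidates[1:]:
        if prefers(challenger, best):
            best = challenger
    return best


def toy_prefers(a, b):
    """Toy rule for illustration only: the longer answer wins."""
    return len(a) > len(b)


samples = ["Sure", "Sure, here's a joke.", "Sure, here's a longer joke with a punchline."]
print(best_of_n(samples, toy_prefers))
```

A single pass needs only n-1 comparisons, at the cost of assuming that the comparator's preferences are reasonably consistent across pairs.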
### Use Case 3: RLHF
PairRM has been trained on various high-quality, large-scale datasets with human preference annotations
and shows strong correlation with human preferences despite its extremely small model size (0.4B),
approaching the performance of GPT-4.
We expect PairRM to help make the future alignment of LLMs more efficient and effective.
With the `blender.compare()` function, you can apply PairRM to popular RLHF toolkits such as [trl](https://huggingface.co/docs/trl/index).
**🔥 See our example Jupyter notebook for more details: [`blender_usage.ipynb`](https://github.com/yuchenlin/LLM-Blender/blob/main/blender_usage.ipynb)**
Learn more in our LLM-Blender GitHub [README.md](https://github.com/yuchenlin/LLM-Blender#rank-and-fusion)
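One possible way to feed comparison results into a preference-based training pipeline is to convert them into chosen/rejected pairs. The sketch below is an assumption-laden illustration: `to_preference_pairs` is a hypothetical helper, and the `prompt`/`chosen`/`rejected` field names are chosen to resemble the record format commonly used for preference optimization, so please check the documentation of your toolkit for the exact format it expects. The comparison results are hard-coded so the snippet runs without the model.

```python
def to_preference_pairs(prompts, cands_a, cands_b, a_is_better):
    """Turn pairwise comparison results into prompt/chosen/rejected records."""
    records = []
    for prompt, a, b, a_wins in zip(prompts, cands_a, cands_b, a_is_better):
        chosen, rejected = (a, b) if a_wins else (b, a)
        records.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return records


inputs = ["hello!", "I love you!"]
candidates_A = ["hi!", "I hate you!"]
candidates_B = ["f**k off!", "I love you, too!"]
comparison_results = [True, False]  # e.g. what blender.compare(...) reports for these pairs

pairs = to_preference_pairs(inputs, candidates_A, candidates_B, comparison_results)
for record in pairs:
    print(record)
```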
## Statistics
### Context length
| PairRanker type | Source max length | Candidate max length | Total max length |
|:-----------------:|:-----------------:|----------------------|------------------|
| [pair-ranker](https://huggingface.co/llm-blender/pair-ranker) (our previous version) | 128 | 128 | 384 |
| [PairRM](https://huggingface.co/llm-blender/pair-reward-model/) (This model) | 1224 | 412 | 2048 |
### Training Datasets
- [openai/summarize_from_feedback](https://huggingface.co/datasets/openai/summarize_from_feedback)
- [openai/webgpt_comparisons](https://huggingface.co/datasets/openai/webgpt_comparisons)
- [Dahoas/instruct-synthetic-prompt-responses](https://huggingface.co/datasets/Dahoas/instruct-synthetic-prompt-responses)
- [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)
- [lmsys/chatbot_arena_conversations](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations)
- [openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback)
### Performance
PairRM has been trained on various high-quality, large-scale datasets with human preference annotations and exhibits strong correlation with human preferences
despite its extremely small model size (0.4B), approaching the performance of GPT-4.
We test pairwise comparison on:
- [Auto-J pairwise testdata](https://github.com/GAIR-NLP/auto-j#pairwise-response-comparison)
- [HHH-alignment](https://huggingface.co/datasets/HuggingFaceH4/hhh_alignment)
- [MT-bench-human-judgements](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments)
All following results are reported as pairwise comparison accuracies (agreements).
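For reference, pairwise comparison accuracy can be computed as the share of pairs on which the model's preference matches the human label. The sketch below assumes binary labels without ties, which is a simplification, since some benchmarks also include tie labels. `pairwise_accuracy` is a hypothetical helper and the labels are toy values, not benchmark data.

```python
def pairwise_accuracy(predicted_a_better, human_a_better):
    """Percentage of pairs where the model agrees with the human label."""
    assert len(predicted_a_better) == len(human_a_better)
    agree = sum(p == h for p, h in zip(predicted_a_better, human_a_better))
    return 100.0 * agree / len(human_a_better)


# Toy labels for illustration only; they are not taken from any benchmark.
model_labels = [True, False, True, True]
human_labels = [True, True, True, False]
print(pairwise_accuracy(model_labels, human_labels))  # 50.0
```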
#### Auto-J Pairwise test data performance
| Model | Summarization | Exam | Code | Rewriting | Creative Writing | Functional Writing | Communication | NLP | Overall |
|:---------------------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:-----:|:--------:|:---------:|
| Closed-source Models |
| ChatGPT | 33.3 | 40.3 | 36.6 | 31.6 | 48.2 | 40.4 | 47.6 | 45.8 | 42.7 |
| Claude-2 | 30.6 | 36.1 | 41.7 | 34.2 | 48.1 | 42.5 | 40.6 | 48.5 | 42.4 |
| GPT-4 | 59.7 | 51.4 | 69.2 | 58.3 | 66.7 | 60.4 | 58.3 | 65.2 | 61.9 |
| Open-source Models |
| SteamSHP | 33.3 | 29.2 | 26.7 | 33.3 | 40.7 | 31.3 | 51.4 | 51.9 | 40.6 |
| PandaLM | 29.2 | 33.3 | 31.7 | 23.3 | 43.5 | 32.9 | 44.8 | 48.9 | 38.9 |
| LLaMA-2-Chat-13B | 20.8 | 27.8 | 19.2 | 20 | 31.5 | 27.5 | 35.8 | 31.8 | 29 |
| Vicuna-13B-v1.5 | 30.6 | 23.6 | 35 | 28.3 | 36.1 | 37.5 | 45.5 | 39.8 | 37.3 |
| WizardLM-13B-v1.2 | 22.2 | 20.8 | 32.5 | 19.2 | 28.7 | 25.4 | 29.2 | 33 | 27.8 |
| LLaMA-2-Chat-70B | 34.7 | 33.3 | 36.7 | 35.8 | 51.4 | 54.2 | 47.2 | 47.7 | 45.9 |
| Auto-J (13B) | 45.8 | 38.9 | **59.2** | 47.5 | 54.6 | 57.1 | **58** | 57.6 | 54.8 |
| UltraRM (13B) | 56.94 | 43.06 | 55.0 | 53.33 | **67.13** | **64.17** | 56.25 | 59.85 | **59.85** |
| **PairRM (0.4B)** | **56.94** | **52.78** | 58.33 | **55.83** | 61.57 | 59.17 | 57.64 | **62.5** | 59.05 |
#### HHH-Alignment and MT-bench human judgements
| Evaluator LM | HHH: Helpful | HHH: Harmless | HHH: Honest | HHH: Other | HHH: Total Avg. | MT-Bench Human Judgments: Human Preference |
|:-------------------------:|:-------------:|:---------:|:---------:|:--------:|:-----------:|:---------------------:|
| RANDOM | 50 | 50 | 50 | 50 | 50 | 34.26 |
| STANFORDNLP REWARD MODEL | 69.49 | 60.34 | 52.46 | 51.16 | 58.82 | 44.79 |
| ALMOST REWARD MODEL | 74.58 | 67.24 | 78.69 | 86.05 | 76.02 | 49.9 |
| LLAMA2-CHAT 7B | 66.1 | 81.03 | 70.49 | 74.42 | 72.85 | 51.78 |
| LLAMA2-CHAT 13B | 74.58 | 87.93 | 55.74 | 79.07 | 73.76 | 52.34 |
| LLAMA2-CHAT 70B | 66.1 | **89.66** | 67.21 | 74.42 | 74.21 | 53.67 |
| LLAMA2-CHAT 13B+COARSE. | 68.74 | 68.97 | 65.57 | 67.44 | 67.42 | 46.89 |
| GPT-3.5-TURBO-0613 | 76.27 | 87.93 | 67.21 | 86.05 | 78.73 | 57.12 |
| PROMETHEUS 7B | 69.49 | 84.48 | 78.69 | 90.7 | 80.09 | 55.14 |
| PROMETHEUS 13B | 81.36 | 82.76 | 75.41 | 76.74 | 79.19 | 57.72 |
| UltraRM (13B) | **86.44** | 79.31 | **81.97** | 88.37 | 83.71 | 56 |
| **PairRM (0.4B)** | 84.75 | 84.48 | 80.33 | **90.7** | **84.62** | **59** |
| GPT-4-0613 | 91.53 | 93.1 | 85.25 | 83.72 | 88.69 | 63.87 |
**Although PairRM is an extremely small model (0.4B) based on DeBERTa, its pairwise comparison agreement approaches GPT-4's performance!**
We attribute this to two factors:
- PairRM's architecture is specifically designed for pairwise comparison through bidirectional attention (see the LLM-Blender paper for more details).
- The high-quality, large-scale human preference annotation data it was trained on (see the training dataset list on this Hugging Face page).
## Citation & Credits
If you use PairRM in your research, please cite LLM-Blender.
```bibtex
@inproceedings{llm-blender-2023,
title = "LLM-Blender: Ensembling Large Language Models with Pairwise Comparison and Generative Fusion",
author = "Jiang, Dongfu and Ren, Xiang and Lin, Bill Yuchen",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)",
year = "2023"
}
```