Dongfu Jiang committed on
Commit
0ef6e21
1 Parent(s): bb45a4c

Update README.md

  1. README.md +50 -1
README.md CHANGED
@@ -36,6 +36,54 @@ Inspired by [DeBERTa Reward Model Series](https://huggingface.co/OpenAssistant/r
36
  | [PairRM](https://huggingface.co/llm-blender/pair-reward-model/) (This model) | 1224 | 412 | 2048 |
37
 
38
  ### Performance
39
 
40
 
41
  ## Usage Example
@@ -133,7 +181,8 @@ print(outputs[0])
133
  ```
134
 
135
  ### Use case 3: RLHF
136
- PairRM has been trained on various high-quality, large-scale datasets with human preference annotations and correlates strongly with human preferences despite an extremely small model size (0.4B), approaching the performance of GPT-4. (See detailed comparison in 🤗[PairRM](https://huggingface.co/llm-blender/PairRM))
 
137
  With a `blender.compare()` function, you can easily apply PairRM to popular RLHF toolkits like [trl](https://huggingface.co/docs/trl/index).
138
 
139
  **🔥 Check more details on our example jupyter notebook usage: [`blender_usage.ipynb`](https://github.com/yuchenlin/LLM-Blender/blob/main/blender_usage.ipynb)**
 
36
  | [PairRM](https://huggingface.co/llm-blender/pair-reward-model/) (This model) | 1224 | 412 | 2048 |
37
 
38
  ### Performance
39
+ PairRM has been trained on various high-quality, large-scale datasets with human preference annotations and correlates strongly with human preferences
40
+ with an extremely small model size (0.4B), approaching the performance of GPT-4.
41
+
42
+ We test pairwise comparison on the following datasets:
43
+ - [Auto-J pairwise testdata](https://github.com/GAIR-NLP/auto-j#pairwise-response-comparison)
44
+ - [HHH-alignment](https://huggingface.co/datasets/HuggingFaceH4/hhh_alignment)
45
+ - [MT-bench-human-judgements](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments)
46
+
47
+ #### Auto-J Pairwise test data performance
48
+
49
+ | Model | Summ | Exam | Code | Rewriting | Crea W | Func W | Comm | NLP | Overall |
50
+ |:---------------------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:-----:|:--------:|:---------:|
51
+ | Closed-source Models | | | | | | | | | |
52
+ | ChatGPT | 33.3 | 40.3 | 36.6 | 31.6 | 48.2 | 40.4 | 47.6 | 45.8 | 42.7 |
53
+ | Claude-2 | 30.6 | 36.1 | 41.7 | 34.2 | 48.1 | 42.5 | 40.6 | 48.5 | 42.4 |
54
+ | GPT-4 | 59.7 | 51.4 | 69.2 | 58.3 | 66.7 | 60.4 | 58.3 | 65.2 | 61.9 |
55
+ | Open-source Models | | | | | | | | | |
56
+ | SteamSHP | 33.3 | 29.2 | 26.7 | 33.3 | 40.7 | 31.3 | 51.4 | 51.9 | 40.6 |
57
+ | PandaLM | 29.2 | 33.3 | 31.7 | 23.3 | 43.5 | 32.9 | 44.8 | 48.9 | 38.9 |
58
+ | LLaMA-2-Chat-13B | 20.8 | 27.8 | 19.2 | 20 | 31.5 | 27.5 | 35.8 | 31.8 | 29 |
59
+ | Vicuna-13B-v1.5 | 30.6 | 23.6 | 35 | 28.3 | 36.1 | 37.5 | 45.5 | 39.8 | 37.3 |
60
+ | WizardLM-13B-v1.2 | 22.2 | 20.8 | 32.5 | 19.2 | 28.7 | 25.4 | 29.2 | 33 | 27.8 |
61
+ | LLaMA-2-Chat-70B | 34.7 | 33.3 | 36.7 | 35.8 | 51.4 | 54.2 | 47.2 | 47.7 | 45.9 |
62
+ | AUTO-J | 45.8 | 38.9 | 59.2 | 47.5 | 54.6 | 57.1 | 58 | 57.6 | 54.8 |
63
+ | PairRM | **56.94** | **52.78** | **58.33** | **55.83** | **61.57** | **59.17** | 57.64 | **62.5** | **59.05** |
64
+
65
+ #### HHH-Alignment and MT-bench human judgements
66
+
67
+ | Evaluator LM | HHH ALIGNMENT | | | | | MT BENCH HUMAN JUDG. |
68
+ |:-------------------------:|:-------------:|:---------:|:---------:|:--------:|:-----------:|:---------------------:|
69
+ | | Help. | Harm. | Hon. | Other | Total Avg. | Human Preference |
70
+ | RANDOM | 50 | 50 | 50 | 50 | 50 | 34.26 |
71
+ | STANFORDNLP REWARD MODEL | 69.49 | 60.34 | 52.46 | 51.16 | 58.82 | 44.79 |
72
+ | ALMOST REWARD MODEL | 74.58 | 67.24 | 78.69 | 86.05 | 76.02 | 49.9 |
73
+ | LLAMA2-CHAT 7B | 66.1 | 81.03 | 70.49 | 74.42 | 72.85 | 51.78 |
74
+ | LLAMA2-CHAT 13B | 74.58 | 87.93 | 55.74 | 79.07 | 73.76 | 52.34 |
75
+ | LLAMA2-CHAT 70B | 66.1 | 89.66 | 67.21 | 74.42 | 74.21 | 53.67 |
76
+ | LLAMA2-CHAT 13B+COARSE. | 68.74 | 68.97 | 65.57 | 67.44 | 67.42 | 46.89 |
77
+ | GPT-3.5-TURBO-0613 | 76.27 | 87.93 | 67.21 | 86.05 | 78.73 | 57.12 |
78
+ | PROMETHEUS 7B | 69.49 | 84.48 | 78.69 | 90.7 | 80.09 | 55.14 |
79
+ | PROMETHEUS 13B | 81.36 | 82.76 | 75.41 | 76.74 | 79.19 | 57.72 |
80
+ | PairRM | **84.75** | **84.48** | **80.33** | **90.7** | **84.62** | **59** |
81
+ | GPT-4-0613 | 91.53 | 93.1 | 85.25 | 83.72 | 88.69 | 63.87 |
82
+
83
+ Although PairRM is an extremely small model (0.4B) based on DeBERTa, its pairwise comparison agreement approaches GPT-4's performance.
84
+ We attribute this to two factors:
85
+ - PairRM's model architecture is specifically designed for pairwise comparison through bidirectional attention (see the paper for more details).
86
+ - The high-quality, large-scale human preference annotation data it was trained on (see tags for the list).
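
The agreement figures in the tables above can be read as the share of test pairs on which an evaluator picks the same response as the human annotators. The sketch below shows one way such a score might be computed, assuming binary labels with no ties (the benchmarks themselves may handle ties differently). The function `pairwise_agreement` and its parameter names are illustrative and are not part of LLM-Blender.

```python
from typing import Sequence


def pairwise_agreement(
    predicted_a_wins: Sequence[bool],
    human_a_wins: Sequence[bool],
) -> float:
    """Percentage of pairs where the evaluator agrees with the human label.

    Each element is True when response A is preferred over response B.
    Ties are not modeled here; that is a simplifying assumption.
    """
    if len(predicted_a_wins) != len(human_a_wins):
        raise ValueError("prediction and label lists must have the same length")
    if len(human_a_wins) == 0:
        raise ValueError("at least one pair is required")
    # Count the pairs on which the two judgements coincide.
    matches = sum(p == h for p, h in zip(predicted_a_wins, human_a_wins))
    return 100.0 * matches / len(human_a_wins)


if __name__ == "__main__":
    # Toy labels for illustration only; these are not benchmark data.
    predictions = [True, False, True, True]
    humans = [True, True, True, False]
    print(f"agreement: {pairwise_agreement(predictions, humans):.2f}%")
```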
87
 
88
 
89
  ## Usage Example
 
181
  ```
182
 
183
  ### Use case 3: RLHF
184
+ PairRM has been trained on various high-quality, large-scale datasets with human preference annotations and correlates strongly with human preferences despite an extremely small model size (0.4B), approaching the performance of GPT-4.
185
+ We believe PairRM will power the alignment of LLMs in an efficient and effective way.
186
  With a `blender.compare()` function, you can easily apply PairRM to popular RLHF toolkits like [trl](https://huggingface.co/docs/trl/index).
187
 
188
  **🔥 Check more details on our example jupyter notebook usage: [`blender_usage.ipynb`](https://github.com/yuchenlin/LLM-Blender/blob/main/blender_usage.ipynb)**
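
As a rough illustration of the RLHF use case, a pairwise comparator can turn two sampled responses per prompt into chosen/rejected preference pairs, a layout that preference-optimization trainers commonly consume. The sketch below assumes the comparator takes `(prompts, responses_a, responses_b)` and returns one boolean per example, True when response A is better. `build_preference_pairs` and the stand-in comparator `longer_is_better` are hypothetical names introduced here so the example runs without downloading a model. If `blender.compare()` follows the same calling convention, it could be passed in place of the stand-in.

```python
from typing import Callable, Dict, List, Sequence

# Assumed comparator signature: (prompts, responses_a, responses_b) -> one bool per
# example, True when response A is preferred.
CompareFn = Callable[[Sequence[str], Sequence[str], Sequence[str]], Sequence[bool]]


def build_preference_pairs(
    prompts: Sequence[str],
    responses_a: Sequence[str],
    responses_b: Sequence[str],
    compare_fn: CompareFn,
) -> List[Dict[str, str]]:
    """Label each response pair as chosen/rejected using a pairwise comparator."""
    a_is_better = compare_fn(prompts, responses_a, responses_b)
    pairs = []
    for prompt, resp_a, resp_b, a_wins in zip(prompts, responses_a, responses_b, a_is_better):
        chosen, rejected = (resp_a, resp_b) if a_wins else (resp_b, resp_a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


def longer_is_better(
    prompts: Sequence[str],
    responses_a: Sequence[str],
    responses_b: Sequence[str],
) -> List[bool]:
    """Stand-in comparator for the demo only: prefers the longer response."""
    return [len(a) > len(b) for a, b in zip(responses_a, responses_b)]


if __name__ == "__main__":
    demo = build_preference_pairs(
        ["What is 2 + 2?"],
        ["4"],
        ["2 + 2 equals 4."],
        longer_is_better,
    )
    print(demo)
```

Keeping the comparator as a plain callable means the pairing logic can be unit-tested with a cheap stub before a real reward model is plugged in.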