llm-blender
/

PairRM

@@ -36,6 +36,54 @@ Inspired by [DeBERTa Reward Model Series](https://huggingface.co/OpenAssistant/r
 | [PairRM](https://huggingface.co/llm-blender/pair-reward-model/) (This model) | 1224              | 412                  | 2048             |
 ### Performance
 ## Usage Example
@@ -133,7 +181,8 @@ print(outputs[0])
 ```
 ### Use case 3: RLHF
-PairRM has been trained on various high-quality and large-scale dataset with human preference annotations and exhibits great correlation with human preferences with an extremly small model size (0.4B), approching the performance of GPT-4. (See detailed comparison in 🤗[PairRM](https://huggingface.co/llm-blender/PairRM))
 With a `blender.compare()` function, you can easily apply PairRM to poopular RLHF toolkits like [trl](https://huggingface.co/docs/trl/index).
 **🔥 Check more details on our example jupyter notebook usage: [`blender_usage.ipynb`](https://github.com/yuchenlin/LLM-Blender/blob/main/blender_usage.ipynb)**

 | [PairRM](https://huggingface.co/llm-blender/pair-reward-model/) (This model) | 1224              | 412                  | 2048             |
 ### Performance
+PairRM has been trained on various high-quality and large-scale dataset with human preference annotations and exhibits great correlation with human preferences
+with an extremly small model size (0.4B), approching the performance of GPT-4.
+We test the pairwise comparison on
+- [Auto-J pairwise testdata](https://github.com/GAIR-NLP/auto-j#pairwise-response-comparison)
+- [HHH-alignment](https://huggingface.co/datasets/HuggingFaceH4/hhh_alignment)
+- [MT-bench-human-judgements](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments)
+#### Auto-J Pairwise test data performance
+|         Model         |    Summ   |    Exam   |    Code   | Rewriting |   Crea W  |   Func W  |  Comm |    NLP   |  Overall  |
+|:---------------------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:-----:|:--------:|:---------:|
+| Closed -source Models |           |           |           |           |           |           |       |          |           |
+|        ChatGPT        |    33.3   |    40.3   |    36.6   |    31.6   |    48.2   |    40.4   |  47.6 |   45.8   |    42.7   |
+|       Claude -2       |    30.6   |    36.1   |    41.7   |    34.2   |    48.1   |    42.5   |  40.6 |   48.5   |    42.4   |
+|         GPT -4        |    59.7   |    51.4   |    69.2   |    58.3   |    66.7   |    60.4   |  58.3 |   65.2   |    61.9   |
+|  Open -source Models  |           |           |           |           |           |           |       |          |           |
+|        SteamSHP       |    33.3   |    29.2   |    26.7   |    33.3   |    40.7   |    31.3   |  51.4 |   51.9   |    40.6   |
+|        PandaLM        |    29.2   |    33.3   |    31.7   |    23.3   |    43.5   |    32.9   |  44.8 |   48.9   |    38.9   |
+|   LLaMA -2-Chat -13B  |    20.8   |    27.8   |    19.2   |     20    |    31.5   |    27.5   |  35.8 |   31.8   |     29    |
+|    Vicuna -13B-v1.5   |    30.6   |    23.6   |     35    |    28.3   |    36.1   |    37.5   |  45.5 |   39.8   |    37.3   |
+|   WizardLM -13B-v1.2  |    22.2   |    20.8   |    32.5   |    19.2   |    28.7   |    25.4   |  29.2 |    33    |    27.8   |
+|   LLAMA -2-chat -70B  |    34.7   |    33.3   |    36.7   |    35.8   |    51.4   |    54.2   |  47.2 |   47.7   |    45.9   |
+|       AUTO -J 1       |    45.8   |    38.9   |    59.2   |    47.5   |    54.6   |    57.1   |   58  |   57.6   |    54.8   |
+|         PairRM        | **56.94** | **52.78** | **58.33** | **55.83** | **61.57** | **59.17** | 57.64 | **62.5** | **59.05** |
+#### HHH-Alignment and MT-bench human judgements
+|        Evaluator LM       | HHH ALIGNMENT |           |           |          |             | MT BENCH HUMAN JUDG . |
+|:-------------------------:|:-------------:|:---------:|:---------:|:--------:|:-----------:|:---------------------:|
+|                           |     Help .    |   Harm .  |   Hon .   |   Other  | Total Avg . |    Human Preference   |
+|           RANDOM          |       50      |     50    |     50    |    50    |      50     |         34.26         |
+|  STANFORDNLP REWARD MODEL |     69.49     |   60.34   |   52.46   |   51.16  |    58.82    |         44.79         |
+|    ALMOST REWARD MODEL    |     74.58     |   67.24   |   78.69   |   86.05  |    76.02    |          49.9         |
+|      LLAMA2 -CHAT 7B      |      66.1     |   81.03   |   70.49   |   74.42  |    72.85    |         51.78         |
+|      LLAMA2 -CHAT 13B     |     74.58     |   87.93   |   55.74   |   79.07  |    73.76    |         52.34         |
+|      LLAMA2 -CHAT 70B     |      66.1     |   89.66   |   67.21   |   74.42  |    74.21    |         53.67         |
+| LLAMA2 -CHAT 13B+COARSE . |     68.74     |   68.97   |   65.57   |   67.44  |    67.42    |         46.89         |
+|    GPT -3.5-TURBO -0613   |     76.27     |   87.93   |   67.21   |   86.05  |    78.73    |         57.12         |
+|       PROMETHEUS 7B       |     69.49     |   84.48   |   78.69   |   90.7   |    80.09    |         55.14         |
+|       PROMETHEUS 13B      |     81.36     |   82.76   |   75.41   |   76.74  |    79.19    |         57.72         |
+|           PairRM          |   **84.75**   | **84.48** | **80.33** | **90.7** |  **84.62**  |         **59**        |
+|        GPT -4-0613        |     91.53     |    93.1   |   85.25   |   83.72  |    88.69    |         63.87         |
+While PairRM is a extremely small model (0.4B) based on deberta, the pairwise comparison aggrement performance approches GPT-4's performance!
+Two reasons to attribute:
+- Our PairRM specically designed model arch for pairwise comparison through bidirectional attention (See paper for more details)
+- The high-quality and large-scale human preference annotation data it was train on (see tags for list)
 ## Usage Example
 ```
 ### Use case 3: RLHF
+PairRM has been trained on various high-quality and large-scale dataset with human preference annotations and exhibits great correlation with human preferences with an extremly small model size (0.4B), approching the performance of GPT-4.
+We believe PairRM will power the alignment of LLM in an efficient and effective way.
 With a `blender.compare()` function, you can easily apply PairRM to poopular RLHF toolkits like [trl](https://huggingface.co/docs/trl/index).
 **🔥 Check more details on our example jupyter notebook usage: [`blender_usage.ipynb`](https://github.com/yuchenlin/LLM-Blender/blob/main/blender_usage.ipynb)**