Text Generation · Transformers · Safetensors · English · deberta · reward_model · reward-model · RLHF · evaluation · llm · instruction · reranking · Inference Endpoints
yuchenlin committed on
Commit 345b1ee
1 Parent(s): 7333fb2

Update README.md

Files changed (1)
  1. README.md +55 -52
README.md CHANGED
@@ -28,6 +28,7 @@ Inspired by [DeBERTa Reward Model Series](https://huggingface.co/OpenAssistant/r
 - Paper: [https://arxiv.org/abs/2306.02561](https://arxiv.org/abs/2306.02561)
 - Space Demo: [https://huggingface.co/spaces/llm-blender/LLM-Blender](https://huggingface.co/spaces/llm-blender/LLM-Blender)

+
 ## Statistics

 ### Context length
@@ -36,58 +37,6 @@ Inspired by [DeBERTa Reward Model Series](https://huggingface.co/OpenAssistant/r
 | [pair-ranker](https://huggingface.co/llm-blender/pair-ranker) | 128 | 128 | 384 |
 | [PairRM](https://huggingface.co/llm-blender/pair-reward-model/) (This model) | 1224 | 412 | 2048 |

- ### Performance
- PairRM has been trained on various high-quality, large-scale datasets with human preference annotations and exhibits a strong correlation with human preferences
- despite its extremely small model size (0.4B), approaching the performance of GPT-4.
-
- We test pairwise comparison performance on:
- - [Auto-J pairwise test data](https://github.com/GAIR-NLP/auto-j#pairwise-response-comparison)
- - [HHH-alignment](https://huggingface.co/datasets/HuggingFaceH4/hhh_alignment)
- - [MT-bench-human-judgements](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments)
-
- #### Auto-J Pairwise test data performance
-
- | Model | Summ | Exam | Code | Rewriting | Crea W | Func W | Comm | NLP | Overall |
- |:---------------------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:-----:|:--------:|:---------:|
- | Closed-source Models |
- | ChatGPT | 33.3 | 40.3 | 36.6 | 31.6 | 48.2 | 40.4 | 47.6 | 45.8 | 42.7 |
- | Claude-2 | 30.6 | 36.1 | 41.7 | 34.2 | 48.1 | 42.5 | 40.6 | 48.5 | 42.4 |
- | GPT-4 | 59.7 | 51.4 | 69.2 | 58.3 | 66.7 | 60.4 | 58.3 | 65.2 | 61.9 |
- | Open-source Models |
- | SteamSHP | 33.3 | 29.2 | 26.7 | 33.3 | 40.7 | 31.3 | 51.4 | 51.9 | 40.6 |
- | PandaLM | 29.2 | 33.3 | 31.7 | 23.3 | 43.5 | 32.9 | 44.8 | 48.9 | 38.9 |
- | LLaMA-2-Chat-13B | 20.8 | 27.8 | 19.2 | 20 | 31.5 | 27.5 | 35.8 | 31.8 | 29 |
- | Vicuna-13B-v1.5 | 30.6 | 23.6 | 35 | 28.3 | 36.1 | 37.5 | 45.5 | 39.8 | 37.3 |
- | WizardLM-13B-v1.2 | 22.2 | 20.8 | 32.5 | 19.2 | 28.7 | 25.4 | 29.2 | 33 | 27.8 |
- | LLaMA-2-Chat-70B | 34.7 | 33.3 | 36.7 | 35.8 | 51.4 | 54.2 | 47.2 | 47.7 | 45.9 |
- | Auto-J (13B) | 45.8 | 38.9 | 59.2 | 47.5 | 54.6 | 57.1 | **58** | 57.6 | 54.8 |
- | **PairRM (0.4B)** | **56.94** | **52.78** | **58.33** | **55.83** | **61.57** | **59.17** | 57.64 | **62.5** | **59.05** |
-
- #### HHH-Alignment and MT-bench human judgements
-
- | Evaluator LM | HHH ALIGNMENT | | | | | MT BENCH HUMAN JUDG. |
- |:-------------------------:|:-------------:|:---------:|:---------:|:--------:|:-----------:|:---------------------:|
- | | Help. | Harm. | Hon. | Other | Total Avg. | Human Preference |
- | RANDOM | 50 | 50 | 50 | 50 | 50 | 34.26 |
- | STANFORDNLP REWARD MODEL | 69.49 | 60.34 | 52.46 | 51.16 | 58.82 | 44.79 |
- | ALMOST REWARD MODEL | 74.58 | 67.24 | 78.69 | 86.05 | 76.02 | 49.9 |
- | LLAMA2-CHAT 7B | 66.1 | 81.03 | 70.49 | 74.42 | 72.85 | 51.78 |
- | LLAMA2-CHAT 13B | 74.58 | 87.93 | 55.74 | 79.07 | 73.76 | 52.34 |
- | LLAMA2-CHAT 70B | 66.1 | **89.66** | 67.21 | 74.42 | 74.21 | 53.67 |
- | LLAMA2-CHAT 13B+COARSE. | 68.74 | 68.97 | 65.57 | 67.44 | 67.42 | 46.89 |
- | GPT-3.5-TURBO-0613 | 76.27 | 87.93 | 67.21 | 86.05 | 78.73 | 57.12 |
- | PROMETHEUS 7B | 69.49 | 84.48 | 78.69 | 90.7 | 80.09 | 55.14 |
- | PROMETHEUS 13B | 81.36 | 82.76 | 75.41 | 76.74 | 79.19 | 57.72 |
- | **PairRM (0.4B)** | **84.75** | 84.48 | **80.33** | **90.7** | **84.62** | **59** |
- | GPT-4-0613 | 91.53 | 93.1 | 85.25 | 83.72 | 88.69 | 63.87 |
-
- **While PairRM is an extremely small model (0.4B) based on DeBERTa, its pairwise comparison agreement approaches GPT-4's performance!**
-
- We attribute this to two factors:
- - PairRM's model architecture is specifically designed for pairwise comparison through bidirectional attention (see the LLM-Blender paper for more details).
- - The high-quality, large-scale human preference annotation data it was trained on (see the training dataset list on this Hugging Face page).
-
-
 ## Usage Example

 ### Installation
@@ -192,6 +141,60 @@ With a `blender.compare()` function, you can easily apply PairRM to popular RLH

 Learn more in our LLM-Blender Github [README.md](https://github.com/yuchenlin/LLM-Blender#rank-and-fusion)

+ ### Performance
+ PairRM has been trained on various high-quality, large-scale datasets with human preference annotations and exhibits a strong correlation with human preferences
+ despite its extremely small model size (0.4B), approaching the performance of GPT-4.
+
+ We test pairwise comparison performance on:
+ - [Auto-J pairwise test data](https://github.com/GAIR-NLP/auto-j#pairwise-response-comparison)
+ - [HHH-alignment](https://huggingface.co/datasets/HuggingFaceH4/hhh_alignment)
+ - [MT-bench-human-judgements](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments)
+
+ #### Auto-J Pairwise test data performance
+
+ | Model | Summ | Exam | Code | Rewriting | Crea W | Func W | Comm | NLP | Overall |
+ |:---------------------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:-----:|:--------:|:---------:|
+ | Closed-source Models |
+ | ChatGPT | 33.3 | 40.3 | 36.6 | 31.6 | 48.2 | 40.4 | 47.6 | 45.8 | 42.7 |
+ | Claude-2 | 30.6 | 36.1 | 41.7 | 34.2 | 48.1 | 42.5 | 40.6 | 48.5 | 42.4 |
+ | GPT-4 | 59.7 | 51.4 | 69.2 | 58.3 | 66.7 | 60.4 | 58.3 | 65.2 | 61.9 |
+ | Open-source Models |
+ | SteamSHP | 33.3 | 29.2 | 26.7 | 33.3 | 40.7 | 31.3 | 51.4 | 51.9 | 40.6 |
+ | PandaLM | 29.2 | 33.3 | 31.7 | 23.3 | 43.5 | 32.9 | 44.8 | 48.9 | 38.9 |
+ | LLaMA-2-Chat-13B | 20.8 | 27.8 | 19.2 | 20 | 31.5 | 27.5 | 35.8 | 31.8 | 29 |
+ | Vicuna-13B-v1.5 | 30.6 | 23.6 | 35 | 28.3 | 36.1 | 37.5 | 45.5 | 39.8 | 37.3 |
+ | WizardLM-13B-v1.2 | 22.2 | 20.8 | 32.5 | 19.2 | 28.7 | 25.4 | 29.2 | 33 | 27.8 |
+ | LLaMA-2-Chat-70B | 34.7 | 33.3 | 36.7 | 35.8 | 51.4 | 54.2 | 47.2 | 47.7 | 45.9 |
+ | Auto-J (13B) | 45.8 | 38.9 | 59.2 | 47.5 | 54.6 | 57.1 | **58** | 57.6 | 54.8 |
+ | **PairRM (0.4B)** | **56.94** | **52.78** | **58.33** | **55.83** | **61.57** | **59.17** | 57.64 | **62.5** | **59.05** |
+
+ #### HHH-Alignment and MT-bench human judgements
+
+ | Evaluator LM | HHH ALIGNMENT | | | | | MT BENCH HUMAN JUDG. |
+ |:-------------------------:|:-------------:|:---------:|:---------:|:--------:|:-----------:|:---------------------:|
+ | | Help. | Harm. | Hon. | Other | Total Avg. | Human Preference |
+ | RANDOM | 50 | 50 | 50 | 50 | 50 | 34.26 |
+ | STANFORDNLP REWARD MODEL | 69.49 | 60.34 | 52.46 | 51.16 | 58.82 | 44.79 |
+ | ALMOST REWARD MODEL | 74.58 | 67.24 | 78.69 | 86.05 | 76.02 | 49.9 |
+ | LLAMA2-CHAT 7B | 66.1 | 81.03 | 70.49 | 74.42 | 72.85 | 51.78 |
+ | LLAMA2-CHAT 13B | 74.58 | 87.93 | 55.74 | 79.07 | 73.76 | 52.34 |
+ | LLAMA2-CHAT 70B | 66.1 | **89.66** | 67.21 | 74.42 | 74.21 | 53.67 |
+ | LLAMA2-CHAT 13B+COARSE. | 68.74 | 68.97 | 65.57 | 67.44 | 67.42 | 46.89 |
+ | GPT-3.5-TURBO-0613 | 76.27 | 87.93 | 67.21 | 86.05 | 78.73 | 57.12 |
+ | PROMETHEUS 7B | 69.49 | 84.48 | 78.69 | 90.7 | 80.09 | 55.14 |
+ | PROMETHEUS 13B | 81.36 | 82.76 | 75.41 | 76.74 | 79.19 | 57.72 |
+ | **PairRM (0.4B)** | **84.75** | 84.48 | **80.33** | **90.7** | **84.62** | **59** |
+ | GPT-4-0613 | 91.53 | 93.1 | 85.25 | 83.72 | 88.69 | 63.87 |
+
+ **While PairRM is an extremely small model (0.4B) based on DeBERTa, its pairwise comparison agreement approaches GPT-4's performance!**
+
+ We attribute this to two factors:
+ - PairRM's model architecture is specifically designed for pairwise comparison through bidirectional attention (see the LLM-Blender paper for more details).
+ - The high-quality, large-scale human preference annotation data it was trained on (see the training dataset list on this Hugging Face page).
+
+
+
+
 ## Citation
 If you are using PairRM in your research, please cite LLM-blender.
 ```bibtex
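
The usage section itself falls outside the diff context shown above, but the last hunk header references `blender.compare()`. As a rough illustration only, here is a minimal sketch of how PairRM might be loaded and applied for pairwise comparison via the `llm-blender` package; the `Blender()` constructor, `loadranker()`, and the return format are assumptions taken from the LLM-Blender GitHub README rather than from this commit.

```python
# Minimal sketch (see assumptions above); installation is assumed to be:
#   pip install git+https://github.com/yuchenlin/LLM-Blender.git
import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # assumed: loads this model as the pairwise ranker

inputs = ["Hello! What can you do?"]
candidates_A = ["I can answer questions, draft text, and explain concepts."]
candidates_B = ["idk."]

# Assumed semantics: one result per input, indicating whether candidates_A[i]
# is preferred over candidates_B[i] by the reward model.
results = blender.compare(inputs, candidates_A, candidates_B)
print(results)
```

For the authoritative interface, including ranking more than two candidates per input, refer to the linked LLM-Blender README.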