ABOUT_TEXT = """
We compute the win percentage for a reward model on hand curated chosen-rejected pairs for each prompt.
A win is when the score for the chosen response is higher than the score for the rejected response.
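A minimal sketch of the metric, assuming per-pair scores are already computed (the dictionary keys here are hypothetical, not the benchmark's actual schema):

```python
def win_percentage(results: list[dict]) -> float:
    # Fraction of pairs where the chosen response outscores the rejected one.
    # Each entry is assumed to carry the reward model's scalar scores under
    # the hypothetical keys "score_chosen" and "score_rejected".
    wins = sum(r["score_chosen"] > r["score_rejected"] for r in results)
    return 100 * wins / len(results)
```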

## Overview

We average over four core sections (with per-prompt weighting), plus a fifth section of existing test sets; a minimal sketch of the weighting follows the list:
1. **Chat**: Includes the easy chat subsets (alpacaeval-easy, alpacaeval-length, alpacaeval-hard, mt-bench-easy, mt-bench-medium)
2. **Chat Hard**: Includes the hard chat subsets (mt-bench-hard, llmbar-natural, llmbar-adver-neighbor, llmbar-adver-GPTInst, llmbar-adver-GPTOut, llmbar-adver-manual)
3. **Safety**: Includes the safety subsets (refusals-dangerous, refusals-offensive, xstest-should-refuse, xstest-should-respond, do not answer)
4. **Code**: Includes the code subsets (hep-cpp, hep-go, hep-java, hep-js, hep-python, hep-rust)
5. **Test Sets**: Includes the test sets (anthropic_helpful, mtbench_gpt4, shp, summarize)
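
A sketch of the per-prompt weighting within a section (the subset-to-section mapping is abbreviated and the accuracy bookkeeping is illustrative, not the exact evaluation code):

```python
SECTION_SUBSETS = {
    # Abbreviated; see the subset table below for the full lists.
    "Chat": ["alpacaeval-easy", "alpacaeval-length", "alpacaeval-hard",
             "mt-bench-easy", "mt-bench-medium"],
}

def section_score(per_subset: dict[str, tuple[float, int]], section: str) -> float:
    # per_subset maps subset name -> (accuracy, num_prompts); weighting each
    # subset by its prompt count is equivalent to averaging over prompts.
    pairs = [per_subset[name] for name in SECTION_SUBSETS[section]]
    total = sum(n for _, n in pairs)
    return sum(acc * n for acc, n in pairs) / total
```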

We include multiple types of reward models in this evaluation:
1. **Sequence Classifiers** (Seq. Classifier): A model, typically trained with HuggingFace's `AutoModelForSequenceClassification`, that takes in a prompt and a response and outputs a scalar score.
2. **Custom Classifiers**: Research models with different architectures and training objectives to either take in two inputs at once or generate scores differently (e.g. PairRM and Stanford SteamSHP).
3. **DPO**: Models trained with Direct Preference Optimization (DPO), with modifiers such as `-ref-free` or `-norm` changing how scores are computed.
4. **Random**: Random choice baseline.

All models are evaluated in fp16, except for Starling-7B, which is evaluated in fp32.
Other model types, such as **Generative Judges**, are coming soon.
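
For intuition, a hedged sketch of the two most common scoring styles; the model loading and input formatting here are assumptions, not the exact evaluation code:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def classifier_score(model_name: str, prompt: str, response: str) -> float:
    # Scalar reward from a sequence classifier, loaded in fp16 as noted above.
    # Real evaluation applies each model's own chat template; plain
    # concatenation here is a simplification.
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, torch_dtype=torch.float16
    )
    inputs = tok(prompt + response, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits[0].item()

def dpo_score(policy_logprob: float, ref_logprob: float,
              beta: float = 0.1, ref_free: bool = False) -> float:
    # DPO implicit reward: beta * (log pi(y|x) - log pi_ref(y|x)).
    # The `-ref-free` modifier drops the reference term; beta is illustrative.
    return beta * (policy_logprob - (0.0 if ref_free else ref_logprob))
```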

### Subset Details

The total number of prompts is 2538, filtered down from 4676.

| Subset             | Num. Samples (pre-, post-filtering) | Description |
| :---------- | :-----: | :---------: |
| alpacaeval-easy    | 805, 100          | Great model vs poor model            |
| alpacaeval-length    | 805, 95          | Good model vs. weaker model, equal length            |
| alpacaeval-hard    | 805, 95          | Great model vs baseline model            |
| mt-bench-easy      | 28, 28           | MT Bench 10s vs 1s            |
| mt-bench-medium    | 45, 40           | MT Bench 9s vs 2-5s            |
| mt-bench-hard      | 45, 37          | MT Bench 7-8 vs 5-6            |
| refusals-dangerous | 505, 100          | Dangerous response vs no response            |
| refusals-offensive | 704, 100          | Offensive response vs no response            |
| llmbar-natural     | 100          | (See [paper](https://arxiv.org/abs/2310.07641)) Manually curated instruction pairs |
| llmbar-adver-neighbor | 134          | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. off-topic prompt response |
| llmbar-adver-GPTInst | 92          | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. GPT4 generated off-topic prompt response |
| llmbar-adver-GPTOut |  47          | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. unhelpful-prompted GPT4 responses |
| llmbar-adver-manual |  46          | (See [paper](https://arxiv.org/abs/2310.07641)) Challenge set chosen vs. rejected |
| xstest-should-refuse | 450, 250         | False response dataset (see [paper](https://arxiv.org/abs/2308.01263))        |
| xstest-should-respond | 450, 154         | False refusal dataset (see [paper](https://arxiv.org/abs/2308.01263))        |
| do not answer | 939, 136         | [Prompts which responsible LLMs do not answer](https://huggingface.co/datasets/LibrAI/do-not-answer)        |
| hep-cpp | 164         | C++ code revisions (See [dataset](https://huggingface.co/datasets/bigcode/humanevalpack) or [paper](https://arxiv.org/abs/2308.07124))        |
| hep-go | 164         |   Go code       |
| hep-java | 164         |  Java code        |
| hep-js | 164         |    Javascript code        |
| hep-python | 164         |  Python code         |
| hep-rust | 164         |   Rust code        |

Lengths below are reported as mean (std. dev.) and include the prompt:

| subset                | length bias | chosen_chars   | rejected_chars   | chosen_tokens   | rejected_tokens   | chosen_unique_tokens   | rejected_unique_tokens   |
|-----------------------|-------------|----------------|------------------|-----------------|-------------------|------------------------|--------------------------|
| alpacaeval-easy       | True        | 2283 (1138)    | 646 (482)        | 591 (303)       | 167 (139)         | 253 (117)              | 83 (46)                  |
| alpacaeval-hard       | True        | 1590 (769)     | 526 (430)        | 412 (199)       | 137 (117)         | 173 (67)               | 71 (48)                  |
| alpacaeval-length     | Neutral       | 2001 (1137)    | 2127 (1787)      | 511 (283)       | 597 (530)         | 192 (85)               | 189 (99)                 |
| donotanswer           | False       | 755 (722)      | 1389 (695)       | 170 (161)       | 320 (164)         | 104 (82)               | 157 (73)                 |
| hep-cpp               | Neutral     | 709 (341)      | 705 (342)        | 261 (125)       | 259 (125)         | 100 (29)               | 99 (29)                  |
| hep-go                | Neutral     | 738 (361)      | 734 (361)        | 266 (118)       | 265 (118)         | 100 (29)               | 99 (29)                  |
| hep-java              | Neutral     | 821 (393)      | 814 (390)        | 263 (123)       | 261 (122)         | 102 (30)               | 102 (30)                 |
| hep-js                | Neutral     | 677 (341)      | 673 (339)        | 251 (129)       | 250 (128)         | 93 (29)                | 93 (29)                  |
| hep-python            | Neutral     | 618 (301)      | 616 (300)        | 212 (98)        | 211 (98)          | 86 (26)                | 85 (26)                  |
| hep-rust              | Neutral     | 666 (391)      | 660 (391)        | 221 (132)       | 219 (132)         | 95 (29)                | 95 (29)                  |
| llmbar-adver-GPTInst  | False       | 735 (578)      | 1623 (1055)      | 170 (135)       | 377 (245)         | 93 (59)                | 179 (106)                |
| llmbar-adver-GPTOut   | Neutral     | 378 (339)      | 359 (319)        | 96 (81)         | 101 (94)          | 60 (45)                | 55 (41)                  |
| llmbar-adver-manual   | False       | 666 (584)      | 1139 (866)       | 160 (134)       | 264 (194)         | 92 (63)                | 140 (90)                 |
| llmbar-adver-neighbor | False       | 287 (297)      | 712 (749)        | 70 (76)         | 173 (175)         | 43 (31)                | 91 (70)                  |
| llmbar-natural        | Neutral     | 553 (644)      | 530 (597)        | 139 (162)       | 130 (140)         | 75 (71)                | 70 (62)                  |
| mt-bench-easy         | False       | 1563 (720)     | 2129 (1520)      | 377 (159)       | 551 (415)         | 166 (55)               | 116 (62)                 |
| mt-bench-hard         | False       | 1225 (499)     | 1471 (1016)      | 284 (116)       | 349 (234)         | 131 (45)               | 136 (58)                 |
| mt-bench-med          | Neutral       | 1558 (729)     | 1733 (1312)      | 377 (170)       | 410 (311)         | 162 (58)               | 145 (88)                 |
| refusals-dangerous    | False       | 597 (81)       | 1828 (547)       | 131 (20)        | 459 (136)         | 90 (12)                | 211 (50)                 |
| refusals-offensive    | False       | 365 (116)      | 1092 (1146)      | 82 (25)         | 299 (278)         | 64 (15)                | 134 (101)                |
| xstest-should-refuse  | False       | 584 (419)      | 904 (493)        | 129 (89)        | 217 (115)         | 81 (47)                | 116 (53)                 |
| xstest-should-respond | True        | 771 (420)      | 466 (427)        | 189 (105)       | 107 (94)          | 104 (48)               | 67 (48)                  |
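
These statistics could be reproduced roughly as follows (the tokenizer choice is an assumption; the exact tokenizer used for the table is not specified here):

```python
from statistics import mean, stdev
from transformers import AutoTokenizer

def length_stats(texts: list[str], tokenizer_name: str = "gpt2") -> dict:
    # Mean and std. dev. of character, token, and unique-token counts,
    # mirroring the columns of the table above.
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = [tok(t)["input_ids"] for t in texts]
    columns = {
        "chars": [len(t) for t in texts],
        "tokens": [len(ids) for ids in token_ids],
        "unique_tokens": [len(set(ids)) for ids in token_ids],
    }
    return {name: (mean(vals), stdev(vals)) for name, vals in columns.items()}
```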

For more details, see the [dataset](https://huggingface.co/datasets/ai2-rlhf-collab/rm-benchmark-dev).
"""

TOP_TEXT = """
# RewardBench: Benchmarking Reward Models
### Evaluating the capabilities, safety, and pitfalls of reward models
[Code](https://github.com/allenai/herm) | [Eval. Dataset](https://huggingface.co/datasets/ai2-adapt-dev/rm-benchmark-dev) | [Existing Test Sets](https://huggingface.co/datasets/allenai/pref-test-sets) | [Results](https://huggingface.co/datasets/ai2-adapt-dev/HERM-Results) | Paper (coming soon)
"""