natolambert committed on
Commit
702ff77
1 Parent(s): 7b96731

Update src/md.py

Files changed (1)
  1. src/md.py +13 -8
src/md.py CHANGED
@@ -6,22 +6,27 @@ A win is when the score for the chosen response is higher than the score for the
 
 | Subset | Num. Samples (Pre-filtering, post-filtering) | Description |
 | :--------------------- | :------------------------------------------: | :---------------------------------------------------------------- |
-| alpacaeval-easy | 805 | Great model vs poor model |
-| alpacaeval-length | 805 | Good model vs low model, equal length |
-| alpacaeval-hard | 805 | Great model vs baseline model |
+| alpacaeval-easy | 805, 100 | Great model vs poor model |
+| alpacaeval-length | 805, 95 | Good model vs low model, equal length |
+| alpacaeval-hard | 805, 95 | Great model vs baseline model |
 | mt-bench-easy | 28, 28 | MT Bench 10s vs 1s |
 | mt-bench-medium | 45, 40 | MT Bench 9s vs 2-5s |
 | mt-bench-hard | 45, 37 | MT Bench 7-8 vs 5-6 |
-| refusals-dangerous | 505 | Dangerous response vs no response |
-| refusals-offensive | 704 | Offensive response vs no response |
+| refusals-dangerous | 505, 100 | Dangerous response vs no response |
+| refusals-offensive | 704, 100 | Offensive response vs no response |
 | llmbar-natural | 100 | (See [paper](https://arxiv.org/abs/2310.07641)) Manually curated instruction pairs |
 | llmbar-adver-neighbor | 134 | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. off-topic prompt response |
 | llmbar-adver-GPTInst | 92 | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. GPT4 generated off-topic prompt response |
 | llmbar-adver-GPTOut | 47 | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. unhelpful-prompted GPT4 responses |
 | llmbar-adver-manual | 46 | (See [paper](https://arxiv.org/abs/2310.07641)) Challenge set chosen vs. rejected |
-| XSTest | 450 | TODO curate |
-| (?) repetitiveness | | |
-| (?) grammar | | |
+| XSTest | 450, 404 | False refusal dataset (see [paper](https://arxiv.org/abs/2308.01263)) |
+| do not answer | 939, 136 | [Prompts which responsible LLMs do not answer](https://huggingface.co/datasets/LibrAI/do-not-answer) |
+| hep-cpp | 164 | C++ code revisions (See [dataset](https://huggingface.co/datasets/bigcode/humanevalpack) or [paper](https://arxiv.org/abs/2308.07124)) |
+| hep-go | 164 | Go code |
+| hep-java | 164 | Java code |
+| hep-js | 164 | Javascript code |
+| hep-python | 164 | Python code |
+| hep-rust | 164 | Rust code |
 
 
 For more details, see the [dataset](https://huggingface.co/datasets/ai2-rlhf-collab/rm-benchmark-dev).
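
The hunk context above defines a win as the chosen response receiving a higher reward-model score than the rejected one. As a rough illustration (not part of this commit), per-subset win rates could be tallied as in the sketch below; the record layout and the field names `subset`, `score_chosen`, and `score_rejected` are assumptions for the example, not confirmed columns of the benchmark dataset.

```python
# Minimal sketch: count "wins" per subset, where a win means the reward-model
# score of the chosen response exceeds that of the rejected response.
# Field names (subset, score_chosen, score_rejected) are illustrative assumptions.
from collections import defaultdict


def per_subset_win_rate(records):
    """records: iterable of dicts like
    {"subset": "alpacaeval-easy", "score_chosen": 1.3, "score_rejected": -0.2}"""
    wins, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["subset"]] += 1
        if r["score_chosen"] > r["score_rejected"]:  # strict inequality counts as a win
            wins[r["subset"]] += 1
    return {subset: wins[subset] / totals[subset] for subset in totals}


if __name__ == "__main__":
    demo = [
        {"subset": "mt-bench-easy", "score_chosen": 0.9, "score_rejected": 0.1},
        {"subset": "mt-bench-easy", "score_chosen": 0.2, "score_rejected": 0.5},
    ]
    print(per_subset_win_rate(demo))  # {'mt-bench-easy': 0.5}
```

In practice the records would come from scoring the chosen/rejected pairs in the linked dataset (for example, loaded with `datasets.load_dataset("ai2-rlhf-collab/rm-benchmark-dev")`) using whatever reward model is being evaluated.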