ABOUT_TEXT = """
We compute the win percentage of a reward model on hand-curated chosen-rejected pairs for each prompt.
A win is counted when the reward model scores the chosen response higher than the rejected response.
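
For illustration only, here is a minimal sketch of that computation; the function and variable names are hypothetical and this is not the benchmark's actual evaluation code:

```python
# Sketch only: percent of pairs where the chosen response outscores the rejected one.
def win_percentage(chosen_scores, rejected_scores):
    wins = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return 100 * wins / len(chosen_scores)
```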

### Subset summary

| Subset                 | Num. Samples (pre-filtering, post-filtering) | Description                                                       |
| :--------------------- | :------------------------------------------: | :---------------------------------------------------------------- |
| alpacaeval-easy        |                     805                     | Great model vs poor model                                         |
| alpacaeval-length      |                     805                     | Good model vs low-quality model, equal length                     |
| alpacaeval-hard        |                     805                     | Great model vs baseline model                                     |
| mt-bench-easy          |                  28, 28                    | MT Bench 10s vs 1s                                                |
| mt-bench-medium        |                  45, 40                    | MT Bench 9s vs 2-5s                                               |
| mt-bench-hard          |                  45, 37                    | MT Bench 7-8s vs 5-6s                                             |
| refusals-dangerous     |                     505                     | Dangerous response vs no response                                 |
| refusals-offensive     |                     704                     | Offensive response vs no response                                 |
| llmbar-natural         |                     100                     | (See [paper](https://arxiv.org/abs/2310.07641)) Manually curated instruction pairs |
| llmbar-adver-neighbor  |                     134                     | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. off-topic prompt response |
| llmbar-adver-GPTInst   |                     92                      | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. GPT4 generated off-topic prompt response |
| llmbar-adver-GPTOut    |                     47                      | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. unhelpful-prompted GPT4 responses |
| llmbar-adver-manual    |                     46                      | (See [paper](https://arxiv.org/abs/2310.07641)) Challenge set chosen vs. rejected |
| XSTest                 |                     450                     | TODO curate                                                       |
| (?) repetitiveness     |                                               |                                                                   |
| (?) grammar            |                                               |                                                                   |


For more details, see the [dataset](https://huggingface.co/datasets/ai2-rlhf-collab/rm-benchmark-dev).
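
As a rough sketch, the dataset can be loaded with the Hugging Face `datasets` library; the exact split/subset layout is not shown here and access may require authentication, so check the dataset card for specifics:

```python
from datasets import load_dataset

# Sketch only: pull the benchmark repo and inspect its splits and columns.
dataset = load_dataset("ai2-rlhf-collab/rm-benchmark-dev")
print(dataset)
```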
"""