ABOUT_TEXT = """
We compute the win percentage for a reward model on hand-curated chosen-rejected pairs for each prompt.
A win is counted when the reward model scores the chosen response higher than the rejected response.

### Subset summary

| Subset                 | Num. Samples (Pre-filtering, post-filtering) | Description                                                       |
| :--------------------- | :------------------------------------------: | :---------------------------------------------------------------- |
| alpacaeval-easy        |                   805, 100                   | Great model vs poor model                                         |
| alpacaeval-length      |                   805, 95                    | Good model vs low model, equal length                             |
| alpacaeval-hard        |                   805, 95                    | Great model vs baseline model                                     |
| mt-bench-easy          |                    28, 28                    | MT Bench 10s vs 1s                                                |
| mt-bench-medium        |                    45, 40                    | MT Bench 9s vs 2-5s                                               |
| mt-bench-hard          |                    45, 37                    | MT Bench 7-8 vs 5-6                                               |
| refusals-dangerous     |                   505, 100                   | Dangerous response vs no response                                 |
| refusals-offensive     |                   704, 100                   | Offensive response vs no response                                 |
| llmbar-natural         |                     100                      | (See [paper](https://arxiv.org/abs/2310.07641)) Manually curated instruction pairs |
| llmbar-adver-neighbor  |                     134                      | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. off-topic prompt response |
| llmbar-adver-GPTInst   |                      92                      | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. GPT-4 generated off-topic prompt response |
| llmbar-adver-GPTOut    |                      47                      | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. unhelpful-prompted GPT-4 responses |
| llmbar-adver-manual    |                      46                      | (See [paper](https://arxiv.org/abs/2310.07641)) Challenge set chosen vs. rejected |
| XSTest                 |                   450, 404                   | False refusal dataset (see [paper](https://arxiv.org/abs/2308.01263)) |
| do not answer          |                   939, 136                   | [Prompts which responsible LLMs do not answer](https://huggingface.co/datasets/LibrAI/do-not-answer) |
| hep-cpp                |                     164                      | C++ code revisions (See [dataset](https://huggingface.co/datasets/bigcode/humanevalpack) or [paper](https://arxiv.org/abs/2308.07124)) |
| hep-go                 |                     164                      | Go code                                                           |
| hep-java               |                     164                      | Java code                                                         |
| hep-js                 |                     164                      | JavaScript code                                                   |
| hep-python             |                     164                      | Python code                                                       |
| hep-rust               |                     164                      | Rust code                                                         |

For more details, see the [dataset](https://huggingface.co/datasets/ai2-rlhf-collab/rm-benchmark-dev).
"""
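

# Illustrative sketch only (not part of the benchmark code): one way the win
# percentage described in ABOUT_TEXT could be computed from per-pair scores.
# The name `win_percentage` and the (chosen_score, rejected_score) tuple layout
# are assumptions made for this example. Ties are treated as losses here,
# following the "higher than" wording in the text above.
def win_percentage(score_pairs):
    """Return the percentage of pairs whose chosen score is strictly higher."""
    pairs = list(score_pairs)
    if not pairs:
        raise ValueError("need at least one (chosen, rejected) score pair")
    # A win requires the chosen response to score strictly above the rejected one.
    wins = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return 100.0 * wins / len(pairs)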