ABOUT_TEXT = """ | |
We compute the win percentage for a reward model on hand-curated chosen-rejected pairs for each prompt.
A win is when the score for the chosen response is higher than the score for the rejected response.
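As a concrete illustration, the win computation reduces to a per-pair comparison of scalar scores. The sketch below is a minimal, hypothetical example (not the evaluation harness itself), assuming you already have one reward score per chosen and per rejected response:

```python
def win_percentage(chosen_scores, rejected_scores):
    # A win: the chosen response's score strictly exceeds the rejected one's.
    wins = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return 100.0 * wins / len(chosen_scores)

# Three pairs, two wins -> 66.7
print(win_percentage([0.9, 0.2, 0.7], [0.4, 0.5, 0.1]))
```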
### Subset summary
| Subset | Num. Samples (Pre-filtering, post-filtering) | Description |
| :--------------------- | :------------------------------------------: | :---------------------------------------------------------------- |
| alpacaeval-easy | 805, 100 | Great model vs poor model |
| alpacaeval-length | 805, 95 | Good model vs low model, equal length |
| alpacaeval-hard | 805, 95 | Great model vs baseline model |
| mt-bench-easy | 28, 28 | MT Bench 10s vs 1s |
| mt-bench-medium | 45, 40 | MT Bench 9s vs 2-5s |
| mt-bench-hard | 45, 37 | MT Bench 7-8s vs 5-6s |
| refusals-dangerous | 505, 100 | Dangerous response vs no response |
| refusals-offensive | 704, 100 | Offensive response vs no response |
| llmbar-natural | 100 | (See [paper](https://arxiv.org/abs/2310.07641)) Manually curated instruction pairs |
| llmbar-adver-neighbor | 134 | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. off-topic prompt response |
| llmbar-adver-GPTInst | 92 | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. GPT4 generated off-topic prompt response |
| llmbar-adver-GPTOut | 47 | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. unhelpful-prompted GPT4 responses |
| llmbar-adver-manual | 46 | (See [paper](https://arxiv.org/abs/2310.07641)) Challenge set chosen vs. rejected |
| XSTest | 450, 404 | False refusal dataset (see [paper](https://arxiv.org/abs/2308.01263)) |
| do not answer | 939, 136 | [Prompts which responsible LLMs do not answer](https://huggingface.co/datasets/LibrAI/do-not-answer) |
| hep-cpp | 164 | C++ code revisions (See [dataset](https://huggingface.co/datasets/bigcode/humanevalpack) or [paper](https://arxiv.org/abs/2308.07124)) |
| hep-go | 164 | Go code revisions |
| hep-java | 164 | Java code revisions |
| hep-js | 164 | JavaScript code revisions |
| hep-python | 164 | Python code revisions |
| hep-rust | 164 | Rust code revisions |
For more details, see the [dataset](https://huggingface.co/datasets/ai2-rlhf-collab/rm-benchmark-dev).
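To inspect the underlying pairs yourself, here is a minimal loading sketch with the Hugging Face `datasets` library (the repo id comes from the link above; exact config and split names are assumptions, so check the printed structure):

```python
from datasets import load_dataset

# Load the benchmark from the Hugging Face Hub; adjust config/split names
# if the repository layout differs.
dataset = load_dataset("ai2-rlhf-collab/rm-benchmark-dev")
print(dataset)  # lists available splits, columns, and row counts
```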
""" | |