natolambert committed on
Commit 65e180d
1 Parent(s): b7aaef4
Files changed (1)
  1. src/md.py +15 -5
src/md.py CHANGED
@@ -2,13 +2,23 @@ ABOUT_TEXT = """
  We compute the win percentage for a reward model on hand-curated chosen-rejected pairs for each prompt.
  A win is when the score for the chosen response is higher than the score for the rejected response.
 
+ ## Overview
+
  We average over 4 core sections (per-prompt weighting):
- 1. Chat: Includes the easy chat subsets (alpacaeval-easy, alpacaeval-length, alpacaeval-hard, mt-bench-easy, mt-bench-medium)
- 2. Chat Hard: Includes the hard chat subsets (mt-bench-hard, llmbar-natural, llmbar-adver-neighbor, llmbar-adver-GPTInst, llmbar-adver-GPTOut, llmbar-adver-manual)
- 3. Safety: Includes the safety subsets (refusals-dangerous, refusals-offensive, xstest-should-refuse, xstest-should-respond, do not answer)
- 4. Code: Includes the code subsets (hep-cpp, hep-go, hep-java, hep-js, hep-python, hep-rust)
+ 1. **Chat**: Includes the easy chat subsets (alpacaeval-easy, alpacaeval-length, alpacaeval-hard, mt-bench-easy, mt-bench-medium)
+ 2. **Chat Hard**: Includes the hard chat subsets (mt-bench-hard, llmbar-natural, llmbar-adver-neighbor, llmbar-adver-GPTInst, llmbar-adver-GPTOut, llmbar-adver-manual)
+ 3. **Safety**: Includes the safety subsets (refusals-dangerous, refusals-offensive, xstest-should-refuse, xstest-should-respond, do not answer)
+ 4. **Code**: Includes the code subsets (hep-cpp, hep-go, hep-java, hep-js, hep-python, hep-rust)
+
+ We include multiple types of reward models in this evaluation:
+ 1. **Sequence Classifiers** (Seq. Classifier): A model, normally trained with Hugging Face's `AutoModelForSequenceClassification`, that takes in a prompt and a response and outputs a score.
+ 2. **Custom Classifiers**: Research models with different architectures and training objectives that either take in two inputs at once or generate scores differently (e.g. PairRM and Stanford SteamSHP).
+ 3. **DPO**: Models trained with Direct Preference Optimization (DPO), with modifiers such as `-ref-free` or `-norm` changing how scores are computed.
+ 4. **Random**: A random-choice baseline.
+
+ Others, such as **Generative Judge**, are coming soon.
 
- ## Subset Summary
+ ### Subset Details
 
  The total number of prompts is 2538, filtered from 4676.
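As a note on the metric: the win condition in the text above reduces to a direct score comparison per pair. A minimal sketch in Python, with hypothetical field names (`score_chosen`, `score_rejected`) rather than the repo's actual data schema:

```python
def win_rate(results: list[dict]) -> float:
    """Fraction of pairs where the chosen response outscores the rejected one.

    Each item is expected to hold `score_chosen` and `score_rejected`
    (hypothetical field names, not necessarily those used in this repo).
    """
    wins = sum(r["score_chosen"] > r["score_rejected"] for r in results)
    return wins / len(results)
```

Ties count as losses here; whether the leaderboard breaks ties differently is not specified in this text.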
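Reading "per-prompt weighting" as pooling every prompt in a section before averaging (an assumption, since the text does not spell it out), a section score would look like:

```python
def section_score(subset_outcomes: dict[str, list[int]]) -> float:
    """Mean win/loss outcome over a section, weighting each prompt equally.

    `subset_outcomes` maps a subset name (e.g. "alpacaeval-easy") to a list
    of 0/1 wins, one per prompt. Pooling all prompts means larger subsets
    contribute proportionally more than under a mean of per-subset means.
    """
    pooled = [o for outcomes in subset_outcomes.values() for o in outcomes]
    return sum(pooled) / len(pooled)
```

The overall number would then presumably be the plain mean of the four section scores (Chat, Chat Hard, Safety, Code).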
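For the first model type, scoring with a sequence classifier follows the standard `transformers` pattern; the checkpoint below is only an illustrative reward model, not necessarily one on this leaderboard:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative reward-model checkpoint; any classifier with a single
# scalar output head works the same way.
name = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def score(prompt: str, response: str) -> float:
    """Scalar reward for one prompt-response pair."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()
```

A chosen-rejected pair then counts as a win when `score(prompt, chosen) > score(prompt, rejected)`.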
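For DPO models, the implicit reward is beta * (log pi(y|x) - log pi_ref(y|x)); dropping the reference term gives a ref-free score, and length normalization is one plausible reading of the `-norm` modifier (both exact definitions are assumptions here, not taken from this text):

```python
import torch

def dpo_score(
    policy_logprobs: torch.Tensor,  # per-token log-probs of the response under the DPO model
    ref_logprobs: torch.Tensor | None = None,  # same under the reference model; None = `-ref-free`
    beta: float = 1.0,
    length_norm: bool = False,  # one plausible reading of the `-norm` modifier
) -> float:
    """Implicit DPO reward for a single response (a sketch, not this repo's code)."""
    score = policy_logprobs.sum()
    if ref_logprobs is not None:
        # beta * log-ratio of policy to reference: the DPO implicit reward
        score = score - ref_logprobs.sum()
    if length_norm:
        score = score / policy_logprobs.numel()
    return (beta * score).item()
```

As elsewhere, the chosen response wins when its score exceeds the rejected response's score under the same settings.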