Update README.md
README.md
CHANGED
@@ -22,6 +22,16 @@ This is a 70B reward model used for PPO training, trained on the UltraFeedback dataset
 For more details, read the paper:
 [Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://arxiv.org/abs/2406.09279).
 
+## Performance
+
+We evaluate the model on [RewardBench](https://github.com/allenai/reward-bench):
+
+| Model | Score | Chat | Chat Hard | Safety | Reasoning | Prior Sets (0.5 weight) |
+|------------------|-------|-------|-----------|--------|-----------|-------------------------|
+| [Llama 3 Tulu 2 8b UF RM](https://huggingface.co/allenai/llama-3-tulu-2-8b-uf-mean-rm) | 66.3 | 96.6 | 59.4 | 61.4 | 80.7 | |
+| **[Llama 3 Tulu 2 70b UF RM](https://huggingface.co/allenai/llama-3-tulu-2-70b-uf-mean-rm) (this model)** | | | | | | |
+
 
 ## Model description
 