Update README.md
README.md
CHANGED
@@ -22,6 +22,16 @@ This is a 70B reward model used for PPO training, trained on the UltraFeedback dataset
 For more details, read the paper:
 [Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://arxiv.org/abs/2406.09279).
 
+## Performance
+
+We evaluate the model on [RewardBench](https://github.com/allenai/reward-bench):
+
+| Model | Score | Chat | Chat Hard | Safety | Reasoning | Prior Sets (0.5 weight) |
+|------------------|-------|-------|-----------|--------|-----------|-------------------------|
+| [Llama 3 Tulu 2 8b UF RM](https://huggingface.co/allenai/llama-3-tulu-2-8b-uf-mean-rm) | 66.3 | 96.6 | 59.4 | 61.4 | 80.7 | |
+| **[Llama 3 Tulu 2 70b UF RM](https://huggingface.co/allenai/llama-3-tulu-2-70b-uf-mean-rm) (this model)** | | | | | | |
+
 
 ## Model description
 