weqweasdas
committed on
Update README.md
README.md
CHANGED
@@ -29,8 +29,9 @@ The model is trained on a mixture of the following datasets. We also provide the
 - [HelpSteer](https://huggingface.co/datasets/nvidia/HelpSteer)
 - [Orca](argilla/distilabel-intel-orca-dpo-pairs)
 
-Difference between this mixture and
+Difference between this mixture and the original dataset
 
+- HH-RLHF: we only use the helpful subset and we delete the noisy samples where chosen_response == rejected_response;
 - SHP: we only use the samples with score ratio > 2, for each prompt, we take 5 comparison at most, leading to 109526;
 - Ultrafeedback: similar to UltraFeedback-Binarized, we use the fine-grained score instead of the overall one to rank samples. Meanwhile, for each prompt, we take all possible 6 pairs of comparisons. Finally, we delete the selected pairs with equal scores, leading to 267416.
 - HelpSteer: we use the mean of helpfulness and correctness to rank samples. Meanwhile, we take all possible 6 pairs of comparisons. Finally, we delete the selected pairs with equal scores, leading to 21576;
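The filtering rules listed in this hunk can be sketched roughly as follows. This is a minimal illustration, not the author's preprocessing script: the field names (`helpfulness`, `correctness`, `score_ratio`, `prompt_id`, `chosen`, `rejected`) and all function names are hypothetical, and keeping the first 5 SHP comparisons encountered per prompt is an assumption, since the README does not say which 5 are kept.

```python
from itertools import combinations


def build_pairs(responses, score_fn):
    """Form all pairwise comparisons and drop pairs with equal scores.

    With 4 responses per prompt this yields at most C(4, 2) = 6 pairs.
    Each pair is returned as (chosen, rejected).
    """
    pairs = []
    for a, b in combinations(responses, 2):
        score_a, score_b = score_fn(a), score_fn(b)
        if score_a == score_b:
            continue  # ties carry no preference signal, so they are deleted
        pairs.append((a, b) if score_a > score_b else (b, a))
    return pairs


def helpsteer_score(response):
    """Mean of helpfulness and correctness (field names are assumed)."""
    return (response["helpfulness"] + response["correctness"]) / 2


def filter_shp(comparisons, min_ratio=2.0, max_per_prompt=5):
    """Keep comparisons with score ratio > min_ratio, at most 5 per prompt.

    Assumption: the first qualifying comparisons per prompt are the ones kept.
    """
    kept, count = [], {}
    for comp in comparisons:
        if comp["score_ratio"] <= min_ratio:
            continue
        seen = count.get(comp["prompt_id"], 0)
        if seen >= max_per_prompt:
            continue
        count[comp["prompt_id"]] = seen + 1
        kept.append(comp)
    return kept


def drop_identical(pairs):
    """Remove noisy samples where chosen and rejected responses are identical."""
    return [p for p in pairs if p["chosen"] != p["rejected"]]


# Usage: four responses to one prompt, two of which tie on the mean score.
responses = [
    {"id": "A", "helpfulness": 4, "correctness": 4},
    {"id": "B", "helpfulness": 3, "correctness": 5},
    {"id": "C", "helpfulness": 2, "correctness": 2},
    {"id": "D", "helpfulness": 1, "correctness": 0},
]
pairs = build_pairs(responses, helpsteer_score)
```

Here A and B both average 4.0, so that pair is dropped and 5 of the 6 possible pairs remain.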