sileod committed on
Commit 683961f
Parent(s): e226218

Update README.md

Files changed (1): README.md (+2, -0)
README.md CHANGED
@@ -23,4 +23,6 @@ model-index:
 # Reward model based `deberta-v3-large-tasksource-nli` fine-tuned on Anthropic/hh-rlhf
 For 1 epoch with 1e-5 learning rate.
 
+The data are described in the paper: [Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback](https://arxiv.org/abs/2204.05862).
+
 Validation accuracy is currently the best publicly available reported: 75.16% (vs 69.25% for `OpenAssistant/reward-model-deberta-v3-large-v2`).
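For reference, a minimal sketch of how a DeBERTa-based reward model like this one can be queried through the `transformers` sequence-classification API. The repo id below is a placeholder, not taken from this commit; substitute the actual Hub id of the model.

```python
# Minimal sketch: score a (prompt, response) pair with a reward model of this kind.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model_name = "<reward-model-repo-id>"  # placeholder, replace with the real Hub id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

question = "Explain nuclear fusion like I am five."
answer = "Fusion is when two tiny atoms squeeze together into a bigger one and release energy."

# Reward models trained on hh-rlhf-style data typically take the prompt and the
# candidate response as a single sequence-classification input and return one
# scalar logit, used as the reward when ranking candidate answers.
inputs = tokenizer(question, answer, return_tensors="pt")
with torch.no_grad():
    reward = model(**inputs).logits[0].item()
print(reward)
```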