weqweasdas committed
Commit 5015d45
Parent(s): 23cf908

Update README.md

Files changed (1)
  1. README.md +8 -1
README.md CHANGED
@@ -8,7 +8,7 @@
 
 <!-- Provide a quick summary of what the model is/does. -->
 
- In this repo, we present a reward model trained by the framework [LMFlow](https://github.com/OptimalScale/LMFlow). The reward model isfor the [HH-RLHF dataset](Dahoas/full-hh-rlhf), and is trained from the base model [openlm-research/open_llama_3b](https://huggingface.co/openlm-research/open_llama_3b).
+ In this repo, we present a reward model trained with the [LMFlow](https://github.com/OptimalScale/LMFlow) framework. The reward model is trained on the [HH-RLHF dataset](Dahoas/full-hh-rlhf), starting from the base model [openlm-research/open_llama_3b](https://huggingface.co/openlm-research/open_llama_3b).
 
 ## Model Details
 
@@ -65,6 +65,13 @@ We use bf16 and do not use LoRA in both of the stages.
 
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 
+ ### RAFT Example
+
+ We test the reward model with [RAFT (Reward rAnked FineTuning)](https://arxiv.org/pdf/2304.06767.pdf), using EleutherAI/gpt-neo-2.7B as the starting checkpoint.
+
+ For each iteration, we sample 2048 prompts from the HH-RLHF dataset; for each prompt, we generate K=8 responses with the current model and keep the one with the highest reward. We then finetune the model on this selected set to obtain the next model. The learning curve is shown below:
+
+ ![Reward Curve of RAFT](raft.png)
 
 
  ## Reference
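
To make the best-of-K selection step in the RAFT example above concrete, here is a minimal sketch in Python. It assumes the reward model can be loaded as a single-logit sequence-classification head through `transformers`; the repository id in `reward_name`, the prompt/response formatting, and the generation settings are illustrative placeholders rather than the exact LMFlow/RAFT configuration.

```python
# Sketch of one RAFT best-of-K selection step, under the assumptions stated above.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

policy_name = "EleutherAI/gpt-neo-2.7B"    # starting checkpoint named in the RAFT example
reward_name = "path/to/this-reward-model"  # placeholder: replace with this reward model's repo id

policy_tok = AutoTokenizer.from_pretrained(policy_name)
policy = AutoModelForCausalLM.from_pretrained(policy_name, torch_dtype=torch.bfloat16)

reward_tok = AutoTokenizer.from_pretrained(reward_name)
reward = AutoModelForSequenceClassification.from_pretrained(reward_name, num_labels=1)
if reward_tok.pad_token is None:           # some tokenizers ship without a pad token
    reward_tok.pad_token = reward_tok.eos_token


def best_of_k(prompt: str, k: int = 8, max_new_tokens: int = 128) -> str:
    """Sample k responses for a prompt and return the one with the highest reward."""
    inputs = policy_tok(prompt, return_tensors="pt")
    generations = policy.generate(
        **inputs,
        do_sample=True,
        top_p=0.9,
        max_new_tokens=max_new_tokens,
        num_return_sequences=k,
        pad_token_id=policy_tok.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    responses = [
        policy_tok.decode(g[prompt_len:], skip_special_tokens=True) for g in generations
    ]
    # Score each prompt+response pair with the reward model and keep the best one.
    scored = reward_tok(
        [prompt + r for r in responses],
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
    with torch.no_grad():
        rewards = reward(**scored).logits.squeeze(-1)
    return responses[int(rewards.argmax())]
```

In the full RAFT loop, the responses selected this way for all 2048 sampled prompts would form the finetuning set for the next iteration.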