weqweasdas committed
Commit 60ccec5
Parent: 4baeddc

Update README.md

Files changed (1): README.md (+6, -1)
README.md CHANGED
@@ -71,7 +71,12 @@ We test the reward model by the [RAFT (Reward ranked finetuning)](https://arxiv.
 
 For each iteration, we sample 2048 prompts from the HH-RLHF dataset, and for each prompt, we generate K=8 responses with the current model and pick the response with the highest reward. We then fine-tune the model on this selected set to obtain the new model. We report the learning curve as follows:
 
-![Reward Curve of RAFT](raft.png)
+![Reward Curve of RAFT with GPT-Neo-2.7B](raft.png)
+
+We also run the experiment with the LLaMA-7B model, but we first fine-tune the base model on the chosen responses in the HH-RLHF dataset for 1 epoch with a learning rate of 2e-5. The RAFT hyper-parameters are the same as for GPT-Neo-2.7B, and the reward curve is presented below:
+
+![Reward Curve of RAFT with LLaMA-7B](llama_reward.png)
+
 
 
 ## Reference
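
For readers who want a concrete picture of the best-of-K selection step described in the updated paragraph above, here is a minimal Python sketch under stated assumptions. The names `raft_select`, `generate_responses`, `reward_model`, and `hh_rlhf_prompts` are hypothetical placeholders for the policy's sampling routine, the trained reward model, and the prompt pool; they are not functions or data from this repository.

```python
from typing import Callable, List


def raft_select(
    prompts: List[str],
    generate_responses: Callable[[str, int], List[str]],  # (prompt, K) -> K sampled responses
    reward_model: Callable[[str, str], float],            # (prompt, response) -> scalar reward
    k: int = 8,
) -> List[dict]:
    """For each prompt, keep only the highest-reward response out of K samples."""
    selected = []
    for prompt in prompts:
        candidates = generate_responses(prompt, k)
        # Rank the K candidates with the reward model and keep the best one.
        best = max(candidates, key=lambda r: reward_model(prompt, r))
        selected.append({"prompt": prompt, "response": best})
    return selected


# Sketch of one iteration: sample 2048 prompts, build the filtered set,
# then fine-tune the current model on `sft_data` with a standard SFT loop.
# prompts = random.sample(hh_rlhf_prompts, 2048)
# sft_data = raft_select(prompts, generate_responses, reward_model, k=8)
```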