weqweasdas committed
Commit 60ccec5
Parent: 4baeddc

Update README.md

Files changed (1): README.md (+6, -1)
README.md CHANGED
@@ -71,7 +71,12 @@ We test the reward model by the [RAFT (Reward ranked finetuning)](https://arxiv.
 
 For each iteration, we sample 2048 prompts from the HH-RLHF dataset, and for each prompt, we generate K=8 responses with the current model and pick the response with the highest reward. We then fine-tune the model on this selected set to obtain the new model. We report the learning curve as follows:
 
-![Reward Curve of RAFT](raft.png)
+![Reward Curve of RAFT with GPT-Neo-2.7B](raft.png)
+
+We also run the experiment with the LLaMA-7B model, but we first fine-tune the base model on the chosen responses in the HH-RLHF dataset for 1 epoch with a learning rate of 2e-5. The RAFT hyper-parameters are the same as for GPT-Neo-2.7B, and the reward curve is presented below:
+
+![Reward Curve of RAFT with LLaMA-7B](llama_reward.png)
+
 
 
 ## Reference
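
For readers who want a concrete picture of the best-of-K selection step described in the updated paragraph above, here is a minimal Python sketch under stated assumptions. The names `raft_select`, `generate_responses`, `reward_model`, and `hh_rlhf_prompts` are hypothetical placeholders for the policy's sampling routine, the trained reward model, and the prompt pool; they are not functions or data from this repository.

```python
from typing import Callable, List


def raft_select(
    prompts: List[str],
    generate_responses: Callable[[str, int], List[str]],  # (prompt, K) -> K sampled responses
    reward_model: Callable[[str, str], float],            # (prompt, response) -> scalar reward
    k: int = 8,
) -> List[dict]:
    """For each prompt, keep only the highest-reward response out of K samples."""
    selected = []
    for prompt in prompts:
        candidates = generate_responses(prompt, k)
        # Rank the K candidates with the reward model and keep the best one.
        best = max(candidates, key=lambda r: reward_model(prompt, r))
        selected.append({"prompt": prompt, "response": best})
    return selected


# Sketch of one iteration: sample 2048 prompts, build the filtered set,
# then fine-tune the current model on `sft_data` with a standard SFT loop.
# prompts = random.sample(hh_rlhf_prompts, 2048)
# sft_data = raft_select(prompts, generate_responses, reward_model, k=8)
```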