weqweasdas committed
Commit 5015d45
Parent(s): 23cf908

Update README.md

Files changed (1)
  1. README.md +8 -1
README.md CHANGED
@@ -8,7 +8,7 @@
 
 <!-- Provide a quick summary of what the model is/does. -->
 
- In this repo, we present a reward model trained by the framework [LMFlow](https://github.com/OptimalScale/LMFlow). The reward model isfor the [HH-RLHF dataset](Dahoas/full-hh-rlhf), and is trained from the base model [openlm-research/open_llama_3b](https://huggingface.co/openlm-research/open_llama_3b).
+ In this repo, we present a reward model trained with the [LMFlow](https://github.com/OptimalScale/LMFlow) framework. The reward model is trained on the [HH-RLHF dataset](Dahoas/full-hh-rlhf), starting from the base model [openlm-research/open_llama_3b](https://huggingface.co/openlm-research/open_llama_3b).
 
 ## Model Details
 
@@ -65,6 +65,13 @@ We use bf16 and do not use LoRA in both of the stages.
 
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 
+ ### RAFT Example
+
+ We test the reward model with [RAFT (Reward rAnked FineTuning)](https://arxiv.org/pdf/2304.06767.pdf), using EleutherAI/gpt-neo-2.7B as the starting checkpoint.
+
+ For each iteration, we sample 2048 prompts from the HH-RLHF dataset; for each prompt, we generate K=8 responses with the current model and keep the one with the highest reward. We then finetune the model on this selected set to obtain the next model. The learning curve is shown below:
+
+ ![Reward Curve of RAFT](raft.png)
 
 
  ## Reference
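
To make the best-of-K selection step in the RAFT example above concrete, here is a minimal sketch in Python. It assumes the reward model can be loaded as a single-logit sequence-classification head through `transformers`; the repository id in `reward_name`, the prompt/response formatting, and the generation settings are illustrative placeholders rather than the exact LMFlow/RAFT configuration.

```python
# Sketch of one RAFT best-of-K selection step, under the assumptions stated above.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

policy_name = "EleutherAI/gpt-neo-2.7B"    # starting checkpoint named in the RAFT example
reward_name = "path/to/this-reward-model"  # placeholder: replace with this reward model's repo id

policy_tok = AutoTokenizer.from_pretrained(policy_name)
policy = AutoModelForCausalLM.from_pretrained(policy_name, torch_dtype=torch.bfloat16)

reward_tok = AutoTokenizer.from_pretrained(reward_name)
reward = AutoModelForSequenceClassification.from_pretrained(reward_name, num_labels=1)
if reward_tok.pad_token is None:           # some tokenizers ship without a pad token
    reward_tok.pad_token = reward_tok.eos_token


def best_of_k(prompt: str, k: int = 8, max_new_tokens: int = 128) -> str:
    """Sample k responses for a prompt and return the one with the highest reward."""
    inputs = policy_tok(prompt, return_tensors="pt")
    generations = policy.generate(
        **inputs,
        do_sample=True,
        top_p=0.9,
        max_new_tokens=max_new_tokens,
        num_return_sequences=k,
        pad_token_id=policy_tok.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    responses = [
        policy_tok.decode(g[prompt_len:], skip_special_tokens=True) for g in generations
    ]
    # Score each prompt+response pair with the reward model and keep the best one.
    scored = reward_tok(
        [prompt + r for r in responses],
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
    with torch.no_grad():
        rewards = reward(**scored).logits.squeeze(-1)
    return responses[int(rewards.argmax())]
```

In the full RAFT loop, the responses selected this way for all 2048 sampled prompts would form the finetuning set for the next iteration.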