banghua committed
Commit 36eccc9
1 Parent(s): 9dcf3eb

Update README.md

  1. README.md +29 -9
README.md CHANGED
@@ -14,7 +14,7 @@ tags:
 
  <!-- Provide a quick summary of what the model is/does. -->
 
- Starling-RM-7B-alpha is a reward model trained from [Llama2-7B-Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). Following the tradition of training reward model in the instructGPT paper, we remove the last layer of Llama2-7B Chat,
  and concatenate a linear layer that outputs a scalar for any pair of input prompt and response. We train the reward model with the preference dataset [berkeley-nest/Nectar](https://huggingface.co/berkeley-nest),
  with the K-wise maximum likelihood estimator proposed in [this paper](https://arxiv.org/abs/2301.11270). The reward model outputs a scalar for any given prompt and response. A response that is more helpful and
  less harmful will get a higher reward score. Note that since the preference dataset [berkeley-nest/Nectar](https://huggingface.co/berkeley-nest) is based on GPT-4 preference, the reward model is likely to be biased
@@ -27,24 +27,44 @@ towards GPT-4's own preference, including longer responses and certain response
  - **License:** Non commercial license
  - **Finetuned from model:** [Llama2-7B-Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
 
- ### Model Sources [optional]
 
  <!-- Provide the basic links for the model. -->
 
  - **Blog:** https://starling.cs.berkeley.edu/
- - **Paper [optional]:** Coming soon!
- - **Code [optional]:** Coming soon!
-
  ## Uses
 
  <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 
 
 
- ## Citation [optional]
 
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 
- **BibTeX:**
 
- [More Information Needed]
 
  <!-- Provide a quick summary of what the model is/does. -->
 
+ Starling-RM-7B-alpha is a reward model trained from [Llama2-7B-Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). Following the method of training reward models in [the InstructGPT paper](https://arxiv.org/abs/2203.02155), we remove the last layer of Llama2-7B Chat,
  and concatenate a linear layer that outputs a scalar for any pair of input prompt and response. We train the reward model with the preference dataset [berkeley-nest/Nectar](https://huggingface.co/berkeley-nest),
  with the K-wise maximum likelihood estimator proposed in [this paper](https://arxiv.org/abs/2301.11270). The reward model outputs a scalar for any given prompt and response. A response that is more helpful and
  less harmful will get a higher reward score. Note that since the preference dataset [berkeley-nest/Nectar](https://huggingface.co/berkeley-nest) is based on GPT-4 preference, the reward model is likely to be biased
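
The K-wise maximum likelihood estimator mentioned in the description above can be illustrated with a small sketch. This is not the authors' training code: it assumes the estimator is the maximum likelihood fit of a Plackett-Luce ranking model, and the function names `logsumexp` and `kwise_nll` are hypothetical. The input is the list of scalar rewards the model assigns to K responses for one prompt, ordered from most to least preferred.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def kwise_nll(rewards_best_to_worst):
    """Negative log-likelihood of one ranking under a Plackett-Luce model.

    Assumption: rewards are listed from the most preferred response to the
    least preferred one. At each step the top remaining response is "chosen"
    with softmax probability over the responses that are still left.
    """
    r = list(rewards_best_to_worst)
    nll = 0.0
    # The last position contributes log(1) = 0, so it is skipped.
    for k in range(len(r) - 1):
        nll -= r[k] - logsumexp(r[k:])
    return nll


# Rewards that agree with the ranking give a lower loss than reversed rewards.
agreeing = kwise_nll([2.0, 1.0, 0.0])
reversed_order = kwise_nll([0.0, 1.0, 2.0])
```

With K = 2 this loss reduces to the familiar pairwise logistic (Bradley-Terry) loss, which is one way to sanity-check the sketch.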
 
  - **License:** Non commercial license
  - **Finetuned from model:** [Llama2-7B-Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
 
+
+ ### Model Sources
 
  <!-- Provide the basic links for the model. -->
 
  - **Blog:** https://starling.cs.berkeley.edu/
+ - **Paper:** Coming soon!
+ - **Code:** Coming soon!
+ -
  ## Uses
 
  <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+ Please use the following code for running inference with the reward model.
+
+ ```
+ # Load in the reward model
+
+
+ ```
+
 
 
+ ## License
+ The dataset, model and online demo are a research preview intended for non-commercial use only, subject to the data distillation [License](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) of LLaMA, [Terms of Use](https://openai.com/policies/terms-of-use) of the data generated by OpenAI, and [Privacy Practices](https://chrome.google.com/webstore/detail/sharegpt-share-your-chatg/daiacboceoaocpibfodeljbdfacokfjb) of ShareGPT. Please contact us if you find any potential violation.
 
 
+ ## Acknowledgment
+ We would like to thank Wei-Lin Chiang from Berkeley for detailed feedback on the blog and the projects. We would like to thank the [LMSYS Organization](https://lmsys.org/) for their support of the [lmsys-chat-1M](https://huggingface.co/datasets/lmsys/lmsys-chat-1m) dataset, evaluation and online demo. We would like to thank the open source community for their efforts in providing the datasets and base models we used to develop the project, including but not limited to Anthropic, Llama, Mistral, Hugging Face H4, LMSYS, OpenChat, OpenBMB, Flan and ShareGPT.
 
+ **✉ Correspondence to:** Banghua Zhu (banghua@berkeley.edu).
 
+ ## Citation
+ ```
+ @misc{starling2023,
+ title = {Starling-7B: Improving LLM Helpfulness & Harmlessness with RLAIF},
+ url = {},
+ author = {Zhu, Banghua and Frick, Evan and Wu, Tianhao and Zhu, Hanlin and Jiao, Jiantao},
+ month = {November},
+ year = {2023}
+ }
+ ```