---
license: cc-by-nc-4.0
datasets:
- berkeley-nest/Nectar
language:
- en
library_name: transformers
tags:
- reward model
- RLHF
- RLAIF
---

# Model Card for Starling-RM-7B-alpha

Starling-RM-7B-alpha is a reward model trained from [Llama2-7B-Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). Following the reward-model training recipe of the InstructGPT paper, we remove the last layer of Llama2-7B-Chat and append a linear layer that outputs a scalar for any pair of input prompt and response. We train the reward model on the preference dataset [berkeley-nest/Nectar](https://huggingface.co/berkeley-nest) with the K-wise maximum likelihood estimator proposed in [this paper](https://arxiv.org/abs/2301.11270); a sketch of this loss appears at the end of this card. The reward model outputs a scalar for any given prompt and response, and a response that is more helpful and less harmful receives a higher reward score. Note that since the preference dataset [berkeley-nest/Nectar](https://huggingface.co/berkeley-nest) is based on GPT-4's preferences, the reward model is likely to be biased towards GPT-4's own preferences, including a preference for longer responses and certain response formats.

- **Developed by:** Banghua Zhu\*, Evan Frick\*, Tianhao Wu\*, Hanlin Zhu, and Jiantao Jiao.
- **Model type:** Reward Model for RLHF
- **License:** Non-commercial license (CC-BY-NC-4.0)
- **Finetuned from model:** [Llama2-7B-Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)

### Model Sources

- **Blog:** https://starling.cs.berkeley.edu/
- **Paper:** Coming soon!
- **Code:** Coming soon!

## Uses

The model scores any prompt-response pair: higher scores indicate responses judged more helpful and less harmful. A minimal scoring sketch appears at the end of this card.

## Citation

**BibTeX:**

[More Information Needed]
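As referenced in the model description above, the reward model is trained with the K-wise maximum likelihood estimator of [this paper](https://arxiv.org/abs/2301.11270), which fits a Plackett-Luce model over ranked responses. Below is a minimal PyTorch sketch of that loss for a single K-wise comparison, not the actual training code for this model; the convention that `rewards` is ordered from most to least preferred is an assumption for illustration.

```python
import torch

def kwise_mle_loss(rewards: torch.Tensor) -> torch.Tensor:
    """Plackett-Luce negative log-likelihood for one K-wise comparison.

    rewards: shape (K,), reward-model scores for K responses to the same
    prompt, ordered from most to least preferred (assumed convention).
    """
    K = rewards.shape[0]
    loss = torch.zeros((), dtype=rewards.dtype)
    for k in range(K - 1):
        # Negative log-probability that response k is ranked first
        # among the remaining responses k..K-1.
        loss = loss - (rewards[k] - torch.logsumexp(rewards[k:], dim=0))
    return loss

# Example: scores for K = 7 ranked responses to one prompt
scores = torch.randn(7, requires_grad=True)
print(kwise_mle_loss(scores))
```

For K = 2 this reduces to the standard Bradley-Terry pairwise loss, `-log sigmoid(r_chosen - r_rejected)`.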
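And as referenced under Uses, here is a minimal sketch of scoring a single prompt-response pair. It assumes the checkpoint loads through `AutoModelForSequenceClassification` with a single-scalar head and that inputs are formatted with the Llama-2 chat syntax; the class choice, repository id, and prompt format are assumptions, and the official loading code may differ.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "berkeley-nest/Starling-RM-7B-alpha"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Assumption: the scalar reward head loads as a one-label classifier.
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=1, torch_dtype=torch.float16
)
# Llama models define no pad token; the classification head needs one.
model.config.pad_token_id = tokenizer.eos_token_id
model.eval()

def score(prompt: str, response: str) -> float:
    # Llama-2 chat formatting, assumed here for illustration.
    text = f"[INST] {prompt} [/INST] {response}"
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()

print(score("How do I bake bread?", "Mix flour, water, yeast, and salt, then knead..."))
```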