metadata

license: cc-by-nc-4.0
datasets:
  - berkeley-nest/Nectar
language:
  - en
library_name: transformers
tags:
  - reward model
  - RLHF
  - RLAIF

Model Card for Starling-RM-7B-alpha

Starling-RM-7B-alpha is a reward model trained from Llama2-7B-Chat. Following the reward model training method in the InstructGPT paper, we remove the last layer of Llama2-7B-Chat and append a linear layer that outputs a scalar for any pair of input prompt and response. We train the reward model on the preference dataset berkeley-nest/Nectar with the K-wise maximum likelihood estimator proposed in this paper. A response that is more helpful and less harmful receives a higher reward score. Note that because berkeley-nest/Nectar is based on GPT-4 preferences, the reward model is likely biased towards GPT-4's own preferences, including longer responses and certain response formats.
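To make the K-wise objective concrete, here is a minimal sketch of a ranking loss for one prompt. It assumes the K-wise maximum likelihood estimator corresponds to maximum likelihood under a Plackett-Luce ranking model; the function names `logsumexp` and `kwise_nll` are illustrative and not taken from the Starling training code.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def kwise_nll(rewards_ranked):
    """Negative log-likelihood of one ranking under a Plackett-Luce model.

    rewards_ranked: scalar rewards for the K responses to a single prompt,
    ordered from most preferred to least preferred (assumed input format).
    """
    nll = 0.0
    # The last position contributes zero, so it is skipped.
    for i in range(len(rewards_ranked) - 1):
        # -log P(item i is picked first among the items still remaining)
        nll += logsumexp(rewards_ranked[i:]) - rewards_ranked[i]
    return nll


# Rewards that agree with the ranking give a lower loss than rewards that invert it.
agrees = kwise_nll([2.0, 1.0, 0.0])
inverts = kwise_nll([0.0, 1.0, 2.0])
print(agrees < inverts)  # True
```

For K = 2 this expression reduces to the familiar pairwise loss, the negative log-sigmoid of the reward difference between the preferred and the rejected response.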

For more detailed discussions, please check out our blog post, and stay tuned for our upcoming code and paper!

Developed by: Banghua Zhu*, Evan Frick*, Tianhao Wu*, Hanlin Zhu, and Jiantao Jiao.
Model type: Reward Model for RLHF
License: Non-commercial license
Finetuned from model: Llama2-7B-Chat

Model Sources

Blog: https://starling.cs.berkeley.edu/
Paper: Coming soon!
Code: Coming soon!

Uses

Please use the following code for inference with the reward model.

## Define the reward model function class
Test.
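As a dependency-free illustration of the interface only (one scalar score per prompt and response pair), the sketch below ranks candidate responses by reward. `rank_responses` and `toy_reward` are hypothetical names, and `toy_reward` is a stand-in scorer, not the Starling reward model; in practice it would be replaced by a call into the loaded model.

```python
def rank_responses(prompt, responses, reward_fn):
    """Score each response with reward_fn and return (score, response)
    pairs sorted from highest to lowest reward."""
    scored = [(reward_fn(prompt, response), response) for response in responses]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_reward(prompt, response):
    """Hypothetical stand-in scorer: counts prompt words reused in the response.

    A real reward model would return a learned scalar instead.
    """
    prompt_words = set(prompt.lower().split())
    response_words = set(response.lower().replace(".", "").split())
    return float(len(prompt_words & response_words))


candidates = ["I do not know.", "Paris is the capital of France."]
ranking = rank_responses("capital of France", candidates, toy_reward)
best_score, best_response = ranking[0]
print(best_response)  # Paris is the capital of France.
```

Passing the scorer in as a callable keeps the ranking logic independent of how the reward is computed, so the same function works for best-of-n selection with any scalar reward.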

License

The dataset, model, and online demo are a research preview intended for non-commercial use only, subject to the data distillation License of LLaMA, Terms of Use of the data generated by OpenAI, and Privacy Practices of ShareGPT. Please contact us if you find any potential violation.

Acknowledgment

We would like to thank Wei-Lin Chiang from Berkeley for detailed feedback on the blog and the projects. We would like to thank the LMSYS Organization for their support of the lmsys-chat-1M dataset, evaluation, and online demo. We would like to thank the open source community for their efforts in providing the datasets and base models we used to develop the project, including but not limited to Anthropic, Llama, Mistral, Hugging Face H4, LMSYS, OpenChat, OpenBMB, Flan, and ShareGPT.

✉ Correspondence to: Banghua Zhu (banghua@berkeley.edu).

Citation

@misc{starling2023,
    title = {Starling-7B: Improving LLM Helpfulness & Harmlessness with RLAIF},
    url = {},
    author = {Zhu, Banghua and Frick, Evan and Wu, Tianhao and Zhu, Hanlin and Jiao, Jiantao},
    month = {November},
    year = {2023}
}