---
license: cc-by-nc-4.0
datasets:
- berkeley-nest/Nectar
language:
- en
library_name: transformers
tags:
- reward model
- RLHF
- RLAIF
---
# Model Card for Starling-RM-7B-alpha

Starling-RM-7B-alpha is a reward model trained from Llama2-7B-Chat. Following the approach to reward-model training used in the InstructGPT paper, we remove the last layer of Llama2-7B-Chat and concatenate a linear layer that outputs a scalar for any pair of input prompt and response. We train the reward model on the preference dataset berkeley-nest/Nectar with the K-wise maximum likelihood estimator proposed in this paper. The reward model outputs a scalar for any given prompt and response; a response that is more helpful and less harmful receives a higher reward score. Note that since berkeley-nest/Nectar is based on GPT-4 preferences, the reward model is likely biased towards GPT-4's own preferences, including longer responses and certain response formats.
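The K-wise estimator described above can be sketched as a Plackett-Luce negative log-likelihood over the scalar rewards of K ranked responses. The function below is a minimal illustration of that loss, not the released training code; the name `k_wise_mle_loss` and the input convention (rewards sorted from most- to least-preferred) are our assumptions.

```python
import numpy as np

def k_wise_mle_loss(rewards_ranked):
    """Plackett-Luce negative log-likelihood (illustrative sketch).

    rewards_ranked: scalar rewards for the K responses to one prompt,
    ordered from the most-preferred response to the least-preferred.
    At each rank i, the model should have "picked" response i out of
    the remaining responses {i, ..., K-1} with softmax probability.
    """
    r = np.asarray(rewards_ranked, dtype=float)
    loss = 0.0
    for i in range(len(r) - 1):
        tail = r[i:]
        # numerically stable log-sum-exp over the remaining responses
        m = np.max(tail)
        lse = m + np.log(np.sum(np.exp(tail - m)))
        # -log softmax probability of choosing response i first
        loss -= r[i] - lse
    return loss

# A reward model that agrees with the ranking incurs a lower loss
# than one that reverses it.
good = k_wise_mle_loss([3.0, 2.0, 1.0])
bad = k_wise_mle_loss([1.0, 2.0, 3.0])
```

For K = 2 this reduces to the standard Bradley-Terry pairwise loss used in InstructGPT-style reward modeling.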
- **Developed by:** Banghua Zhu\*, Evan Frick\*, Tianhao Wu\*, Hanlin Zhu and Jiantao Jiao.
- **Model type:** Reward Model for RLHF
- **License:** Non-commercial license
- **Finetuned from model:** Llama2-7B-Chat
## Model Sources

- **Blog:** https://starling.cs.berkeley.edu/
- **Paper:** Coming soon!
- **Code:** Coming soon!
## Uses
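Architecturally, the model is a decoder backbone whose final layer is replaced by a linear head that maps the last token's hidden state to a scalar reward for the concatenated prompt and response. The sketch below illustrates only that wiring with toy NumPy arrays standing in for the Llama2-7B-Chat activations (real hidden size 4096); the function name and shapes are our assumptions, not the released inference code.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 8  # toy value; Llama2-7B uses 4096

def reward_head(last_hidden_state, w, b):
    """Scalar reward from the final token's hidden state.

    last_hidden_state: (seq_len, hidden_size) backbone activations
    for the tokenized prompt + response. The linear head replaces
    the language-model output layer.
    """
    h_last = last_hidden_state[-1]   # hidden state of the final token
    return float(h_last @ w + b)     # linear layer -> scalar reward

# Hypothetical activations standing in for the transformer's output.
h = rng.normal(size=(5, hidden_size))
w = rng.normal(size=hidden_size)
score = reward_head(h, w, 0.0)
```

In practice the backbone and head are loaded together via `transformers`, and a higher score indicates a response the model judges more helpful and less harmful.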
## Citation

**BibTeX:**

[More Information Needed]