---
license: cc-by-nc-4.0
datasets:
- berkeley-nest/Nectar
language:
- en
library_name: transformers
tags:
- reward model
- RLHF
- RLAIF
---
# Model Card for Starling-RM-7B-alpha

Starling-RM-7B-alpha is a reward model trained from Llama2-7B-Chat. Following the approach to reward-model training used in the InstructGPT paper, we remove the last layer of Llama2-7B-Chat and concatenate a linear layer that outputs a scalar for any pair of input prompt and response. We train the reward model on the preference dataset berkeley-nest/Nectar with the K-wise maximum likelihood estimator proposed in this paper. The reward model outputs a scalar for any given prompt and response; a response that is more helpful and less harmful receives a higher reward score. Note that since berkeley-nest/Nectar is based on GPT-4 preferences, the reward model is likely biased towards GPT-4's own preferences, including longer responses and certain response formats.
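The K-wise estimator described above can be sketched as a Plackett-Luce negative log-likelihood over the scalar rewards of K ranked responses. The function below is a minimal illustration of that loss, not the released training code; the name `k_wise_mle_loss` and the input convention (rewards sorted from most- to least-preferred) are our assumptions.

```python
import numpy as np

def k_wise_mle_loss(rewards_ranked):
    """Plackett-Luce negative log-likelihood (illustrative sketch).

    rewards_ranked: scalar rewards for the K responses to one prompt,
    ordered from the most-preferred response to the least-preferred.
    At each rank i, the model should have "picked" response i out of
    the remaining responses {i, ..., K-1} with softmax probability.
    """
    r = np.asarray(rewards_ranked, dtype=float)
    loss = 0.0
    for i in range(len(r) - 1):
        tail = r[i:]
        # numerically stable log-sum-exp over the remaining responses
        m = np.max(tail)
        lse = m + np.log(np.sum(np.exp(tail - m)))
        # -log softmax probability of choosing response i first
        loss -= r[i] - lse
    return loss

# A reward model that agrees with the ranking incurs a lower loss
# than one that reverses it.
good = k_wise_mle_loss([3.0, 2.0, 1.0])
bad = k_wise_mle_loss([1.0, 2.0, 3.0])
```

For K = 2 this reduces to the standard Bradley-Terry pairwise loss used in InstructGPT-style reward modeling.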
- **Developed by:** Banghua Zhu\*, Evan Frick\*, Tianhao Wu\*, Hanlin Zhu and Jiantao Jiao.
- **Model type:** Reward Model for RLHF
- **License:** Non-commercial license
- **Finetuned from model:** Llama2-7B-Chat
## Model Sources

- **Blog:** https://starling.cs.berkeley.edu/
- **Paper:** Coming soon!
- **Code:** Coming soon!
## Uses
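Architecturally, the model is a decoder backbone whose final layer is replaced by a linear head that maps the last token's hidden state to a scalar reward for the concatenated prompt and response. The sketch below illustrates only that wiring with toy NumPy arrays standing in for the Llama2-7B-Chat activations (real hidden size 4096); the function name and shapes are our assumptions, not the released inference code.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 8  # toy value; Llama2-7B uses 4096

def reward_head(last_hidden_state, w, b):
    """Scalar reward from the final token's hidden state.

    last_hidden_state: (seq_len, hidden_size) backbone activations
    for the tokenized prompt + response. The linear head replaces
    the language-model output layer.
    """
    h_last = last_hidden_state[-1]   # hidden state of the final token
    return float(h_last @ w + b)     # linear layer -> scalar reward

# Hypothetical activations standing in for the transformer's output.
h = rng.normal(size=(5, hidden_size))
w = rng.normal(size=hidden_size)
score = reward_head(h, w, 0.0)
```

In practice the backbone and head are loaded together via `transformers`, and a higher score indicates a response the model judges more helpful and less harmful.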
## Citation

**BibTeX:**

[More Information Needed]