---
license: cc-by-nc-4.0
datasets:
  - berkeley-nest/Nectar
language:
  - en
library_name: transformers
tags:
  - reward model
  - RLHF
  - RLAIF
---

# Model Card for Starling-RM-7B-alpha

Starling-RM-7B-alpha is a reward model trained from Llama2-7B-Chat. Following the reward-model training recipe of the InstructGPT paper, we remove the last layer of Llama2-7B-Chat and append a linear layer that outputs a scalar for any pair of input prompt and response. We train the reward model on the preference dataset berkeley-nest/Nectar with the K-wise maximum likelihood estimator proposed in this paper. The reward model outputs a scalar for any given prompt and response; a response that is more helpful and less harmful receives a higher reward score. Note that since the preference dataset berkeley-nest/Nectar is based on GPT-4 rankings, the reward model is likely biased towards GPT-4's own preferences, including longer responses and certain response formats.
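The K-wise estimator amounts to a Plackett-Luce style ranking likelihood over the scalar rewards of the K ranked responses to the same prompt. The sketch below is an illustration of that objective, not the authors' training code; the function name `k_wise_nll` and the single-prompt example are assumptions.

```python
import torch

def k_wise_nll(rewards: torch.Tensor) -> torch.Tensor:
    """Plackett-Luce negative log-likelihood for one K-wise comparison.

    `rewards` holds the reward-model scores for the K responses to a single
    prompt, ordered from most preferred (index 0) to least preferred. At each
    rank i, we add the negative log-probability that response i beats all
    responses ranked below it.
    """
    K = rewards.shape[0]
    nll = rewards.new_zeros(())
    for i in range(K - 1):
        # log-softmax over the remaining candidates, keep the top-ranked one
        nll = nll - torch.log_softmax(rewards[i:], dim=0)[0]
    return nll

# Example: scores for 7 ranked responses to one prompt (illustrative values)
scores = torch.randn(7, requires_grad=True)
loss = k_wise_nll(scores)
loss.backward()
```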

- **Developed by:** Banghua Zhu*, Evan Frick*, Tianhao Wu*, Hanlin Zhu, and Jiantao Jiao.
- **Model type:** Reward model for RLHF
- **License:** Non-commercial license (CC-BY-NC-4.0)
- **Finetuned from model:** Llama2-7B-Chat

## Model Sources [optional]

## Uses
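A minimal scoring sketch of the interface described above (the backbone's final-token hidden state fed to a scalar linear head). This is illustrative only: the linear head below is randomly initialized, the prompt format is an assumption, and loading the actual Starling-RM-7B-alpha weights requires the loading code distributed with the model, not this snippet.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Backbone only; the LM head is not used. Access to Llama-2 weights is gated.
BASE = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(BASE)
backbone = AutoModel.from_pretrained(BASE, torch_dtype=torch.bfloat16)

# Placeholder scalar head: stands in for the trained reward head shipped with
# Starling-RM-7B-alpha (a randomly initialized layer gives meaningless scores).
reward_head = torch.nn.Linear(backbone.config.hidden_size, 1, dtype=torch.bfloat16)

@torch.no_grad()
def reward(prompt: str, response: str) -> float:
    """Scalar reward for a (prompt, response) pair; higher = more helpful / less harmful."""
    text = f"[INST] {prompt} [/INST] {response}"      # assumed Llama-2 chat format
    ids = tokenizer(text, return_tensors="pt")
    last_hidden = backbone(**ids).last_hidden_state   # (1, seq_len, hidden_size)
    return reward_head(last_hidden[:, -1, :]).item()  # score read from the final token

print(reward("How do I stay safe online?", "Use strong, unique passwords and enable 2FA."))
```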

## Citation [optional]

**BibTeX:**

[More Information Needed]