Update README.md
README.md CHANGED
@@ -21,9 +21,9 @@ tags:
-We introduce Starling-7B, an open large language model (LLM) trained by Reinforcement Learning from AI Feedback (RLAIF). The model harnesses the power of our new GPT-4 labeled ranking dataset, Nectar, and our new reward training and policy tuning pipeline. Starling-7B-alpha scores 8.09 in MT Bench with GPT-4 as a judge, outperforming every model to date on MT-Bench except for OpenAI's GPT-4 and GPT-4 Turbo. We release the ranking dataset [Nectar](https://huggingface.co/datasets/berkeley-nest/Nectar), the reward model [Starling-RM-7B-alpha](https://huggingface.co/berkeley-nest/Starling-RM-7B-alpha) and the language model [Starling-LM-7B-alpha](https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha) on HuggingFace, and an online demo in LMSYS [Chatbot Arena](https://chat.lmsys.org). Stay tuned for our forthcoming code and paper, which will provide more details on the whole process.
+We introduce Starling-7B, an open large language model (LLM) trained by Reinforcement Learning from AI Feedback (RLAIF). The model harnesses the power of our new GPT-4 labeled ranking dataset, [berkeley-nest/Nectar](https://huggingface.co/datasets/berkeley-nest/Nectar), and our new reward training and policy tuning pipeline. Starling-7B-alpha scores 8.09 in MT Bench with GPT-4 as a judge, outperforming every model to date on MT-Bench except for OpenAI's GPT-4 and GPT-4 Turbo. We release the ranking dataset [Nectar](https://huggingface.co/datasets/berkeley-nest/Nectar), the reward model [Starling-RM-7B-alpha](https://huggingface.co/berkeley-nest/Starling-RM-7B-alpha) and the language model [Starling-LM-7B-alpha](https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha) on HuggingFace, and an online demo in LMSYS [Chatbot Arena](https://chat.lmsys.org). Stay tuned for our forthcoming code and paper, which will provide more details on the whole process.
-Starling-LM-7B-alpha is a language model trained from [Openchat 3.5](https://huggingface.co/openchat/openchat_3.5)
+Starling-LM-7B-alpha is a language model trained from [Openchat 3.5](https://huggingface.co/openchat/openchat_3.5) with reward model [berkeley-nest/Starling-RM-7B-alpha](https://huggingface.co/berkeley-nest/Starling-RM-7B-alpha) and policy optimization method [advantage-induced policy alignment (APA)](https://arxiv.org/abs/2306.02231). The evaluation results are listed below.
| Model | Tuning Method | MT Bench | AlpacaEval | MMLU |
@@ -43,12 +43,6 @@ Starling-LM-7B-alpha is a language model trained from [Openchat 3.5](https://hug
-Following the method for training the reward model in [the instructGPT paper](https://arxiv.org/abs/2203.02155), we remove the last layer of Llama2-7B Chat,
-and concatenate a linear layer that outputs a scalar for any pair of input prompt and response. We train the reward model with the preference dataset [berkeley-nest/Nectar](https://huggingface.co/berkeley-nest),
-with the K-wise maximum likelihood estimator proposed in [this paper](https://arxiv.org/abs/2301.11270). The reward model outputs a scalar for any given prompt and response. A response that is more helpful and
-less harmful will get the highest reward score. Note that since the preference dataset [berkeley-nest/Nectar](https://huggingface.co/berkeley-nest) is based on GPT-4 preference, the reward model is likely to be biased
-towards GPT-4's own preference, including longer responses and certain response formats.
-
For more detailed discussions, please check out our [blog post](https://starling.cs.berkeley.edu), and stay tuned for our upcoming code and paper!
<!-- Provide the basic links for the model. -->
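To make the reward-model recipe quoted above concrete, here is a minimal sketch of that setup: a Llama2-7B-Chat backbone whose language-modeling head is replaced by a linear layer producing one scalar per (prompt, response) pair, trained with a K-wise (Plackett-Luce) maximum-likelihood loss over GPT-4 rankings from Nectar. The class, function, and checkpoint names below are illustrative assumptions, not the released training code.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    """Illustrative scalar reward head on a Llama2-7B-Chat backbone (names are assumptions)."""

    def __init__(self, backbone_name: str = "meta-llama/Llama-2-7b-chat-hf"):
        super().__init__()
        # AutoModel loads the transformer without its language-modeling head,
        # i.e. "the last layer is removed" as described above.
        self.backbone = AutoModel.from_pretrained(backbone_name)
        # Linear layer mapping the final hidden state to a single scalar reward.
        self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Score each (prompt, response) pair by the hidden state of its last
        # non-padding token (assumes right padding).
        last_idx = attention_mask.sum(dim=1) - 1
        rows = torch.arange(hidden.size(0), device=hidden.device)
        return self.reward_head(hidden[rows, last_idx]).squeeze(-1)

def k_wise_loss(rewards: torch.Tensor) -> torch.Tensor:
    """Plackett-Luce negative log-likelihood for one prompt.

    `rewards` holds the scalar scores of K responses to the same prompt,
    ordered from the highest-ranked response to the lowest-ranked one.
    """
    loss = rewards.new_zeros(())
    for k in range(rewards.size(0) - 1):
        # Log-probability that the k-th ranked response beats all lower-ranked ones.
        loss = loss - torch.log_softmax(rewards[k:], dim=0)[0]
    return loss
```

Because each Nectar prompt carries a full ranking of K responses rather than a single chosen/rejected pair, the loss above extracts one comparison per rank position instead of collapsing the ranking into a pairwise label.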
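To try the released checkpoint that the added lines describe, a minimal generation sketch follows. It assumes Starling-LM-7B-alpha keeps Openchat 3.5's "GPT4 Correct User / GPT4 Correct Assistant" chat format; that is an assumption here rather than something stated in this change, so the model card remains the authoritative source for the prompt template.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "berkeley-nest/Starling-LM-7B-alpha"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Assumed Openchat-3.5-style single-turn prompt; verify against the model card.
prompt = "GPT4 Correct User: What is RLAIF?<|end_of_turn|>GPT4 Correct Assistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

# Print only the newly generated tokens, skipping the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```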