Update README.md
README.md CHANGED
@@ -21,9 +21,9 @@ tags:
-We introduce Starling-7B, an open large language model (LLM) trained by Reinforcement Learning from AI Feedback (RLAIF). The model harnesses the power of our new GPT-4 labeled ranking dataset, Nectar, and our new reward training and policy tuning pipeline. Starling-7B-alpha scores 8.09 in MT Bench with GPT-4 as a judge, outperforming every model to date on MT-Bench except for OpenAI's GPT-4 and GPT-4 Turbo. We release the ranking dataset [Nectar](https://huggingface.co/datasets/berkeley-nest/Nectar), the reward model [Starling-RM-7B-alpha](https://huggingface.co/berkeley-nest/Starling-RM-7B-alpha) and the language model [Starling-LM-7B-alpha](https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha) on HuggingFace, and an online demo in LMSYS [Chatbot Arena](https://chat.lmsys.org). Stay tuned for our forthcoming code and paper, which will provide more details on the whole process.
+We introduce Starling-7B, an open large language model (LLM) trained by Reinforcement Learning from AI Feedback (RLAIF). The model harnesses the power of our new GPT-4 labeled ranking dataset, [berkeley-nest/Nectar](https://huggingface.co/datasets/berkeley-nest/Nectar), and our new reward training and policy tuning pipeline. Starling-7B-alpha scores 8.09 in MT Bench with GPT-4 as a judge, outperforming every model to date on MT-Bench except for OpenAI's GPT-4 and GPT-4 Turbo. We release the ranking dataset [Nectar](https://huggingface.co/datasets/berkeley-nest/Nectar), the reward model [Starling-RM-7B-alpha](https://huggingface.co/berkeley-nest/Starling-RM-7B-alpha) and the language model [Starling-LM-7B-alpha](https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha) on HuggingFace, and an online demo in LMSYS [Chatbot Arena](https://chat.lmsys.org). Stay tuned for our forthcoming code and paper, which will provide more details on the whole process.
-Starling-LM-7B-alpha is a language model trained from [Openchat 3.5](https://huggingface.co/openchat/openchat_3.5)
+Starling-LM-7B-alpha is a language model trained from [Openchat 3.5](https://huggingface.co/openchat/openchat_3.5) with reward model [berkeley-nest/Starling-RM-7B-alpha](https://huggingface.co/berkeley-nest/Starling-RM-7B-alpha) and policy optimization method [advantage-induced policy alignment (APA)](https://arxiv.org/abs/2306.02231). The evaluation results are listed below.
| Model | Tuning Method | MT Bench | AlpacaEval | MMLU |
@@ -43,12 +43,6 @@ Starling-LM-7B-alpha is a language model trained from [Openchat 3.5](https://hug
-Following the method for training the reward model in [the instructGPT paper](https://arxiv.org/abs/2203.02155), we remove the last layer of Llama2-7B Chat,
-and concatenate a linear layer that outputs a scalar for any pair of input prompt and response. We train the reward model with the preference dataset [berkeley-nest/Nectar](https://huggingface.co/berkeley-nest),
-with the K-wise maximum likelihood estimator proposed in [this paper](https://arxiv.org/abs/2301.11270). The reward model outputs a scalar for any given prompt and response. A response that is more helpful and
-less harmful will get the highest reward score. Note that since the preference dataset [berkeley-nest/Nectar](https://huggingface.co/berkeley-nest) is based on GPT-4 preference, the reward model is likely to be biased
-towards GPT-4's own preference, including longer responses and certain response formats.
-
For more detailed discussions, please check out our [blog post](https://starling.cs.berkeley.edu), and stay tuned for our upcoming code and paper!
<!-- Provide the basic links for the model. -->
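To make the reward-model recipe quoted above concrete, here is a minimal sketch of that setup: a Llama2-7B-Chat backbone whose language-modeling head is replaced by a linear layer producing one scalar per (prompt, response) pair, trained with a K-wise (Plackett-Luce) maximum-likelihood loss over GPT-4 rankings from Nectar. The class, function, and checkpoint names below are illustrative assumptions, not the released training code.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    """Illustrative scalar reward head on a Llama2-7B-Chat backbone (names are assumptions)."""

    def __init__(self, backbone_name: str = "meta-llama/Llama-2-7b-chat-hf"):
        super().__init__()
        # AutoModel loads the transformer without its language-modeling head,
        # i.e. "the last layer is removed" as described above.
        self.backbone = AutoModel.from_pretrained(backbone_name)
        # Linear layer mapping the final hidden state to a single scalar reward.
        self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Score each (prompt, response) pair by the hidden state of its last
        # non-padding token (assumes right padding).
        last_idx = attention_mask.sum(dim=1) - 1
        rows = torch.arange(hidden.size(0), device=hidden.device)
        return self.reward_head(hidden[rows, last_idx]).squeeze(-1)

def k_wise_loss(rewards: torch.Tensor) -> torch.Tensor:
    """Plackett-Luce negative log-likelihood for one prompt.

    `rewards` holds the scalar scores of K responses to the same prompt,
    ordered from the highest-ranked response to the lowest-ranked one.
    """
    loss = rewards.new_zeros(())
    for k in range(rewards.size(0) - 1):
        # Log-probability that the k-th ranked response beats all lower-ranked ones.
        loss = loss - torch.log_softmax(rewards[k:], dim=0)[0]
    return loss
```

Because each Nectar prompt carries a full ranking of K responses rather than a single chosen/rejected pair, the loss above extracts one comparison per rank position instead of collapsing the ranking into a pairwise label.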
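To try the released checkpoint that the added lines describe, a minimal generation sketch follows. It assumes Starling-LM-7B-alpha keeps Openchat 3.5's "GPT4 Correct User / GPT4 Correct Assistant" chat format; that is an assumption here rather than something stated in this change, so the model card remains the authoritative source for the prompt template.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "berkeley-nest/Starling-LM-7B-alpha"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Assumed Openchat-3.5-style single-turn prompt; verify against the model card.
prompt = "GPT4 Correct User: What is RLAIF?<|end_of_turn|>GPT4 Correct Assistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

# Print only the newly generated tokens, skipping the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```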