dvilasuero committed on
Commit
3455ea2
1 Parent(s): 20b81a6

Update README.md

Files changed (1)
  1. README.md +104 -24
README.md CHANGED
@@ -3,7 +3,7 @@ model-index:
3
  - name: notus-7b-v1
4
  results: []
5
  datasets:
6
- - argilla/ultrafeedback-binarized-avg-rating-for-dpo
7
  language:
8
  - en
9
  base_model: alignment-handbook/zephyr-7b-sft-full
@@ -20,15 +20,12 @@ license: mit
20
  </div>
21
 
22
  # Model Card for Notus 7B v1
23
-
24
-
25
-
26
  Notus is a collection of fine-tuned models using Direct Preference Optimization (DPO) and related RLHF techniques. This model is version 1, fine-tuned with DPO starting with zephyr-7b-beta's SFT model.
27
 
28
 Following a **data-first** approach, the only difference between Notus-7B-v1 and Zephyr-7B-beta is the preference dataset used for dDPO. In particular, we found data issues in the original UltraFeedback dataset, leading to high scores for bad responses. After curating several hundred data points, we decided to binarize the dataset using the preference ratings instead of the original critique `overall_score`.
29
 Using preference ratings instead of critique scores led to a new dataset where the chosen response is different in ~50% of the cases.
30
 
31
- This model wouldn't have been possible without the amazing [Alignment Handbook]( https://github.com/huggingface/alignment-handbook/tree/main/recipes/zephyr-7b-beta) and it's based on fruitful discussions with the H4 team. In particular, we used zephyr-7b-beta's recipe, which worked out-of-the-box and let us focus on what we do best: **high-quality data**.
32
 
33
  Notus models are intended to be used as assistants via chat-like applications, and
34
  are evaluated with Chat (MT-Bench, AlpacaEval) and Academic (Open LLM Leaderboard) benchmarks for a direct comparison
@@ -54,25 +51,108 @@ with the original Zephyr dDPO model and other 7B models.
54
  ## Performance
55
 
56
  ### Chat benchmarks
57
- Table adapted from Zephyr-7b-β original table for [MT-Bench](https://huggingface.co/spaces/lmsys/mt-bench) and [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/) benchmarks. Notus stays on par with Zephyr on MT-Bench, while surpassing Zephyr, Claude 2, and Cohere Command on AlpacaEval. Making Notus the most-competitive 7B commercial model on AlpacaEval.
58
-
59
- | Model | Size | Alignment | MT-Bench (score) | AlpacaEval (win rate %) |
60
- |-------------|-----|----|---------------|--------------|
61
- | MPT-Chat | 7B |dSFT |5.42| -|
62
- | Xwin-LMv0.1 | 7B| dPPO| 6.19| 87.83|
63
- | Mistral-Instructv0.1 | 7B| - | 6.84 |-|
64
- | Zephyr-7b-β 🪁 | 7B | dDPO | **7.34** | 90.60 |
65
- | **Notus-7b-v1** | 7B | dDPO | 7.30 | **91.42** |
66
- | GPT-3.5-turbo | - |RLHF |7.94 |89.37|
67
- | Claude 2 | - |RLHF |8.06| 91.36|
68
- | Cohere Command | - |RLHF |-| 90.62|
69
- | GPT-4 | -| RLHF |8.99| 95.28|
70
- | Falcon-Instruct | 40B |dSFT |5.17 |45.71|
71
- | Guanaco | 65B | SFT |6.41| 71.80|
72
- | Llama2-Chat | 70B |RLHF |6.86| 92.66|
73
- | Vicuna v1.3 | 33B |dSFT |7.12 |88.99|
74
- | WizardLM v1.0 | 70B |dSFT |7.71 |-|
75
- | Xwin-LM v0.1 | 70B |dPPO |- |95.57|
 
76
 
77
  ## Academic benchmarks
78
 
 
3
  - name: notus-7b-v1
4
  results: []
5
  datasets:
6
+ - argilla/ultrafeedback-binarized-preferences
7
  language:
8
  - en
9
  base_model: alignment-handbook/zephyr-7b-sft-full
 
20
  </div>
21
 
22
  # Model Card for Notus 7B v1
 
 
 
23
  Notus is a collection of fine-tuned models using Direct Preference Optimization (DPO) and related RLHF techniques. This model is version 1, fine-tuned with DPO starting with zephyr-7b-beta's SFT model.
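For context, DPO trains the policy directly on binarized preference pairs (a chosen response $y_w$ and a rejected response $y_l$ for a prompt $x$) against a frozen reference model, minimizing the standard DPO objective (Rafailov et al., 2023):

$$\mathcal{L}_\text{DPO}(\pi_\theta; \pi_\text{ref}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}\right)\right]$$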
24
 
25
 Following a **data-first** approach, the only difference between Notus-7B-v1 and Zephyr-7B-beta is the preference dataset used for dDPO. In particular, we found data issues in the original UltraFeedback dataset, leading to high scores for bad responses. After curating several hundred data points, we decided to binarize the dataset using the preference ratings instead of the original critique `overall_score`.
26
 Using preference ratings instead of critique scores led to a new dataset where the chosen response is different in ~50% of the cases.
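As a rough illustration of that re-binarization step, here is a minimal, hypothetical sketch; the record layout (`completions`, `annotations`, `Rating`) mimics UltraFeedback-style data but is an assumption, and the released dataset also went through manual curation:

```python
# Hypothetical sketch: binarize an UltraFeedback-style record by the mean of its
# per-aspect preference ratings instead of the critique `overall_score`.
# Field names are illustrative and may differ from the actual dataset schema.
from statistics import mean

def binarize_by_rating(example: dict) -> dict:
    scored = []
    for completion in example["completions"]:
        # Average the per-aspect ratings (helpfulness, honesty, etc.).
        ratings = [float(a["Rating"]) for a in completion["annotations"].values()]
        scored.append((mean(ratings), completion["response"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    best, worst = scored[0], scored[-1]
    return {
        "prompt": example["instruction"],
        "chosen": best[1],
        "rejected": worst[1],  # simplification: take the lowest-rated completion
        "chosen_rating": best[0],
        "rejected_rating": worst[0],
    }
```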
27
 
28
+ This model wouldn't have been possible without the amazing [Alignment Handbook](https://github.com/huggingface/alignment-handbook/tree/main/recipes/zephyr-7b-beta), and it is based on fruitful discussions with the H4 team. In particular, we used zephyr-7b-beta's recipe, which worked out-of-the-box and enabled us to focus on what we do best: **high-quality data**.
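For orientation only, a minimal sketch of DPO fine-tuning on the binarized preferences with `trl`'s `DPOTrainer` is shown below; this is not the exact Alignment Handbook recipe, and the hyperparameters are illustrative:

```python
# Minimal sketch (not the Alignment Handbook recipe): DPO fine-tuning with trl.
# Hyperparameters are illustrative; the dataset may need light column mapping
# to the "prompt" / "chosen" / "rejected" format expected by DPOTrainer.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_id = "alignment-handbook/zephyr-7b-sft-full"  # SFT starting point
model = AutoModelForCausalLM.from_pretrained(model_id)
ref_model = AutoModelForCausalLM.from_pretrained(model_id)  # frozen reference
tokenizer = AutoTokenizer.from_pretrained(model_id)

train_dataset = load_dataset("argilla/ultrafeedback-binarized-preferences", split="train")

training_args = TrainingArguments(
    output_dir="notus-7b-dpo",
    per_device_train_batch_size=4,
    learning_rate=5e-7,   # illustrative values, not the released recipe
    num_train_epochs=1,
    bf16=True,
)

trainer = DPOTrainer(
    model,
    ref_model,
    args=training_args,
    beta=0.1,             # DPO temperature on the implicit reward
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_length=1024,
    max_prompt_length=512,
)
trainer.train()
```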
29
 
30
  Notus models are intended to be used as assistants via chat-like applications, and
31
  are evaluated with Chat (MT-Bench, AlpacaEval) and Academic (Open LLM Leaderboard) benchmarks for a direct comparison
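Since Notus is meant to be used as a chat assistant, a minimal usage sketch with the `transformers` pipeline is shown below; the `argilla/notus-7b-v1` identifier and generation settings are assumptions for illustration:

```python
# Minimal sketch: chatting with Notus via the transformers pipeline.
# Model id and generation parameters are illustrative assumptions.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="argilla/notus-7b-v1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful, honest assistant."},
    {"role": "user", "content": "What is Direct Preference Optimization?"},
]
# Build the Zephyr-style prompt from the tokenizer's chat template.
prompt = generator.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
outputs = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.95)
print(outputs[0]["generated_text"])
```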
 
51
  ## Performance
52
 
53
  ### Chat benchmarks
54
+ Table adapted from the original Zephyr-7b-β and Starling tables for the [MT-Bench](https://huggingface.co/spaces/lmsys/mt-bench) and [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/) benchmarks. Results are sorted by AlpacaEval win rate and omit some >7B models for brevity.
55
+ Notus stays on par with Zephyr on MT-Bench, while surpassing Zephyr, Claude 2, and Cohere Command on AlpacaEval, making it the most competitive commercially usable 7B model on AlpacaEval.
56
+ <table>
57
+ <tr>
58
+ <th>Model</th>
59
+ <th>Size</th>
60
+ <th>Alignment</th>
61
+ <th>MT-Bench (score)</th>
62
+ <th>AlpacaEval (win rate %)</th>
63
+ <th>License</th>
64
+ </tr>
65
+ <tr>
66
+ <td>GPT-4-turbo</td>
67
+ <td>-</td>
68
+ <td>?</td>
69
+ <td>9.32</td>
70
+ <td>97.70</td>
71
+ <td>Proprietary</td>
72
+ </tr>
73
+ <tr>
74
+ <td>XwinLM 70b V0.1</td>
75
+ <td>70B</td>
76
+ <td>dPPO</td>
77
+ <td>-</td>
78
+ <td>95.57</td>
79
+ <td>LLaMA 2 License</td>
80
+ </tr>
81
+ <tr>
82
+ <td>GPT-4</td>
83
+ <td>-</td>
84
+ <td>RLHF</td>
85
+ <td>8.99</td>
86
+ <td>95.03</td>
87
+ <td>Proprietary</td>
88
+ </tr>
89
+ <tr>
90
+ <td>Tulu 2+DPO 70B V0.1</td>
91
+ <td>70B</td>
92
+ <td>dDPO</td>
93
+ <td>6.29</td>
94
+ <td>95.28</td>
95
+ <td>AI2 ImpACT Low-risk license</td>
96
+ </tr>
97
+ <tr>
98
+ <td>LLaMA2 Chat 70B</td>
99
+ <td>70B</td>
100
+ <td>RLHF</td>
101
+ <td>6.86</td>
102
+ <td>92.66</td>
103
+ <td>LLaMA 2 License</td>
104
+ </tr>
105
+ <tr>
106
+ <td>Starling-7B</td>
107
+ <td>7B</td>
108
+ <td>C-RLFT + APA</td>
109
+ <td><strong>8.09</strong></td>
110
+ <td><strong>91.99</strong></td>
111
+ <td>CC-BY-NC-4.0</td>
112
+ </tr>
113
+ <tr style="background-color: #FFFF99;">
114
+ <td><strong>Notus-7b-v1</strong></td>
115
+ <td>7B</td>
116
+ <td>dDPO</td>
117
+ <td>7.30</td>
118
+ <td>91.42</td>
119
+ <td>MIT</td>
120
+ </tr>
121
+ <tr>
122
+ <td>Claude 2</td>
123
+ <td>-</td>
124
+ <td>RLHF</td>
125
+ <td>8.06</td>
126
+ <td>91.36</td>
127
+ <td>Proprietary</td>
128
+ </tr>
129
+ <tr>
130
+ <td>Zephyr-7b-β</td>
131
+ <td>7B</td>
132
+ <td>dDPO</td>
133
+ <td>7.34</td>
134
+ <td>90.60</td>
135
+ <td>MIT</td>
136
+ </tr>
137
+ <tr>
138
+ <td>Cohere Command</td>
139
+ <td>-</td>
140
+ <td>RLHF</td>
141
+ <td>-</td>
142
+ <td>90.62</td>
143
+ <td>Proprietary</td>
144
+ </tr>
145
+ <tr>
146
+ <td>GPT-3.5-turbo</td>
147
+ <td>-</td>
148
+ <td>RLHF</td>
149
+ <td>7.94</td>
150
+ <td>89.37</td>
151
+ <td>Proprietary</td>
152
+ </tr>
153
+ </table>
154
+
155
+
156
 
157
  ## Academic benchmarks
158