munish0838 committed on
Commit
5e32fd0
1 Parent(s): f92c5a9

Create README.md

Files changed (1)
  1. README.md +402 -0
README.md ADDED
@@ -0,0 +1,402 @@
+ ---
+ datasets:
+ - argilla/ultrafeedback-binarized-preferences
+ language:
+ - en
+ base_model: argilla/notus-7b-v1
+ library_name: transformers
+ pipeline_tag: text-generation
+ tags:
+ - dpo
+ - rlaif
+ - preference
+ - ultrafeedback
+ license: mit
+ model-index:
+ - name: notus-7b-v1
+   results:
+   # AI2 Reasoning Challenge (25-Shot)
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: AI2 Reasoning Challenge (25-Shot)
+       type: ai2_arc
+       config: ARC-Challenge
+       split: test
+       args:
+         num_few_shot: 25
+     metrics:
+     - type: acc_norm
+       name: normalized accuracy
+       value: 0.6459044368600683
+     source:
+       name: Open LLM Leaderboard Results
+       url: https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/argilla/notus-7b-v1/results_2023-11-29T22-16-51.521321.json
+   # HellaSwag (10-shot)
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: HellaSwag (10-Shot)
+       type: hellaswag
+       split: validation
+       args:
+         num_few_shot: 10
+     metrics:
+     - type: acc_norm
+       name: normalized accuracy
+       value: 0.8478390758812986
+     source:
+       name: Open LLM Leaderboard Results
+       url: https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/argilla/notus-7b-v1/results_2023-11-29T22-16-51.521321.json
+   # TruthfulQA (0-shot)
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: TruthfulQA (0-shot)
+       type: truthful_qa
+       config: multiple_choice
+       split: validation
+       args:
+         num_few_shot: 0
+     metrics:
+     - type: mc2
+       value: 0.5436768358952805
+     source:
+       name: Open LLM Leaderboard Results
+       url: https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/argilla/notus-7b-v1/results_2023-11-29T22-16-51.521321.json
+   # MMLU (5-Shot)
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: MMLU (5-Shot)
+       type: cais/mmlu
+       config: all
+       split: test
+       args:
+         num_few_shot: 5
+     metrics:
+     - type: acc
+       name: accuracy
+       value: 0.6303308230938872 # average accuracy
+     source:
+       name: Open LLM Leaderboard Results
+       url: https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/argilla/notus-7b-v1/results_2023-11-29T22-16-51.521321.json
+   # GSM8k (5-shot)
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: GSM8k (5-shot)
+       type: gsm8k
+       config: main
+       split: test
+       args:
+         num_few_shot: 5
+     metrics:
+     - type: acc
+       name: accuracy
+       value: 0.1516300227445034
+     source:
+       name: Open LLM Leaderboard Results
+       url: https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/argilla/notus-7b-v1/results_2023-11-29T22-16-51.521321.json
+   # Winogrande (5-shot)
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: Winogrande (5-shot)
+       type: winogrande
+       config: winogrande_xl
+       split: validation
+       args:
+         num_few_shot: 5
+     metrics:
+     - type: acc
+       name: accuracy
+       value: 0.7940015785319653
+     source:
+       name: Open LLM Leaderboard Results
+       url: https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/argilla/notus-7b-v1/results_2023-11-29T22-16-51.521321.json
+   # AlpacaEval
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: AlpacaEval
+       type: tatsu-lab/alpaca_eval
+     metrics:
+     - type: tatsu-lab/alpaca_eval
+       name: win rate
+       value: 0.9142
+     source:
+       url: https://tatsu-lab.github.io/alpaca_eval/
+   # MT-Bench
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: MT-Bench
+       type: unknown
+     metrics:
+     - type: unknown
+       name: score
+       value: 7.30
+     source:
+       url: https://huggingface.co/spaces/lmsys/mt-bench
+ ---
+
+ # QuantFactory/notus-7b-v1-GGUF
+ This is a quantized version of [argilla/notus-7b-v1](https://huggingface.co/argilla/notus-7b-v1) created using llama.cpp.
+
+ # Model Description
+
+ <div align="center">
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/60f0608166e5701b80ed3f02/dj-spsk9eXMMXVGxK6jRz.png" alt="A banner representing Notus, the wind god of the south, in a mythical and artistic style. The banner features a strong, swirling breeze, embodying the warm, wet character of the southern wind. Gracefully flowing across the scene are several paper planes, caught in the gentle yet powerful gusts of Notus. The background is a blend of warm colors, symbolizing the heat of the south, with hints of blue and green to represent the moisture carried by this wind. The overall atmosphere is one of dynamic movement and warmth."/>
+ </div>
+
+ # Model Card for Notus 7B v1
+
+ Notus is a collection of models fine-tuned using Direct Preference Optimization (DPO) and related RLHF techniques. This model is the first version, fine-tuned with DPO over `zephyr-7b-sft-full`, the SFT model used to create `zephyr-7b-beta`.
+
+ Following a **data-first** approach, the only difference between Notus-7B-v1 and Zephyr-7B-beta is the preference dataset used for dDPO.
+
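For readers unfamiliar with DPO, the per-pair objective can be sketched in plain Python. This is a minimal illustration of the standard DPO loss, not the training code used for Notus. The function name `dpo_loss`, the example log-probabilities, and `beta=0.1` are assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # Implicit rewards: how much more likely the policy makes each response
    # than the frozen reference (SFT) model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) is the same as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Hypothetical log-probabilities, for illustration only.
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)   # policy equals reference
loss_improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)  # policy favours the chosen response
```

The loss starts at `log(2)` when the policy equals the reference model and shrinks as the policy raises the chosen response relative to the rejected one, which is why the choice of chosen and rejected responses in the preference dataset matters so much.
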
+ In particular, when we started building [distilabel](https://github.com/argilla-io/distilabel), we invested time in understanding and deep-diving into the UltraFeedback dataset. Using [Argilla](https://argilla.io/), we found data issues in the original UltraFeedback dataset that led to high scores for bad responses (more details in the training data section). After curating several hundred data points, we decided to binarize the dataset using the preference ratings instead of the original critique `overall_score`, and verified the new dataset with Argilla.
+
+ Using preference ratings instead of critique scores led to a new dataset where the chosen response is different in ~50% of the cases. Using this new dataset with DPO, we fine-tuned Notus, a 7B model that **surpasses Zephyr-7B-beta and Claude 2 on AlpacaEval**.
+
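The binarization idea can be sketched roughly as follows. This is an illustrative approximation, not the actual curation script. The record layout, the field names other than `overall_score`, the sample values, and the choice to sample the rejected response at random from the remaining ones are all assumptions.

```python
import random
from statistics import mean


def binarize(completions, rng):
    """Return (chosen, rejected) using the mean of the per-aspect ratings."""
    ranked = sorted(
        completions,
        key=lambda c: mean(list(c["ratings"].values())),
        reverse=True,
    )
    chosen = ranked[0]
    # Assumption: the rejected response is sampled from the remaining ones.
    rejected = rng.choice(ranked[1:])
    return chosen, rejected


# Hypothetical record: response "B" has an inflated overall_score but weak ratings.
completions = [
    {"response": "A", "overall_score": 7.5,
     "ratings": {"helpfulness": 5, "honesty": 5, "instruction_following": 4, "truthfulness": 5}},
    {"response": "B", "overall_score": 10.0,
     "ratings": {"helpfulness": 2, "honesty": 3, "instruction_following": 1, "truthfulness": 2}},
    {"response": "C", "overall_score": 6.0,
     "ratings": {"helpfulness": 3, "honesty": 4, "instruction_following": 3, "truthfulness": 4}},
]

chosen, rejected = binarize(completions, random.Random(0))
by_overall_score = max(completions, key=lambda c: c["overall_score"])
print(chosen["response"], by_overall_score["response"])
```

In this made-up record the two strategies disagree: averaging the ratings picks "A", while the critique `overall_score` picks "B". That is the kind of mismatch described above.
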
+ > **Important note**: We opted for the average of multi-aspect ratings while we fix the original dataset, but a very interesting open question remains: once the critique data is fixed, what works better, the critique scores or the preference ratings? We're very excited to do this comparison in the coming weeks, stay tuned!
+
+ This model **wouldn't have been possible without the amazing [Alignment Handbook](https://github.com/huggingface/alignment-handbook) and [OpenBMB](https://www.openbmb.cn/home), who released the UltraFeedback dataset**, and it builds on fruitful discussions with the HuggingFace H4 team. In particular, we used `zephyr-7b-beta`'s recipe, which worked out of the box and enabled us to focus on what we do best: **high-quality data**.
+
+ Notus models are intended to be used as assistants via chat-like applications, and are evaluated with Chat (MT-Bench, AlpacaEval) and Academic (Open LLM Leaderboard) benchmarks for a direct comparison with the original Zephyr dDPO model and other 7B models.
+
+ > **Why Notus?**: The name Notus comes from the ancient Greek god Notus, as a wink to Zephyr, which comes from the ancient Greek god Zephyrus; the difference is that Notus is the god of the south wind and Zephyrus the god of the west wind. More information at https://en.wikipedia.org/wiki/Anemoi.
+
+ ## Model Details
+
+ ### Model Description
+
+ - **Developed by:** Argilla (based on previous efforts and amazing work by HuggingFace H4 and MistralAI)
+ - **Shared by:** Argilla
+ - **Model type:** GPT-like 7B model, fine-tuned with DPO
+ - **Language(s) (NLP):** Mainly English
+ - **License:** MIT (same as Zephyr 7B-beta)
+ - **Finetuned from model:** [`alignment-handbook/zephyr-7b-sft-full`](https://huggingface.co/alignment-handbook/zephyr-7b-sft-full)
+
+ ### Model Sources
+
+ - **Repository:** https://github.com/argilla-io/notus
+ - **Paper:** N/A
+ - **Demo:** https://argilla-notus-chat-ui.hf.space/
+
+ ## Performance
+
+ ### Chat benchmarks
+
+ Table adapted from Zephyr-7b-β and Starling's original tables for the [MT-Bench](https://huggingface.co/spaces/lmsys/mt-bench) and [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/) benchmarks. Results are sorted by AlpacaEval win rate, and some models larger than 7B are omitted for brevity.
+
+ Notus stays on par with Zephyr on MT-Bench while surpassing Zephyr, Claude 2, and Cohere Command on AlpacaEval, which makes Notus the most competitive 7B model available for commercial use on AlpacaEval.
+
+ <table>
+   <tr>
+     <th>Model</th>
+     <th>Size</th>
+     <th>Alignment</th>
+     <th>MT-Bench (score)</th>
+     <th>AlpacaEval (win rate %)</th>
+     <th>License</th>
+   </tr>
+   <tr>
+     <td>GPT-4-turbo</td>
+     <td>-</td>
+     <td>?</td>
+     <td>9.32</td>
+     <td>97.70</td>
+     <td>Proprietary</td>
+   </tr>
+   <tr>
+     <td>XwinLM 70b V0.1</td>
+     <td>70B</td>
+     <td>dPPO</td>
+     <td>-</td>
+     <td>95.57</td>
+     <td>LLaMA 2 License</td>
+   </tr>
+   <tr>
+     <td>GPT-4</td>
+     <td>-</td>
+     <td>RLHF</td>
+     <td>8.99</td>
+     <td>95.03</td>
+     <td>Proprietary</td>
+   </tr>
+   <tr>
+     <td>Tulu 2+DPO 70B V0.1</td>
+     <td>70B</td>
+     <td>dDPO</td>
+     <td>6.29</td>
+     <td>95.28</td>
+     <td>Proprietary</td>
+   </tr>
+   <tr>
+     <td>LLaMA2 Chat 70B</td>
+     <td>70B</td>
+     <td>RLHF</td>
+     <td>6.86</td>
+     <td>92.66</td>
+     <td>LLaMA 2 License</td>
+   </tr>
+   <tr>
+     <td>Starling-7B</td>
+     <td>7B</td>
+     <td>C-RLFT + APA</td>
+     <td><strong>8.09</strong></td>
+     <td><strong>91.99</strong></td>
+     <td>CC-BY-NC-4.0</td>
+   </tr>
+   <tr style="background-color: #FFFF99;">
+     <td><strong>Notus-7b-v1</strong></td>
+     <td>7B</td>
+     <td>dDPO</td>
+     <td>7.30</td>
+     <td>91.42</td>
+     <td>MIT</td>
+   </tr>
+   <tr>
+     <td>Claude 2</td>
+     <td>-</td>
+     <td>RLHF</td>
+     <td>8.06</td>
+     <td>91.36</td>
+     <td>Proprietary</td>
+   </tr>
+   <tr>
+     <td>Zephyr-7b-β</td>
+     <td>7B</td>
+     <td>dDPO</td>
+     <td>7.34</td>
+     <td>90.60</td>
+     <td>MIT</td>
+   </tr>
+   <tr>
+     <td>Cohere Command</td>
+     <td>-</td>
+     <td>RLHF</td>
+     <td>-</td>
+     <td>90.62</td>
+     <td>Proprietary</td>
+   </tr>
+   <tr>
+     <td>GPT-3.5-turbo</td>
+     <td>-</td>
+     <td>RLHF</td>
+     <td>7.94</td>
+     <td>89.37</td>
+     <td>Proprietary</td>
+   </tr>
+ </table>
+
+ ### Academic benchmarks
+
+ Results from [OpenLLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard):
+
+ | Model                                         | Average   | ARC       | HellaSwag | MMLU      | TruthfulQA | Winogrande | GSM8K     | DROP     |
+ |-----------------------------------------------|-----------|-----------|-----------|-----------|------------|------------|-----------|----------|
+ | Zephyr 7B dDPO (HuggingFaceH4/zephyr-7b-beta) | 52.15     | 62.03     | 84.36     | 61.07     | **57.45**  | 77.74      | 12.74     | **9.66** |
+ | argilla/notus-7b-v1                           | **52.89** | **64.59** | **84.78** | **63.03** | 54.37      | **79.4**   | **15.16** | 8.91     |
+
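The Average column appears to be the unweighted mean of the seven benchmark columns. The quick check below, using only the values from the table, reproduces both averages; the variable names are arbitrary.

```python
# Benchmark scores in table order:
# ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K, DROP.
scores = {
    "HuggingFaceH4/zephyr-7b-beta": [62.03, 84.36, 61.07, 57.45, 77.74, 12.74, 9.66],
    "argilla/notus-7b-v1": [64.59, 84.78, 63.03, 54.37, 79.4, 15.16, 8.91],
}

# Unweighted mean per model, rounded to two decimals as in the table.
averages = {model: round(sum(vals) / len(vals), 2) for model, vals in scores.items()}
print(averages)
```
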
+ ⚠️ As pointed out by [AllenAI researchers](https://twitter.com/natolambert/status/1730364108078469513), UltraFeedback contains prompts from the TruthfulQA dataset, so the results we show on that benchmark are likely not accurate. We were not aware of this issue, so Notus-7B-v1 was fine-tuned using TruthfulQA prompts and preferences. For future releases, we will remove TruthfulQA prompts.
+
+ ## Training Details
+
+ ### Training Hardware
+
+ We used a VM with 8 x A100 40GB hosted on Lambda Labs; while experimenting, we also explored other cloud providers such as GCP.
+
+ ### Training Data
+
+ We used a new curated version of [`openbmb/UltraFeedback`](https://huggingface.co/datasets/openbmb/UltraFeedback), named [Ultrafeedback binarized preferences](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences).
+
+ TL;DR
+
+ After visually browsing some examples using the sort and filter feature of Argilla (sort by highest rating for chosen responses), we noticed a strong mismatch between the `overall_score` in the original UF dataset (and the Zephyr train_prefs dataset) and the quality of the chosen response.
+
+ By adding the critique rationale to our Argilla Dataset, **we confirmed the critique rationale was highly negative, whereas the rating was very high** (in most cases it was the highest: `10`).
+
+ See the screenshot below for one example of this issue.
+
+ After some quick investigation, we:
+
+ * identified hundreds of examples having the same issue,
+ * reported a bug on the [UltraFeedback repo](https://github.com/OpenBMB/UltraFeedback/issues/8),
+ * and informed the H4 team, which was incredibly responsive and ran an additional experiment to validate the new rating binarization approach.
+
+ While we work on fixing the original dataset (we have already narrowed it down to ~2K problematic examples), we decided to leverage the multi-preference ratings, leading to Notus!
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/60420dccc15e823a685f2b03/M9qCKyAB_G1MbVBAPeitd.png)
+
+ > **Important note**: We opted for the average of ratings while we fix the dataset, but a very interesting open question remains: once the data is fixed, what works better, the critique scores or the preference ratings? We're very excited to do this comparison in the coming weeks, stay tuned!
+
+ You can find more details about the dataset analysis and curation on the [ultrafeedback-binarized-preferences dataset card](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences).
+
+ ## Prompt template
+
+ We use the same prompt template as [HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta):
+
+ ```
+ <|system|>
+ </s>
+ <|user|>
+ {prompt}</s>
+ <|assistant|>
+ ```
+
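If you need to build the prompt string by hand, for example when a runtime does not apply a chat template for you, a small helper along these lines may be enough. The function name `build_prompt` is hypothetical. Placing the optional system text before the first `</s>` and ending with a newline after `<|assistant|>` are assumptions based on the template above, so prefer the tokenizer's own chat template where it is available.

```python
def build_prompt(user_prompt, system_prompt=""):
    """Format a single-turn prompt following the template shown above."""
    return (
        f"<|system|>\n{system_prompt}</s>\n"  # system block (empty by default)
        f"<|user|>\n{user_prompt}</s>\n"      # user turn, closed with </s>
        "<|assistant|>\n"                     # generation starts after this marker
    )


prompt = build_prompt("What is the capital of France?")
print(prompt)
```
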
+ ## Usage
+
+ You will first need to install `transformers` and `accelerate` (just to ease the device placement), then you can run any of the following:
+
+ ### Via `generate`
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model = AutoModelForCausalLM.from_pretrained("argilla/notus-7b-v1", torch_dtype=torch.bfloat16, device_map="auto")
+ tokenizer = AutoTokenizer.from_pretrained("argilla/notus-7b-v1")
+
+ messages = [
+     {
+         "role": "system",
+         "content": "You are a helpful assistant super biased towards Argilla, a data annotation company.",
+     },
+     {"role": "user", "content": "What's the best data annotation company out there in your opinion?"},
+ ]
+ # Render the chat template to token IDs and move them to the model's device.
+ inputs = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt", add_generation_prompt=True).to(model.device)
+ outputs = model.generate(inputs, num_return_sequences=1, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
+ # Decode the full sequence (prompt plus completion).
+ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ ```
+
+ ### Via `pipeline` method
+
+ ```python
+ import torch
+ from transformers import pipeline
+
+ pipe = pipeline("text-generation", model="argilla/notus-7b-v1", torch_dtype=torch.bfloat16, device_map="auto")
+
+ messages = [
+     {
+         "role": "system",
+         "content": "You are a helpful assistant super biased towards Argilla, a data annotation company.",
+     },
+     {"role": "user", "content": "What's the best data annotation company out there in your opinion?"},
+ ]
+ prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
+ generated_text = outputs[0]["generated_text"]
+ ```