tags:
  - ultrafeedback
license: mit
---
 
<div align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/60420dccc15e823a685f2b03/CuMO3IjJfymC94_5qd15T.png" alt="Image was artificially generated by Dalle-3 via ChatGPT Pro"/>
</div>

# Model Card for Notus 7B v1

Notus is a collection of fine-tuned models using Direct Preference Optimization (DPO) and related RLHF techniques. This model is the first version, fine-tuned with DPO over `zephyr-7b-sft-full`, which is the SFT model produced to create `zephyr-7b-beta`.
 
Following a **data-first** approach, the only difference between Notus-7B-v1 and Zephyr-7B-beta is the preference dataset used for dDPO. In particular, we found data issues in the original UltraFeedback dataset that led to high scores for bad responses. After curating several hundred data points, we decided to binarize the dataset using the preference ratings instead of the original critique `overall_score`.

Using preference ratings instead of critique scores led to a new dataset where the chosen response is different in ~50% of the cases.
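
To make the binarization step concrete, here is a minimal sketch of rating-based pair construction. It is only an illustration: the field names used below (`instruction`, `completions`, `annotations`, `Rating`) are assumptions about the UltraFeedback schema, and the exact preprocessing lives in the dataset repository. For each prompt, the highest-rated completion becomes `chosen` and a lower-rated one (here, the lowest) becomes `rejected`.

```python
# Illustrative sketch only: binarize one UltraFeedback record by mean preference rating.
# Field names (instruction, completions, annotations, Rating) are assumptions about the
# openbmb/UltraFeedback schema, not taken from this model card.
from statistics import mean

def binarize(example: dict) -> dict:
    def avg_rating(completion: dict) -> float:
        # Each completion carries per-aspect annotations with a numeric "Rating"
        return mean(float(a["Rating"]) for a in completion["annotations"].values())

    ranked = sorted(example["completions"], key=avg_rating, reverse=True)
    return {
        "prompt": example["instruction"],
        "chosen": ranked[0]["response"],     # highest mean preference rating
        "rejected": ranked[-1]["response"],  # a lower-rated alternative
    }
```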

This model wouldn't have been possible without the amazing [Alignment Handbook](https://github.com/huggingface/alignment-handbook), and it is based on fruitful discussions with the Hugging Face H4 team. In particular, we used `zephyr-7b-beta`'s recipe, which worked out-of-the-box and enabled us to focus on what we do best: **high-quality data**.

Notus models are intended to be used as assistants via chat-like applications, and are evaluated with Chat (MT-Bench, AlpacaEval) and Academic (Open LLM Leaderboard) benchmarks for a direct comparison with the original Zephyr dDPO model and other 7B models.

## Model Details

## Performance

### Chat benchmarks

Table adapted from Zephyr-7b-β and Starling's original tables for the [MT-Bench](https://huggingface.co/spaces/lmsys/mt-bench) and [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/) benchmarks. Results are shown sorted by AlpacaEval win rate and omit some >7B models for brevity.

Notus stays on par with Zephyr on MT-Bench, while surpassing Zephyr, Claude 2, and Cohere Command on AlpacaEval, making Notus the most competitive 7B commercial model on AlpacaEval.

</table>

## Academic benchmarks

Results from the [OpenLLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard):

| Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K | DROP |
|-------|---------|-----|-----------|------|------------|------------|-------|------|
| Zephyr 7B dDPO (HuggingFaceH4/zephyr-7b-beta) | 52.15 | 62.03 | 84.36 | 61.07 | **57.45** | 77.74 | 12.74 | **9.66** |
| argilla/notus-7b-v1 | **52.89** | **64.59** | **84.78** | **63.03** | 54.37 | **79.4** | **15.16** | 8.91 |

## Training Details

### Training Hardware

We used a VM with 8 x A100 40GB hosted in Lambda Labs, although while experimenting we also explored other cloud providers such as GCP.

### Training Data

We used a new curated version of [`openbmb/UltraFeedback`](https://huggingface.co/datasets/openbmb/UltraFeedback), named [`argilla/ultrafeedback-binarized-preferences`](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences).

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-07
- train_batch_size: 8
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- total_train_batch_size: 64
- total_eval_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 3
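
For orientation, here is a minimal sketch of how the hyperparameters above could map onto a `trl`-style DPO run. This is not the actual training script (that is the Alignment Handbook's `zephyr-7b-beta` recipe): the `DPOTrainer` call assumes the v0.7-era API, the `beta` value and sequence lengths are assumptions, and the preference dataset is assumed to already expose `prompt`/`chosen`/`rejected` text columns.

```python
# Sketch of a DPO run with the hyperparameters listed above (assumptions noted inline).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

sft_model = "alignment-handbook/zephyr-7b-sft-full"  # the SFT starting point

model = AutoModelForCausalLM.from_pretrained(sft_model, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(sft_model)

# Assumes the preference data has already been mapped to prompt/chosen/rejected columns
dataset = load_dataset("argilla/ultrafeedback-binarized-preferences")

# TrainingArguments' default AdamW already uses betas=(0.9, 0.999) and epsilon=1e-08
training_args = TrainingArguments(
    output_dir="notus-7b-v1",
    learning_rate=5e-7,              # learning_rate
    per_device_train_batch_size=8,   # train_batch_size (x 8 GPUs = 64 total)
    per_device_eval_batch_size=4,    # eval_batch_size (x 8 GPUs = 32 total)
    num_train_epochs=3,              # num_epochs
    lr_scheduler_type="linear",      # lr_scheduler_type
    warmup_ratio=0.1,                # lr_scheduler_warmup_ratio
    seed=42,
    bf16=True,                       # mixed precision on the A100s
    logging_steps=10,
)

trainer = DPOTrainer(
    model,
    ref_model=None,                  # when None, trl keeps a frozen copy as the reference
    args=training_args,
    beta=0.1,                        # assumed DPO temperature, not listed above
    train_dataset=dataset["train"],
    tokenizer=tokenizer,
    max_length=1024,                 # assumed sequence lengths
    max_prompt_length=512,
)
trainer.train()
```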

### Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.5051 | 0.1 | 100 | 0.5180 | 0.1475 | -0.3954 | 0.7183 | 0.5429 | -246.6286 | -297.5412 | -2.7438 | -3.0431 |
| 0.4321 | 0.21 | 200 | 0.4375 | 0.1353 | -0.9529 | 0.7540 | 1.0882 | -252.2036 | -297.6632 | -2.7578 | -3.0543 |
| 0.3848 | 0.31 | 300 | 0.4301 | -0.4813 | -1.8921 | 0.7302 | 1.4107 | -261.5956 | -303.8301 | -2.7592 | -3.0508 |
| 0.3777 | 0.42 | 400 | 0.4091 | -0.8597 | -2.5306 | 0.7698 | 1.6709 | -267.9805 | -307.6138 | -2.7476 | -3.0474 |
| 0.3559 | 0.52 | 500 | 0.4332 | -1.0424 | -2.6019 | 0.7619 | 1.5595 | -268.6939 | -309.4406 | -2.2960 | -2.6106 |
| 0.4178 | 0.62 | 600 | 0.3934 | -0.6434 | -2.4837 | 0.7659 | 1.8404 | -267.5121 | -305.4503 | -2.5487 | -2.8508 |
| 0.4206 | 0.73 | 700 | 0.4058 | -1.4700 | -3.5113 | 0.7857 | 2.0413 | -277.7877 | -313.7168 | -2.5679 | -2.8727 |
| 0.4323 | 0.83 | 800 | 0.3929 | -0.9025 | -2.6935 | 0.7897 | 1.7910 | -269.6095 | -308.0414 | -2.6213 | -2.9202 |
| 0.3706 | 0.93 | 900 | 0.3903 | -1.1122 | -3.0257 | 0.8056 | 1.9135 | -272.9316 | -310.1388 | -2.5428 | -2.8416 |
| 0.0496 | 1.04 | 1000 | 0.3991 | -1.4248 | -4.1245 | 0.8016 | 2.6997 | -283.9196 | -313.2651 | -2.5093 | -2.8150 |
| 0.0723 | 1.14 | 1100 | 0.3999 | -1.8789 | -4.5317 | 0.7897 | 2.6528 | -287.9914 | -317.8056 | -2.5170 | -2.8242 |
| 0.0481 | 1.25 | 1200 | 0.4191 | -2.6211 | -5.5294 | 0.7817 | 2.9083 | -297.9687 | -325.2281 | -2.5139 | -2.8109 |
| 0.0432 | 1.35 | 1300 | 0.4070 | -2.0605 | -5.0460 | 0.8056 | 2.9855 | -293.1345 | -319.6214 | -2.5153 | -2.8121 |
| 0.0402 | 1.45 | 1400 | 0.4001 | -2.2445 | -5.0942 | 0.7937 | 2.8497 | -293.6164 | -321.4614 | -2.4383 | -2.7388 |
| 0.0529 | 1.56 | 1500 | 0.4066 | -2.3499 | -5.2468 | 0.8016 | 2.8969 | -295.1426 | -322.5153 | -2.3906 | -2.6963 |
| 0.0651 | 1.66 | 1600 | 0.3962 | -2.0597 | -4.8915 | 0.8016 | 2.8318 | -291.5901 | -319.6136 | -2.3390 | -2.6469 |
| 0.0738 | 1.77 | 1700 | 0.3942 | -1.8893 | -4.6107 | 0.8135 | 2.7214 | -288.7817 | -317.9099 | -2.3532 | -2.6607 |
| 0.0597 | 1.87 | 1800 | 0.3990 | -1.8774 | -4.7221 | 0.8175 | 2.8448 | -289.8961 | -317.7905 | -2.2728 | -2.5908 |
| 0.0686 | 1.97 | 1900 | 0.3924 | -1.8745 | -4.6807 | 0.8056 | 2.8062 | -289.4821 | -317.7617 | -2.2554 | -2.5658 |
| 0.0116 | 2.08 | 2000 | 0.4260 | -2.4687 | -5.7190 | 0.7937 | 3.2503 | -299.8647 | -323.7037 | -2.2297 | -2.5347 |
| 0.0114 | 2.18 | 2100 | 0.4519 | -2.8266 | -6.3706 | 0.7976 | 3.5440 | -306.3802 | -327.2823 | -2.2185 | -2.5219 |
| 0.0073 | 2.28 | 2200 | 0.4563 | -2.9422 | -6.5564 | 0.8016 | 3.6142 | -308.2384 | -328.4384 | -2.2103 | -2.5126 |
| 0.0094 | 2.39 | 2300 | 0.4636 | -3.3246 | -7.0542 | 0.8016 | 3.7296 | -313.2165 | -332.2628 | -2.2059 | -2.5081 |
| 0.0056 | 2.49 | 2400 | 0.4745 | -3.3599 | -7.1652 | 0.7976 | 3.8053 | -314.3266 | -332.6161 | -2.1945 | -2.4943 |
| 0.0052 | 2.6 | 2500 | 0.4812 | -3.4916 | -7.3391 | 0.7976 | 3.8475 | -316.0656 | -333.9322 | -2.1888 | -2.4881 |
| 0.0065 | 2.7 | 2600 | 0.4678 | -3.2226 | -6.9887 | 0.7976 | 3.7661 | -312.5613 | -331.2425 | -2.1644 | -2.4560 |
| 0.0059 | 2.8 | 2700 | 0.4694 | -3.4307 | -7.2484 | 0.7976 | 3.8177 | -315.1584 | -333.3234 | -2.1572 | -2.4483 |
| 0.0054 | 2.91 | 2800 | 0.4707 | -3.4959 | -7.3283 | 0.8056 | 3.8324 | -315.9576 | -333.9758 | -2.1575 | -2.4491 |

### Framework versions

- Transformers 4.35.0
- Pytorch 2.1.1+cu121
- Datasets 2.14.6
- Tokenizers 0.14.1

### Evaluation during Training

- Loss: 0.4730
- Rewards/chosen: -3.5289
- Rewards/rejected: -7.3700
- Rewards/accuracies: 0.8016
- Rewards/margins: 3.8412
- Logps/rejected: -316.3751
- Logps/chosen: -334.3053
- Logits/rejected: -2.1644
- Logits/chosen: -2.4556
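
For context on how to read these metrics, they follow the usual DPO conventions as reported by `trl`: the "rewards" are implicit, beta-scaled log-probability ratios between the fine-tuned policy and the frozen reference model. A sketch of the relevant quantities:

```latex
% DPO loss on a preference pair (x, y_w chosen, y_l rejected); beta is the DPO temperature
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right)

% Reported metrics (sketch):
%   rewards/chosen     = beta * (log pi_theta(y_w|x) - log pi_ref(y_w|x))
%   rewards/rejected   = beta * (log pi_theta(y_l|x) - log pi_ref(y_l|x))
%   rewards/margins    = rewards/chosen - rewards/rejected
%   rewards/accuracies = fraction of pairs where rewards/chosen > rewards/rejected
```

In other words, the growing margins and the ~0.80 accuracy above mean the policy increasingly prefers the chosen responses over the rejected ones relative to the SFT reference, while the `Logps/*` and `Logits/*` columns track the raw log-probabilities and logits on each side.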
 
## Prompt template

We use the same prompt template as [`HuggingFaceH4/zephyr-7b-beta`](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta):

```
<|system|>
</s>
<|user|>
{prompt}</s>
<|assistant|>
```
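
If you want to inspect the exact prompt string the model will see, you can render it from the tokenizer's bundled chat template; this mirrors what the usage examples below do, and the message contents here are just placeholders:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("argilla/notus-7b-v1")

messages = [
    {"role": "system", "content": ""},  # the template above shows an empty system turn
    {"role": "user", "content": "What is Notus 7B?"},
]
# tokenize=False returns the formatted string instead of token ids
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```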

## Usage

You will first need to install `transformers` and `accelerate` (just to ease the device placement), and then you can run any of the following:

### Via `generate`

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("argilla/notus-7b-v1", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("argilla/notus-7b-v1")

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant super biased towards Argilla, a data annotation company.",
    },
    {"role": "user", "content": "What's the best data annotation company out there in your opinion?"},
]
inputs = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt", add_special_tokens=False, add_generation_prompt=True)
outputs = model.generate(inputs, num_return_sequences=1, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
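
Note that `outputs[0]` still contains the prompt tokens, so if you only want the assistant's reply you can decode just the newly generated part, e.g. `tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)`.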

### Via `pipeline` method

```python
import torch
from transformers import pipeline

pipe = pipeline("text-generation", model="argilla/notus-7b-v1", torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant super biased towards Argilla, a data annotation company.",
    },
    {"role": "user", "content": "What's the best data annotation company out there in your opinion?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
generated_text = outputs[0]["generated_text"]
```
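
Keep in mind that the `pipeline` output's `generated_text` includes the rendered prompt as well; pass `return_full_text=False` when calling `pipe(...)` if you only want the completion.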