Commit cfa4403 · Update README.md
Parent(s): 89f594b

README.md CHANGED
@@ -22,18 +22,19 @@ license: apache-2.0
  22 | <img src="https://cdn-uploads.huggingface.co/production/uploads/60f0608166e5701b80ed3f02/LU-vKiC0R7UxxITrwE1F_.png" alt="Image was artificially generated by Dalle-3 via ChatGPT Pro"/>
  23 | </div>
  24 |
- 25 | Notus is
- 26 |
- 27 |
- 28 | are
- 29 | with
  30 |
  31 | ## Model Details
  32 |
  33 | ### Model Description
  34 |
- 35 | - **Developed by:** Argilla
- 36 | - **Shared by:** Argilla
  37 | - **Model type:** GPT-like 7B model DPO fine-tuned
  38 | - **Language(s) (NLP):** Mainly English
  39 | - **License:** Apache 2.0 (same as Zephyr 7B SFT and Mistral 7B v0.1)
@@ -45,30 +46,38 @@ with Zephyr fine-tuned models also using DPO.
  45 | - **Paper:** N/A
  46 | - **Demo:** https://argilla-notus-chat-ui.hf.space/
  47 |
- 48 |
  49 |
- 50 |
  51 |
- 52 |
  53 |
- 54 |
  55 |
- 56 |
  57 |
- 58 | From a first evaluation on the benchmark, we could see that Notus 7B DPO **slightly improved** compared to Zephyr 7B Beta/Alpha and Mistral 7B as we see from the average metric of 7 tasks from the leaderboard.
  59 |
- 60 | | Model | Average ⬆️ | ARC (25-s) ⬆️ | HellaSwag (10-s) ⬆️ | MMLU (5-s) ⬆️ | TruthfulQA (MC2) (0-s) ⬇️ | Winogrande (5-s) ⬇️ | GSM8K (5-s) ⬆️ | DROP (3-s) ⬇️ |
- 61 | | --- | --- | --- | --- | --- | --- | --- | --- | --- |
- 62 | | [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) | 50.32 | 59.58 | 83.31 | 64.16 | 42.15 | 78.37 | 18.12 | 6.14 |
- 63 | | [HuggingFaceH4/zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha) | 52.4 | 61.01 | 84.04 | 61.39 | 57.9 | 78.61 | 14.03 | 9.82 |
- 64 | | [HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) | 52.15 | 62.03 | 84.36 | 61.07 | 57.45 | 77.74 | 12.74 | 9.66 |
- 65 | | **Ours** | **54.09** | 64.25 | 84.90 | 61.69 | 52.77 | 74.51 | 39.5 | 0.98 |
- 66 |
- 67 | Anyway, we will also add our model to the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) queue to be evaluated on Hugging Face's end to ensure that the produced results match the same ones, as we found some inconsistencies for DROP using the `big-refactor` branch on `lm-eval-harness`.
- 68 |
- 69 | ### MT Bench (Coming soon!)
- 70 |
- 71 | ### Alpaca Eval (Coming soon!)
  72 |
  73 | ## Training Details
  74 |
@@ -78,7 +87,7 @@ We used a VM with 8 x A100 40GB hosted in Lambda Labs.
  78 |
  79 | ### Training Data
  80 |
- 81 | We used a
  82 |
  83 | ### Training hyperparameters
  84 |
  22 | <img src="https://cdn-uploads.huggingface.co/production/uploads/60f0608166e5701b80ed3f02/LU-vKiC0R7UxxITrwE1F_.png" alt="Image was artificially generated by Dalle-3 via ChatGPT Pro"/>
  23 | </div>
  24 |
+ 25 | Notus is a collection of models fine-tuned using Direct Preference Optimization (DPO) and related RLHF techniques. This model is version 1, fine-tuned with DPO starting from zephyr-7b-beta's SFT model. Following a **data-first** approach, the only difference between Notus-7B-v1 and Zephyr-7B-beta is the preference dataset used for dDPO. In particular, we found data issues in the original UltraFeedback dataset that led to high scores for bad responses. After curating several hundred data points, we decided to binarize the dataset using the preference ratings instead of the original critique `overall_score`.
+ 26 | Using preference ratings instead of critique scores led to a new dataset where the chosen response differs in ~50% of the cases.
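A rating-based binarization of this kind can be sketched roughly as follows. This is an illustrative outline only: the UltraFeedback field names used here (`completions`, `annotations`, `Rating`, `response`) are assumptions about the dataset schema, not something specified in this card.

```python
# Illustrative sketch: binarize a preference dataset by average rating instead
# of the critique's overall_score. Field names are assumed, not guaranteed.
from datasets import load_dataset


def avg_rating(completion: dict) -> float:
    """Average the per-aspect preference ratings of one completion."""
    ratings = [
        float(aspect["Rating"])
        for aspect in completion["annotations"].values()
        if aspect.get("Rating", "N/A") != "N/A"
    ]
    return sum(ratings) / len(ratings) if ratings else 0.0


def binarize(example: dict) -> dict:
    """Pick the best-rated completion as 'chosen' and the worst as 'rejected'."""
    ranked = sorted(example["completions"], key=avg_rating, reverse=True)
    return {
        "prompt": example["instruction"],
        "chosen": ranked[0]["response"],
        "rejected": ranked[-1]["response"],
    }


if __name__ == "__main__":
    ds = load_dataset("openbmb/UltraFeedback", split="train")
    dpo_ds = ds.map(binarize, remove_columns=ds.column_names)
    print(dpo_ds)
```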
+ 27 | This model wouldn't have been possible without the amazing [Alignment Handbook](https://github.com/huggingface/alignment-handbook/tree/main/recipes/zephyr-7b-beta) and the fruitful discussions with the H4 team. In particular, we used zephyr-7b-beta's recipe, which worked out-of-the-box and let us focus on what we do best: **high-quality data**.
+ 28 | Notus models are intended to be used as assistants via chat-like applications, and
+ 29 | are evaluated with Chat (MT-Bench, AlpacaEval) and Academic (Open LLM Leaderboard) benchmarks, for a direct comparison
+ 30 | with the original Zephyr dDPO model.
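A minimal chat-style inference sketch with `transformers` follows; the system prompt, sampling settings, and dtype are illustrative choices, not values taken from this card.

```python
# Minimal sketch of chat-style inference with the model's own chat template.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="argilla/notus-7b-v1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain DPO in one paragraph."},
]

# Render the conversation with the tokenizer's chat template before generating.
prompt = generator.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
output = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)
print(output[0]["generated_text"])
```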
  31 |
  32 | ## Model Details
  33 |
  34 | ### Model Description
  35 |
+ 36 | - **Developed by:** Argilla (based on HuggingFace H4 and MistralAI previous efforts and amazing work)
+ 37 | - **Shared by:** Argilla
  38 | - **Model type:** GPT-like 7B model DPO fine-tuned
  39 | - **Language(s) (NLP):** Mainly English
  40 | - **License:** Apache 2.0 (same as Zephyr 7B SFT and Mistral 7B v0.1)

  46 | - **Paper:** N/A
  47 | - **Demo:** https://argilla-notus-chat-ui.hf.space/
  48 |
+ 49 | ## Performance
  50 |
+ 51 | ### Chat benchmarks
+ 52 | This shows the updated table, based on the original Zephyr-7b-β table, for the [MT-Bench](https://huggingface.co/spaces/lmsys/mt-bench) and [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/) benchmarks:
  53 |
+ 54 | | Model | Size | Alignment | MT-Bench (score) | AlpacaEval (win rate %) |
+ 55 | |-------------|-----|----|---------------|--------------|
+ 56 | | StableLM-Tuned-α | 7B | dSFT | 2.75 | - |
+ 57 | | MPT-Chat | 7B | dSFT | 5.42 | - |
+ 58 | | Xwin-LM v0.1 | 7B | dPPO | 6.19 | 87.83 |
+ 59 | | Mistral-Instruct v0.1 | 7B | - | 6.84 | - |
+ 60 | | Zephyr-7b-α | 7B | dDPO | 6.88 | - |
+ 61 | | Zephyr-7b-β 🪁 | **7B** | **dDPO** | **7.34** | 90.60 |
+ 62 | | **Notus-7b-v1** 🪁 | **7B** | **dDPO** | 7.30 | **91.42** |
+ 63 | | Falcon-Instruct | 40B | dSFT | 5.17 | 45.71 |
+ 64 | | Guanaco | 65B | SFT | 6.41 | 71.80 |
+ 65 | | Llama2-Chat | 70B | RLHF | 6.86 | 92.66 |
+ 66 | | Vicuna v1.3 | 33B | dSFT | 7.12 | 88.99 |
+ 67 | | WizardLM v1.0 | 70B | dSFT | 7.71 | - |
+ 68 | | Xwin-LM v0.1 | 70B | dPPO | - | 95.57 |
+ 69 | | GPT-3.5-turbo | - | RLHF | 7.94 | 89.37 |
+ 70 | | Claude 2 | - | RLHF | 8.06 | 91.36 |
+ 71 | | GPT-4 | - | RLHF | 8.99 | 95.28 |
  72 |
+ 73 | ### Academic benchmarks
  74 |
+ 75 | | Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K | DROP |
+ 76 | |-----------------------------------------------|---------|-------|-----------|-------|------------|------------|-------|-------|
+ 77 | | Zephyr 7B dDPO (HuggingFaceH4/zephyr-7b-beta) | 52.15 | 62.03 | 84.36 | 61.07 | **57.45** | 77.74 | 12.74 | **9.66** |
+ 78 | | argilla/notus-7b-v1 | **52.89** | **64.59** | **84.78** | **63.03** | 54.37 | **79.4** | **15.16** | 8.91 |
  79 |
  80 |
  81 |
  82 | ## Training Details
  83 |

  87 |
  88 | ### Training Data
  89 |
+ 90 | We used a new curated version of [`openbmb/UltraFeedback`](https://huggingface.co/datasets/openbmb/UltraFeedback), named [`argilla/ultrafeedback-binarized-avg-rating-for-dpo`](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-avg-rating-for-dpo).
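As a quick sketch, the curated dataset can be pulled straight from the Hub with `datasets`; the split name and column layout below are assumptions to verify against the dataset card.

```python
# Sketch: load the curated DPO dataset from the Hugging Face Hub.
# The split name and the prompt/chosen/rejected column layout are assumptions,
# not confirmed by this model card.
from datasets import load_dataset

dpo_data = load_dataset(
    "argilla/ultrafeedback-binarized-avg-rating-for-dpo", split="train"
)
print(dpo_data)            # inspect features and number of rows
print(dpo_data[0].keys())  # DPO-style data typically exposes prompt/chosen/rejected fields
```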
  91 |
  92 | ### Training hyperparameters
  93 |