Commit 3455ea2
Parent(s): 20b81a6
Update README.md

README.md CHANGED
@@ -3,7 +3,7 @@ model-index:
 - name: notus-7b-v1
   results: []
 datasets:
-- argilla/ultrafeedback-binarized-
+- argilla/ultrafeedback-binarized-preferences
 language:
 - en
 base_model: alignment-handbook/zephyr-7b-sft-full

@@ -20,15 +20,12 @@ license: mit
 </div>
 
 # Model Card for Notus 7B v1
-
-
-
 Notus is a collection of fine-tuned models using Direct Preference Optimization (DPO) and related RLHF techniques. This model is version 1, fine-tuned with DPO starting from zephyr-7b-beta's SFT model.
 
 Following a **data-first** approach, the only difference between Notus-7B-v1 and Zephyr-7B-beta is the preference dataset used for dDPO. In particular, we found data issues in the original UltraFeedback dataset that led to high scores for bad responses. After curating several hundred data points, we decided to binarize the dataset using the preference ratings instead of the original critique `overall_score`.
 Using preference ratings instead of critique scores led to a new dataset where the chosen response is different in ~50% of the cases.
 
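The ratings-based binarization described above can be sketched roughly as follows. This is an illustrative reconstruction, not Argilla's actual curation code: the field names (`instruction`, `completions`, `rating`, `response`) are assumptions about the UltraFeedback-style schema and may need adapting.

```python
# Illustrative sketch only: binarize UltraFeedback-style records using the
# per-response preference ratings instead of the critique `overall_score`.
# Field names ("instruction", "completions", "rating", "response") are assumed.
import random

from datasets import load_dataset

def binarize(example):
    # Rank the candidate responses by their preference rating (highest first).
    ranked = sorted(example["completions"], key=lambda c: c["rating"], reverse=True)
    chosen = ranked[0]                    # best-rated response becomes "chosen"
    rejected = random.choice(ranked[1:])  # one of the lower-rated responses becomes "rejected"
    return {
        "prompt": example["instruction"],
        "chosen": chosen["response"],
        "rejected": rejected["response"],
    }

raw = load_dataset("openbmb/UltraFeedback", split="train")
raw = raw.filter(lambda ex: len(ex["completions"]) >= 2)  # need at least two candidates
binarized = raw.map(binarize, remove_columns=raw.column_names)
```

Comparing the `chosen` column produced this way against an `overall_score`-based binarization is how the ~50% flip rate mentioned above can be measured.
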
-This model wouldn't have been possible without the amazing [Alignment Handbook](https://github.com/huggingface/alignment-handbook/tree/main/recipes/zephyr-7b-beta) and it's based on fruitful discussions with the H4 team. In particular, we used zephyr-7b-beta's recipe, which worked out-of-the-box and
+This model wouldn't have been possible without the amazing [Alignment Handbook](https://github.com/huggingface/alignment-handbook/tree/main/recipes/zephyr-7b-beta) and it's based on fruitful discussions with the H4 team. In particular, we used zephyr-7b-beta's recipe, which worked out-of-the-box and enabled us to focus on what we do best: **high-quality data**.
 
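As a rough illustration of that recipe, a dDPO run on top of the SFT checkpoint could look like the sketch below, using `trl`'s `DPOTrainer` (API as in late-2023 releases). The hyperparameters, output path, and column mapping are placeholders, not the actual Alignment Handbook configuration.

```python
# Rough sketch of DPO fine-tuning on top of the Zephyr SFT checkpoint with trl.
# Hyperparameters are placeholders; the real run follows the Alignment Handbook recipe.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base = "alignment-handbook/zephyr-7b-sft-full"
model = AutoModelForCausalLM.from_pretrained(base)
ref_model = AutoModelForCausalLM.from_pretrained(base)  # frozen reference policy
tokenizer = AutoTokenizer.from_pretrained(base)

# DPOTrainer expects "prompt", "chosen" and "rejected" columns;
# rename/map the dataset first if its columns are named differently.
train_dataset = load_dataset("argilla/ultrafeedback-binarized-preferences", split="train")

args = TrainingArguments(
    output_dir="notus-7b-v1-dpo",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    learning_rate=5e-7,
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=args,
    beta=0.1,  # strength of the KL penalty against the reference model
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_length=2048,
    max_prompt_length=1024,
)
trainer.train()
```
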
 Notus models are intended to be used as assistants via chat-like applications, and
 are evaluated with Chat (MT-Bench, AlpacaEval) and Academic (Open LLM Leaderboard) benchmarks for a direct comparison
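For that chat-style use, a minimal inference sketch might look like the following; it assumes the `argilla/notus-7b-v1` checkpoint and the standard `transformers` chat template, and the generation settings are illustrative rather than the model card's official snippet.

```python
# Minimal chat-style usage sketch (settings are illustrative).
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="argilla/notus-7b-v1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain DPO in two sentences."},
]

# The tokenizer's chat template formats the messages into a single prompt string.
prompt = generator.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
output = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
print(output[0]["generated_text"])
```
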
@@ -54,25 +51,108 @@ with the original Zephyr dDPO model and other 7B models.
 ## Performance
 
 ### Chat benchmarks
-Table adapted from Zephyr-7b-β original
+Table adapted from Zephyr-7b-β's and Starling's original tables for the [MT-Bench](https://huggingface.co/spaces/lmsys/mt-bench) and [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/) benchmarks. Results are sorted by AlpacaEval win rate and omit some >7B models for brevity.
+
+Notus stays on par with Zephyr on MT-Bench, while surpassing Zephyr, Claude 2, and Cohere Command on AlpacaEval, making Notus the most competitive 7B commercial model on AlpacaEval.
+
+| Model | Size | Alignment | MT-Bench (score) | AlpacaEval (win rate %) | License |
+|---|---|---|---|---|---|
+| GPT-4-turbo | - | ? | 9.32 | 97.70 | Proprietary |
+| XwinLM 70b V0.1 | 70B | dPPO | - | 95.57 | LLaMA 2 License |
+| GPT-4 | - | RLHF | 8.99 | 95.03 | Proprietary |
+| Tulu 2+DPO 70B V0.1 | 70B | dDPO | 6.29 | 95.28 | Proprietary |
+| LLaMA2 Chat 70B | 70B | RLHF | 6.86 | 92.66 | LLaMA 2 License |
+| Starling-7B | 7B | C-RLFT + APA | **8.09** | **91.99** | CC-BY-NC-4.0 |
+| **Notus-7b-v1** | 7B | dDPO | 7.30 | 91.42 | MIT |
+| Claude 2 | - | RLHF | 8.06 | 91.36 | Proprietary |
+| Zephyr-7b-β | 7B | dDPO | 7.34 | 90.60 | MIT |
+| Cohere Command | - | RLHF | - | 90.62 | Proprietary |
+| GPT-3.5-turbo | - | RLHF | 7.94 | 89.37 | Proprietary |
 
 ## Academic benchmarks