dvilasuero (HF staff) committed on
Commit
cfa4403
1 Parent(s): 89f594b

Update README.md

Files changed (1)
  1. README.md +35 -26
README.md CHANGED
@@ -22,18 +22,19 @@ license: apache-2.0
  <img src="https://cdn-uploads.huggingface.co/production/uploads/60f0608166e5701b80ed3f02/LU-vKiC0R7UxxITrwE1F_.png" alt="Image was artificially generated by Dalle-3 via ChatGPT Pro"/>
  </div>

- Notus is going to be a collection of fine-tuned models using DPO, similarly to Zephyr, but mainly focused
- on the Direct Preference Optimization (DPO) step, aiming to incorporate preference feedback into the LLMs
- when fine-tuning those. Notus models are intended to be used as assistants via chat-like applications, and
- are evaluated with the MT-Bench, AlpacaEval, and LM Evaluation Harness benchmarks, to be directly compared
- with Zephyr fine-tuned models also using DPO.

  ## Model Details

  ### Model Description

- - **Developed by:** Argilla, Inc. (based on HuggingFace H4 and MistralAI previous efforts and amazing work)
- - **Shared by:** Argilla, Inc.
  - **Model type:** GPT-like 7B model DPO fine-tuned
  - **Language(s) (NLP):** Mainly English
  - **License:** Apache 2.0 (same as Zephyr 7B SFT and Mistral 7B v0.1)
@@ -45,30 +46,38 @@ with Zephyr fine-tuned models also using DPO.
  - **Paper:** N/A
  - **Demo:** https://argilla-notus-chat-ui.hf.space/

- ### Model Date

- Notus 7B v1 was trained along November, 2023. And the data as generated by GPT-4 without the usage of external resources, has a cutoff at September, 2021.

- ## Evaluation

- ### LM Eval Harness

- We ran the evaluation using [`EleutherAI/lm-eval-harness`](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor) from the `big-refactor` branch, aiming to mimic the [Open LLM Leaderboard by HuggingFace H4](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), but running everything on our VMs instead, as we're still experimenting.

- From a first evaluation on the benchmark, we could see that Notus 7B DPO **slightly improved** compared to Zephyr 7B Beta/Alpha and Mistral 7B as we see from the average metric of 7 tasks from the leaderboard.

- | Model | Average ⬆️ | ARC (25-s) ⬆️ | HellaSwag (10-s) ⬆️ | MMLU (5-s) ⬆️ | TruthfulQA (MC2) (0-s) ⬇️ | Winogrande (5-s) ⬇️ | GSM8K (5-s) ⬆️ | DROP (3-s) ⬇️ |
- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
- |[mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) | 50.32 | 59.58 | 83.31 | 64.16 | 42.15 | 78.37 | 18.12 | 6.14 |
- |[HuggingFaceH4/zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha) | 52.4 | 61.01 | 84.04 | 61.39 | 57.9 | 78.61 | 14.03 | 9.82 |
- |[HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) | 52.15 | 62.03 | 84.36 | 61.07 | 57.45 | 77.74 | 12.74 | 9.66 |
- | **Ours** | **54.09** | 64.25 | 84.90 | 61.69 | 52.77 | 74.51 | 39.5 | 0.98 |
-
- Anyway, we will also add our model to the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) queue to be evaluated on Hugging Face's end to ensure that the produced results match the same ones, as we found some inconsistencies for DROP using the `big-refactor` branch on `lm-eval-harness`.
-
- ### MT Bench (Coming soon!)
-
- ### Alpaca Eval (Coming soon!)

  ## Training Details

@@ -78,7 +87,7 @@ We used a VM with 8 x A100 40GB hosted in Lambda Labs.

  ### Training Data

- We used a slightly curated version of [`openbmb/UltraFeedback`](https://huggingface.co/datasets/openbmb/UltraFeedback), named [`argilla/ultrafeedback-binarized-avg-rating-for-dpo`](https://huggingface.co/argilla/ultrafeedback-binarized-avg-rating-for-dpo).

  ### Training hyperparameters

  <img src="https://cdn-uploads.huggingface.co/production/uploads/60f0608166e5701b80ed3f02/LU-vKiC0R7UxxITrwE1F_.png" alt="Image was artificially generated by Dalle-3 via ChatGPT Pro"/>
  </div>

+ Notus is a collection of fine-tuned models using Direct Preference Optimization (DPO) and related RLHF techniques. This model is version 1, fine-tuned with DPO starting from zephyr-7b-beta's SFT model. Following a **data-first** approach, the only difference between Notus-7B-v1 and Zephyr-7B-beta is the preference dataset used for dDPO. In particular, we found data issues in the original UltraFeedback dataset that led to high scores for bad responses. After curating several hundred data points, we decided to binarize the dataset using the preference ratings instead of the original critique `overall_score`.
+ Using preference ratings instead of critique scores led to a new dataset where the chosen response differs in ~50% of the cases.
+ This model wouldn't have been possible without the amazing [Alignment Handbook](https://github.com/huggingface/alignment-handbook/tree/main/recipes/zephyr-7b-beta), and it builds on fruitful discussions with the H4 team. In particular, we used zephyr-7b-beta's recipe, which worked out of the box and let us focus on what we do best: **high-quality data**.
+ Notus models are intended to be used as assistants via chat-like applications, and
+ are evaluated with Chat (MT-Bench, AlpacaEval) and Academic (Open LLM Leaderboard) benchmarks, for a direct comparison
+ with the original Zephyr dDPO model.
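To make the binarization described above concrete, here is a minimal, hedged sketch of the idea: average the per-aspect preference ratings and keep the top-rated completion as `chosen`. The field names (`completions`, `annotations`, `Rating`, `instruction`) are assumptions about the `openbmb/UltraFeedback` schema, and picking a random lower-rated completion as `rejected` is just one reasonable choice; this is not the exact script used to build the released dataset.

```python
# Rough sketch of binarizing UltraFeedback by averaged preference ratings
# (not the exact curation script; schema field names are assumptions).
import random
from datasets import load_dataset

def avg_rating(completion: dict) -> float:
    """Average the per-aspect preference ratings (helpfulness, honesty, ...)."""
    ratings = [
        float(aspect["Rating"])
        for aspect in completion["annotations"].values()
        if aspect.get("Rating", "N/A") != "N/A"
    ]
    return sum(ratings) / len(ratings) if ratings else 0.0

def binarize(example: dict) -> dict:
    """Keep the top-rated completion as 'chosen' and a random other one as 'rejected'."""
    ranked = sorted(example["completions"], key=avg_rating, reverse=True)
    return {
        "prompt": example["instruction"],
        "chosen": ranked[0]["response"],
        "rejected": random.choice(ranked[1:])["response"],
    }

ds = load_dataset("openbmb/UltraFeedback", split="train")
ds = ds.filter(lambda ex: len(ex["completions"]) > 1)  # need at least two completions per prompt
dpo_ds = ds.map(binarize, remove_columns=ds.column_names)
```

Binarizing on the averaged ratings rather than the critique `overall_score` is what changes the chosen response in roughly half of the pairs, as noted above.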

  ## Model Details

  ### Model Description

+ - **Developed by:** Argilla (based on HuggingFace H4 and MistralAI previous efforts and amazing work)
+ - **Shared by:** Argilla
  - **Model type:** GPT-like 7B model DPO fine-tuned
  - **Language(s) (NLP):** Mainly English
  - **License:** Apache 2.0 (same as Zephyr 7B SFT and Mistral 7B v0.1)
 
  - **Paper:** N/A
  - **Demo:** https://argilla-notus-chat-ui.hf.space/

+ ## Performance

+ ### Chat benchmarks
+ This table updates Zephyr-7b-β's original results table for the [MT-Bench](https://huggingface.co/spaces/lmsys/mt-bench) and [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/) benchmarks:

+ | Model | Size | Alignment | MT-Bench (score) | AlpacaEval (win rate %) |
+ |------------------------|--------|-----------|------------------|-------------------------|
+ | StableLM-Tuned-α       | 7B     | dSFT      | 2.75             | -                       |
+ | MPT-Chat               | 7B     | dSFT      | 5.42             | -                       |
+ | Xwin-LM v0.1           | 7B     | dPPO      | 6.19             | 87.83                   |
+ | Mistral-Instruct v0.1  | 7B     | -         | 6.84             | -                       |
+ | Zephyr-7b-α            | 7B     | dDPO      | 6.88             | -                       |
+ | Zephyr-7b-β 🪁         | **7B** | **dDPO**  | **7.34**         | 90.60                   |
+ | **Notus-7b-v1** 🪁     | **7B** | **dDPO**  | 7.30             | **91.42**               |
+ | Falcon-Instruct        | 40B    | dSFT      | 5.17             | 45.71                   |
+ | Guanaco                | 65B    | SFT       | 6.41             | 71.80                   |
+ | Llama2-Chat            | 70B    | RLHF      | 6.86             | 92.66                   |
+ | Vicuna v1.3            | 33B    | dSFT      | 7.12             | 88.99                   |
+ | WizardLM v1.0          | 70B    | dSFT      | 7.71             | -                       |
+ | Xwin-LM v0.1           | 70B    | dPPO      | -                | 95.57                   |
+ | GPT-3.5-turbo          | -      | RLHF      | 7.94             | 89.37                   |
+ | Claude 2               | -      | RLHF      | 8.06             | 91.36                   |
+ | GPT-4                  | -      | RLHF      | 8.99             | 95.28                   |
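Since Notus is intended to be used as an assistant through chat-like applications, a minimal generation sketch with `transformers` may help. It assumes the tokenizer ships a Zephyr-style chat template; the system prompt and sampling settings are illustrative, not recommended values.

```python
# Minimal chat-style generation sketch; assumes a Zephyr-style chat template is available.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="argilla/notus-7b-v1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain DPO in two sentences."},
]
# Build the prompt from the tokenizer's chat template before generating.
prompt = generator.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
output = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)
print(output[0]["generated_text"])
```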

+ ### Academic benchmarks

+ | Model                                          | Average   | ARC       | HellaSwag | MMLU      | TruthfulQA | Winogrande | GSM8K     | DROP     |
+ |------------------------------------------------|-----------|-----------|-----------|-----------|------------|------------|-----------|----------|
+ | Zephyr 7B dDPO (HuggingFaceH4/zephyr-7b-beta)  | 52.15     | 62.03     | 84.36     | 61.07     | **57.45**  | 77.74      | 12.74     | **9.66** |
+ | argilla/notus-7b-v1                            | **52.89** | **64.59** | **84.78** | **63.03** | 54.37      | **79.4**   | **15.16** | 8.91     |
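The previous revision of this card notes that these scores were produced with [`EleutherAI/lm-eval-harness`](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor) (`big-refactor` branch), mimicking the Open LLM Leaderboard setup. The sketch below shows how such a run could look with that harness's Python API; the task names, few-shot counts, and arguments are assumptions about that branch and not necessarily the exact setup used for the table above.

```python
# Hedged sketch of reproducing the academic benchmarks with lm-evaluation-harness
# (big-refactor style Python API). Few-shot counts follow the Open LLM Leaderboard convention.
import lm_eval

TASK_FEWSHOT = {
    "arc_challenge": 25,
    "hellaswag": 10,
    "mmlu": 5,
    "truthfulqa_mc2": 0,
    "winogrande": 5,
    "gsm8k": 5,
    "drop": 3,
}

for task, shots in TASK_FEWSHOT.items():
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=argilla/notus-7b-v1,dtype=bfloat16",
        tasks=[task],
        num_fewshot=shots,
        batch_size=8,
    )
    print(task, results["results"][task])
```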

  ## Training Details

  ### Training Data

+ We used a new curated version of [`openbmb/UltraFeedback`](https://huggingface.co/datasets/openbmb/UltraFeedback), named [`argilla/ultrafeedback-binarized-avg-rating-for-dpo`](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-avg-rating-for-dpo).
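For reference, the preference data can be pulled straight from the Hub with `datasets`. Inspecting the columns first is safest, since the exact DPO-ready field names (e.g. `prompt`/`chosen`/`rejected`) and split names are assumptions here rather than a documented schema.

```python
# Load the curated preference dataset used for DPO fine-tuning and inspect it.
from datasets import load_dataset

dataset = load_dataset("argilla/ultrafeedback-binarized-avg-rating-for-dpo")
print(dataset)                         # available splits and row counts
print(dataset["train"].column_names)   # assumes a "train" split; check actual field names before training
```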

  ### Training hyperparameters