dvilasuero (HF staff) committed on
Commit
cfa4403
1 Parent(s): 89f594b

Update README.md

Files changed (1)
  1. README.md +35 -26
README.md CHANGED
@@ -22,18 +22,19 @@ license: apache-2.0
  <img src="https://cdn-uploads.huggingface.co/production/uploads/60f0608166e5701b80ed3f02/LU-vKiC0R7UxxITrwE1F_.png" alt="Image was artificially generated by Dalle-3 via ChatGPT Pro"/>
  </div>

- Notus is going to be a collection of fine-tuned models using DPO, similarly to Zephyr, but mainly focused
- on the Direct Preference Optimization (DPO) step, aiming to incorporate preference feedback into the LLMs
- when fine-tuning those. Notus models are intended to be used as assistants via chat-like applications, and
- are evaluated with the MT-Bench, AlpacaEval, and LM Evaluation Harness benchmarks, to be directly compared
- with Zephyr fine-tuned models also using DPO.

  ## Model Details

  ### Model Description

- - **Developed by:** Argilla, Inc. (based on HuggingFace H4 and MistralAI previous efforts and amazing work)
- - **Shared by:** Argilla, Inc.
  - **Model type:** GPT-like 7B model DPO fine-tuned
  - **Language(s) (NLP):** Mainly English
  - **License:** Apache 2.0 (same as Zephyr 7B SFT and Mistral 7B v0.1)
@@ -45,30 +46,38 @@ with Zephyr fine-tuned models also using DPO.
  - **Paper:** N/A
  - **Demo:** https://argilla-notus-chat-ui.hf.space/

- ### Model Date

- Notus 7B v1 was trained along November, 2023. And the data as generated by GPT-4 without the usage of external resources, has a cutoff at September, 2021.

- ## Evaluation

- ### LM Eval Harness

- We ran the evaluation using [`EleutherAI/lm-eval-harness`](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor) from the `big-refactor` branch, aiming to mimic the [Open LLM Leaderboard by HuggingFace H4](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), but running everything on our VMs instead, as we're still experimenting.

- From a first evaluation on the benchmark, we could see that Notus 7B DPO **slightly improved** compared to Zephyr 7B Beta/Alpha and Mistral 7B as we see from the average metric of 7 tasks from the leaderboard.

- | Model | Average ⬆️ | ARC (25-s) ⬆️ | HellaSwag (10-s) ⬆️ | MMLU (5-s) ⬆️ | TruthfulQA (MC2) (0-s) ⬇️ | Winogrande (5-s) ⬇️ | GSM8K (5-s) ⬆️ | DROP (3-s) ⬇️ |
- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
- |[mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) | 50.32 | 59.58 | 83.31 | 64.16 | 42.15 | 78.37 | 18.12 | 6.14 |
- |[HuggingFaceH4/zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha) | 52.4 | 61.01 | 84.04 | 61.39 | 57.9 | 78.61 | 14.03 | 9.82 |
- |[HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) | 52.15 | 62.03 | 84.36 | 61.07 | 57.45 | 77.74 | 12.74 | 9.66 |
- | **Ours** | **54.09** | 64.25 | 84.90 | 61.69 | 52.77 | 74.51 | 39.5 | 0.98 |
-
- Anyway, we will also add our model to the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) queue to be evaluated on Hugging Face's end to ensure that the produced results match the same ones, as we found some inconsistencies for DROP using the `big-refactor` branch on `lm-eval-harness`.
-
- ### MT Bench (Coming soon!)
-
- ### Alpaca Eval (Coming soon!)

  ## Training Details

@@ -78,7 +87,7 @@ We used a VM with 8 x A100 40GB hosted in Lambda Labs.

  ### Training Data

- We used a slightly curated version of [`openbmb/UltraFeedback`](https://huggingface.co/datasets/openbmb/UltraFeedback), named [`argilla/ultrafeedback-binarized-avg-rating-for-dpo`](https://huggingface.co/argilla/ultrafeedback-binarized-avg-rating-for-dpo).

  ### Training hyperparameters

  <img src="https://cdn-uploads.huggingface.co/production/uploads/60f0608166e5701b80ed3f02/LU-vKiC0R7UxxITrwE1F_.png" alt="Image was artificially generated by Dalle-3 via ChatGPT Pro"/>
  </div>

+ Notus is a collection of fine-tuned models using Direct Preference Optimization (DPO) and related RLHF techniques. This model is version 1, fine-tuned with DPO starting from zephyr-7b-beta's SFT model. Following a **data-first** approach, the only difference between Notus-7B-v1 and Zephyr-7B-beta is the preference dataset used for dDPO. In particular, we found data issues in the original UltraFeedback dataset that led to high scores for bad responses. After curating several hundred data points, we decided to binarize the dataset using the preference ratings instead of the original critique `overall_score`.
+ Using preference ratings instead of critique scores led to a new dataset where the chosen response differs in ~50% of the cases.
+ This model wouldn't have been possible without the amazing [Alignment Handbook](https://github.com/huggingface/alignment-handbook/tree/main/recipes/zephyr-7b-beta), and it builds on fruitful discussions with the H4 team. In particular, we used zephyr-7b-beta's recipe, which worked out of the box and let us focus on what we do best: **high-quality data**.
+ Notus models are intended to be used as assistants via chat-like applications, and
+ are evaluated with Chat (MT-Bench, AlpacaEval) and Academic (Open LLM Leaderboard) benchmarks, for a direct comparison
+ with the original Zephyr dDPO model.
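To make the binarization described above concrete, here is a minimal, hedged sketch of the idea: average the per-aspect preference ratings and keep the top-rated completion as `chosen`. The field names (`completions`, `annotations`, `Rating`, `instruction`) are assumptions about the `openbmb/UltraFeedback` schema, and picking a random lower-rated completion as `rejected` is just one reasonable choice; this is not the exact script used to build the released dataset.

```python
# Rough sketch of binarizing UltraFeedback by averaged preference ratings
# (not the exact curation script; schema field names are assumptions).
import random
from datasets import load_dataset

def avg_rating(completion: dict) -> float:
    """Average the per-aspect preference ratings (helpfulness, honesty, ...)."""
    ratings = [
        float(aspect["Rating"])
        for aspect in completion["annotations"].values()
        if aspect.get("Rating", "N/A") != "N/A"
    ]
    return sum(ratings) / len(ratings) if ratings else 0.0

def binarize(example: dict) -> dict:
    """Keep the top-rated completion as 'chosen' and a random other one as 'rejected'."""
    ranked = sorted(example["completions"], key=avg_rating, reverse=True)
    return {
        "prompt": example["instruction"],
        "chosen": ranked[0]["response"],
        "rejected": random.choice(ranked[1:])["response"],
    }

ds = load_dataset("openbmb/UltraFeedback", split="train")
ds = ds.filter(lambda ex: len(ex["completions"]) > 1)  # need at least two completions per prompt
dpo_ds = ds.map(binarize, remove_columns=ds.column_names)
```

Binarizing on the averaged ratings rather than the critique `overall_score` is what changes the chosen response in roughly half of the pairs, as noted above.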

  ## Model Details

  ### Model Description

+ - **Developed by:** Argilla (based on HuggingFace H4 and MistralAI previous efforts and amazing work)
+ - **Shared by:** Argilla
  - **Model type:** GPT-like 7B model DPO fine-tuned
  - **Language(s) (NLP):** Mainly English
  - **License:** Apache 2.0 (same as Zephyr 7B SFT and Mistral 7B v0.1)
 
  - **Paper:** N/A
  - **Demo:** https://argilla-notus-chat-ui.hf.space/

+ ## Performance

+ ### Chat benchmarks
+ This table updates Zephyr-7b-β's original results table for the [MT-Bench](https://huggingface.co/spaces/lmsys/mt-bench) and [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/) benchmarks:

+ | Model | Size | Alignment | MT-Bench (score) | AlpacaEval (win rate %) |
+ |------------------------|--------|-----------|------------------|-------------------------|
+ | StableLM-Tuned-α       | 7B     | dSFT      | 2.75             | -                       |
+ | MPT-Chat               | 7B     | dSFT      | 5.42             | -                       |
+ | Xwin-LM v0.1           | 7B     | dPPO      | 6.19             | 87.83                   |
+ | Mistral-Instruct v0.1  | 7B     | -         | 6.84             | -                       |
+ | Zephyr-7b-α            | 7B     | dDPO      | 6.88             | -                       |
+ | Zephyr-7b-β 🪁         | **7B** | **dDPO**  | **7.34**         | 90.60                   |
+ | **Notus-7b-v1** 🪁     | **7B** | **dDPO**  | 7.30             | **91.42**               |
+ | Falcon-Instruct        | 40B    | dSFT      | 5.17             | 45.71                   |
+ | Guanaco                | 65B    | SFT       | 6.41             | 71.80                   |
+ | Llama2-Chat            | 70B    | RLHF      | 6.86             | 92.66                   |
+ | Vicuna v1.3            | 33B    | dSFT      | 7.12             | 88.99                   |
+ | WizardLM v1.0          | 70B    | dSFT      | 7.71             | -                       |
+ | Xwin-LM v0.1           | 70B    | dPPO      | -                | 95.57                   |
+ | GPT-3.5-turbo          | -      | RLHF      | 7.94             | 89.37                   |
+ | Claude 2               | -      | RLHF      | 8.06             | 91.36                   |
+ | GPT-4                  | -      | RLHF      | 8.99             | 95.28                   |
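Since Notus is intended to be used as an assistant through chat-like applications, a minimal generation sketch with `transformers` may help. It assumes the tokenizer ships a Zephyr-style chat template; the system prompt and sampling settings are illustrative, not recommended values.

```python
# Minimal chat-style generation sketch; assumes a Zephyr-style chat template is available.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="argilla/notus-7b-v1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain DPO in two sentences."},
]
# Build the prompt from the tokenizer's chat template before generating.
prompt = generator.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
output = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)
print(output[0]["generated_text"])
```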

+ ### Academic benchmarks

+ | Model                                          | Average   | ARC       | HellaSwag | MMLU      | TruthfulQA | Winogrande | GSM8K     | DROP     |
+ |------------------------------------------------|-----------|-----------|-----------|-----------|------------|------------|-----------|----------|
+ | Zephyr 7B dDPO (HuggingFaceH4/zephyr-7b-beta)  | 52.15     | 62.03     | 84.36     | 61.07     | **57.45**  | 77.74      | 12.74     | **9.66** |
+ | argilla/notus-7b-v1                            | **52.89** | **64.59** | **84.78** | **63.03** | 54.37      | **79.4**   | **15.16** | 8.91     |
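The previous revision of this card notes that these scores were produced with [`EleutherAI/lm-eval-harness`](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor) (`big-refactor` branch), mimicking the Open LLM Leaderboard setup. The sketch below shows how such a run could look with that harness's Python API; the task names, few-shot counts, and arguments are assumptions about that branch and not necessarily the exact setup used for the table above.

```python
# Hedged sketch of reproducing the academic benchmarks with lm-evaluation-harness
# (big-refactor style Python API). Few-shot counts follow the Open LLM Leaderboard convention.
import lm_eval

TASK_FEWSHOT = {
    "arc_challenge": 25,
    "hellaswag": 10,
    "mmlu": 5,
    "truthfulqa_mc2": 0,
    "winogrande": 5,
    "gsm8k": 5,
    "drop": 3,
}

for task, shots in TASK_FEWSHOT.items():
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=argilla/notus-7b-v1,dtype=bfloat16",
        tasks=[task],
        num_fewshot=shots,
        batch_size=8,
    )
    print(task, results["results"][task])
```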

  ## Training Details

  ### Training Data

+ We used a new curated version of [`openbmb/UltraFeedback`](https://huggingface.co/datasets/openbmb/UltraFeedback), named [`argilla/ultrafeedback-binarized-avg-rating-for-dpo`](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-avg-rating-for-dpo).
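For reference, the preference data can be pulled straight from the Hub with `datasets`. Inspecting the columns first is safest, since the exact DPO-ready field names (e.g. `prompt`/`chosen`/`rejected`) and split names are assumptions here rather than a documented schema.

```python
# Load the curated preference dataset used for DPO fine-tuning and inspect it.
from datasets import load_dataset

dataset = load_dataset("argilla/ultrafeedback-binarized-avg-rating-for-dpo")
print(dataset)                         # available splits and row counts
print(dataset["train"].column_names)   # assumes a "train" split; check actual field names before training
```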

  ### Training hyperparameters