---
model-index:
- name: notus-7b-v1
  results: []
datasets:
- argilla/ultrafeedback-binarized-preferences
language:
- en
base_model: alignment-handbook/zephyr-7b-sft-full
library_name: transformers
pipeline_tag: text-generation
tags:
- dpo
- preference
- ultrafeedback
license: mit
---
<div align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/60420dccc15e823a685f2b03/CuMO3IjJfymC94_5qd15T.png" alt="Image was artificially generated by Dalle-3 via ChatGPT Pro"/>
</div>

# Model Card for Notus 7B v1
Notus is a collection of fine-tuned models using Direct Preference Optimization (DPO) and related RLHF techniques. This model is version 1, fine-tuned with DPO starting with zephyr-7b-beta's SFT model. 

Following a **data-first** approach, the only difference between Notus-7B-v1 and Zephyr-7B-beta is the preference dataset used for dDPO. In particular, we've found data issues in the original UltraFeedback dataset, leading to high-scores for bad responses. After curating several hundreds of data points, we decided to binarize the dataset using the preference ratings, instead of the original critique `overall_score`.
Using preference ratings, instead of critiques scores, led to a new dataset where the chosen response is different in ~50% of the cases.

This model wouldn't have been possible without the amazing [Alignment Handbook]( https://github.com/huggingface/alignment-handbook/tree/main/recipes/zephyr-7b-beta) and it's based on fruitful discussions with the H4 team. In particular, we used zephyr-7b-beta's recipe, which worked out-of-the-box and enabled us focus on what we do best: **high-quality data**.

Notus models are intended to be used as assistants via chat-like applications, and 
are evaluated with Chat (MT-Bench, AlpacaEval) and Academic (Open LLM Leaderboard) benchmarks for a direct comparison
with the original Zephyr dDPO model and other 7B models.

## Model Details

### Model Description

- **Developed by:** Argilla (based on HuggingFace H4 and MistralAI previous efforts and amazing work)
- **Shared by:** Argilla
- **Model type:** GPT-like 7B model DPO fine-tuned
- **Language(s) (NLP):** Mainly English
- **License:** MIT (same as Zephyr 7B-beta)
- **Finetuned from model:** [`alignment-handbook/zephyr-7b-sft-full`](https://huggingface.co/alignment-handbook/zephyr-7b-sft-full)

### Model Sources

- **Repository:** https://github.com/argilla-io/notus-7b
- **Paper:** N/A
- **Demo:** https://argilla-notus-chat-ui.hf.space/

## Performance

### Chat benchmarks
Table adapted from Zephyr-7b-β and Starling's original tables for [MT-Bench](https://huggingface.co/spaces/lmsys/mt-bench) and [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/) benchmarks. Results are shown sorted by AlpacaEval win rates and ommit some >7B for brevity.
Notus stays on par with Zephyr on MT-Bench, while surpassing Zephyr, Claude 2, and Cohere Command on AlpacaEval. Making Notus the most-competitive 7B commercial model on AlpacaEval.
<table>
    <tr>
        <th>Model</th>
        <th>Size</th>
        <th>Alignment</th>
        <th>MT-Bench (score)</th>
        <th>AlpacaEval (win rate %)</th>
        <th>License</th>
    </tr>
    <tr>
        <td>GPT-4-turbo</td>
        <td>-</td>
        <td>?</td>
        <td>9.32</td>
        <td>97.70</td>
        <td>Proprietary</td>
    </tr>
    <tr>
        <td>XwinLM 70b V0.1</td>
        <td>70B</td>
        <td>dPPO</td>
        <td>-</td>
        <td>95.57</td>
        <td>LLaMA 2 License</td>
    </tr>
    <tr>
        <td>GPT-4</td>
        <td>-</td>
        <td>RLHF</td>
        <td>8.99</td>
        <td>95.03</td>
        <td>Proprietary</td>
    </tr>
    <tr>
        <td>Tulu 2+DPO 70B V0.1</td>
        <td>70B</td>
        <td>dDPO</td>
        <td>6.29</td>
        <td>95.28</td>
        <td>Proprietary</td>
    </tr>
    <tr>
        <td>LLaMA2 Chat 70B</td>
        <td>70B</td>
        <td>RLHF</td>
        <td>6.86</td>
        <td>92.66</td>
        <td>LLaMA 2 License</td>
    </tr>
    <tr>
        <td>Starling-7B</td>
        <td>7B</td>
        <td>C-RLFT + APA</td>
        <td><strong>8.09</strong></td>
        <td><strong>91.99</strong></td>
        <td>CC-BY-NC-4.0</td>
    </tr>
    <tr style="background-color: #FFFF99;">
        <td><strong>Notus-7b-v1</strong></td>
        <td>7B</td>
        <td>dDPO</td>
        <td>7.30</td>
        <td>91.42</td>
        <td>MIT</td>
    </tr>
    <tr>
        <td>Claude 2</td>
        <td>-</td>
        <td>RLHF</td>
        <td>8.06</td>
        <td>91.36</td>
        <td>Proprietary</td>
    </tr>
    <tr>
        <td>Zephyr-7b-β</td>
        <td>7B</td>
        <td>dDPO</td>
        <td>7.34</td>
        <td>90.60</td>
        <td>MIT</td>
    </tr>
    <tr>
        <td>Cohere Command</td>
        <td>-</td>
        <td>RLHF</td>
        <td>-</td>
        <td>90.62</td>
        <td>Proprietary</td>
    </tr>
    <tr>
        <td>GPT-3.5-turbo</td>
        <td>-</td>
        <td>RLHF</td>
        <td>7.94</td>
        <td>89.37</td>
        <td>Proprietary</td>
    </tr>
</table>


## Academic benchmarks

Results from [OpenLLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard):

| Model                                         | Average | ARC   | HellaSwag | MMLU  | TruthfulQA | Winogrande | GSM8K | DROP  |
|-----------------------------------------------|---------|-------|-----------|-------|------------|------------|-------|-------|
| Zephyr 7B dDPO (HuggingFaceH4/zephyr-7b-beta) | 52.15   | 62.03 | 84.36      | 61.07 | **57.45**  | 77.74      | 12.74 | **9.66**  |
| argilla/notus-7b-v1                           | **52.89**   | **64.59** | **84.78**  | **63.03** | 54.37       | **79.4**       | **15.16** | 8.91 |


## Training Details

### Training Hardware

We used a VM with 8 x A100 40GB hosted in Lambda Labs.

### Training Data

We used a a new curated version of [`openbmb/UltraFeedback`](https://huggingface.co/datasets/openbmb/UltraFeedback), named [`argilla/ultrafeedback-binarized-avg-rating-for-dpo`](https://huggingface.co/argilla/ultrafeedback-binarized-avg-rating-for-dpo).

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-07
- train_batch_size: 8
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- total_train_batch_size: 64
- total_eval_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 3

### Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.5051        | 0.1   | 100  | 0.5180          | 0.1475         | -0.3954          | 0.7183             | 0.5429          | -246.6286      | -297.5412    | -2.7438         | -3.0431       |
| 0.4321        | 0.21  | 200  | 0.4375          | 0.1353         | -0.9529          | 0.7540             | 1.0882          | -252.2036      | -297.6632    | -2.7578         | -3.0543       |
| 0.3848        | 0.31  | 300  | 0.4301          | -0.4813        | -1.8921          | 0.7302             | 1.4107          | -261.5956      | -303.8301    | -2.7592         | -3.0508       |
| 0.3777        | 0.42  | 400  | 0.4091          | -0.8597        | -2.5306          | 0.7698             | 1.6709          | -267.9805      | -307.6138    | -2.7476         | -3.0474       |
| 0.3559        | 0.52  | 500  | 0.4332          | -1.0424        | -2.6019          | 0.7619             | 1.5595          | -268.6939      | -309.4406    | -2.2960         | -2.6106       |
| 0.4178        | 0.62  | 600  | 0.3934          | -0.6434        | -2.4837          | 0.7659             | 1.8404          | -267.5121      | -305.4503    | -2.5487         | -2.8508       |
| 0.4206        | 0.73  | 700  | 0.4058          | -1.4700        | -3.5113          | 0.7857             | 2.0413          | -277.7877      | -313.7168    | -2.5679         | -2.8727       |
| 0.4323        | 0.83  | 800  | 0.3929          | -0.9025        | -2.6935          | 0.7897             | 1.7910          | -269.6095      | -308.0414    | -2.6213         | -2.9202       |
| 0.3706        | 0.93  | 900  | 0.3903          | -1.1122        | -3.0257          | 0.8056             | 1.9135          | -272.9316      | -310.1388    | -2.5428         | -2.8416       |
| 0.0496        | 1.04  | 1000 | 0.3991          | -1.4248        | -4.1245          | 0.8016             | 2.6997          | -283.9196      | -313.2651    | -2.5093         | -2.8150       |
| 0.0723        | 1.14  | 1100 | 0.3999          | -1.8789        | -4.5317          | 0.7897             | 2.6528          | -287.9914      | -317.8056    | -2.5170         | -2.8242       |
| 0.0481        | 1.25  | 1200 | 0.4191          | -2.6211        | -5.5294          | 0.7817             | 2.9083          | -297.9687      | -325.2281    | -2.5139         | -2.8109       |
| 0.0432        | 1.35  | 1300 | 0.4070          | -2.0605        | -5.0460          | 0.8056             | 2.9855          | -293.1345      | -319.6214    | -2.5153         | -2.8121       |
| 0.0402        | 1.45  | 1400 | 0.4001          | -2.2445        | -5.0942          | 0.7937             | 2.8497          | -293.6164      | -321.4614    | -2.4383         | -2.7388       |
| 0.0529        | 1.56  | 1500 | 0.4066          | -2.3499        | -5.2468          | 0.8016             | 2.8969          | -295.1426      | -322.5153    | -2.3906         | -2.6963       |
| 0.0651        | 1.66  | 1600 | 0.3962          | -2.0597        | -4.8915          | 0.8016             | 2.8318          | -291.5901      | -319.6136    | -2.3390         | -2.6469       |
| 0.0738        | 1.77  | 1700 | 0.3942          | -1.8893        | -4.6107          | 0.8135             | 2.7214          | -288.7817      | -317.9099    | -2.3532         | -2.6607       |
| 0.0597        | 1.87  | 1800 | 0.3990          | -1.8774        | -4.7221          | 0.8175             | 2.8448          | -289.8961      | -317.7905    | -2.2728         | -2.5908       |
| 0.0686        | 1.97  | 1900 | 0.3924          | -1.8745        | -4.6807          | 0.8056             | 2.8062          | -289.4821      | -317.7617    | -2.2554         | -2.5658       |
| 0.0116        | 2.08  | 2000 | 0.4260          | -2.4687        | -5.7190          | 0.7937             | 3.2503          | -299.8647      | -323.7037    | -2.2297         | -2.5347       |
| 0.0114        | 2.18  | 2100 | 0.4519          | -2.8266        | -6.3706          | 0.7976             | 3.5440          | -306.3802      | -327.2823    | -2.2185         | -2.5219       |
| 0.0073        | 2.28  | 2200 | 0.4563          | -2.9422        | -6.5564          | 0.8016             | 3.6142          | -308.2384      | -328.4384    | -2.2103         | -2.5126       |
| 0.0094        | 2.39  | 2300 | 0.4636          | -3.3246        | -7.0542          | 0.8016             | 3.7296          | -313.2165      | -332.2628    | -2.2059         | -2.5081       |
| 0.0056        | 2.49  | 2400 | 0.4745          | -3.3599        | -7.1652          | 0.7976             | 3.8053          | -314.3266      | -332.6161    | -2.1945         | -2.4943       |
| 0.0052        | 2.6   | 2500 | 0.4812          | -3.4916        | -7.3391          | 0.7976             | 3.8475          | -316.0656      | -333.9322    | -2.1888         | -2.4881       |
| 0.0065        | 2.7   | 2600 | 0.4678          | -3.2226        | -6.9887          | 0.7976             | 3.7661          | -312.5613      | -331.2425    | -2.1644         | -2.4560       |
| 0.0059        | 2.8   | 2700 | 0.4694          | -3.4307        | -7.2484          | 0.7976             | 3.8177          | -315.1584      | -333.3234    | -2.1572         | -2.4483       |
| 0.0054        | 2.91  | 2800 | 0.4707          | -3.4959        | -7.3283          | 0.8056             | 3.8324          | -315.9576      | -333.9758    | -2.1575         | -2.4491       |

### Framework versions

- Transformers 4.35.0
- Pytorch 2.1.1+cu121
- Datasets 2.14.6
- Tokenizers 0.14.1

### Evaluation during Training

- Loss: 0.4730
- Rewards/chosen: -3.5289
- Rewards/rejected: -7.3700
- Rewards/accuracies: 0.8016
- Rewards/margins: 3.8412
- Logps/rejected: -316.3751
- Logps/chosen: -334.3053
- Logits/rejected: -2.1644
- Logits/chosen: -2.4556
Model	Size	Alignment	MT-Bench (score)	AlpacaEval (win rate %)	License
GPT-4-turbo	-	?	9.32	97.70	Proprietary
XwinLM 70b V0.1	70B	dPPO	-	95.57	LLaMA 2 License
GPT-4	-	RLHF	8.99	95.03	Proprietary
Tulu 2+DPO 70B V0.1	70B	dDPO	6.29	95.28	Proprietary
LLaMA2 Chat 70B	70B	RLHF	6.86	92.66	LLaMA 2 License
Starling-7B	7B	C-RLFT + APA	8.09	91.99	CC-BY-NC-4.0
Notus-7b-v1	7B	dDPO	7.30	91.42	MIT
Claude 2	-	RLHF	8.06	91.36	Proprietary
Zephyr-7b-β	7B	dDPO	7.34	90.60	MIT
Cohere Command	-	RLHF	-	90.62	Proprietary
GPT-3.5-turbo	-	RLHF	7.94	89.37	Proprietary