---
model-index:
  - name: notus-7b-v1
    results: []
datasets:
  - argilla/ultrafeedback-binarized-avg-rating-for-dpo
language:
  - en
base_model: alignment-handbook/zephyr-7b-sft-full
library_name: transformers
pipeline_tag: text-generation
tags:
  - dpo
  - preference
  - ultrafeedback
license: apache-2.0
---

# Model Card for Notus 7B v1
*Image artificially generated by DALL·E 3 via ChatGPT Pro.*
Notus is a collection of models fine-tuned with Direct Preference Optimization (DPO) and related RLHF techniques. This model is version 1, fine-tuned with DPO starting from zephyr-7b-beta's SFT model.

Following a **data-first** approach, the only difference between Notus-7B-v1 and Zephyr-7B-beta is the preference dataset used for dDPO. In particular, we found data issues in the original UltraFeedback dataset that led to high scores for bad responses. After curating several hundred data points, we decided to binarize the dataset using the preference ratings instead of the original critique `overall_score`. Using preference ratings instead of critique scores led to a new dataset where the chosen response differs in ~50% of the cases.

This model wouldn't have been possible without the amazing [Alignment Handbook](https://github.com/huggingface/alignment-handbook/tree/main/recipes/zephyr-7b-beta) and the fruitful discussions we had with the H4 team. In particular, we used zephyr-7b-beta's recipe, which worked out of the box and let us focus on what we do best: **high-quality data**.

Notus models are intended to be used as assistants via chat-like applications, and are evaluated with Chat (MT-Bench, AlpacaEval) and Academic (Open LLM Leaderboard) benchmarks for a direct comparison with the original Zephyr dDPO model.

## Model Details

### Model Description

- **Developed by:** Argilla (building on the previous efforts and amazing work of HuggingFace H4 and MistralAI)
- **Shared by:** Argilla
- **Model type:** GPT-like 7B model, DPO fine-tuned
- **Language(s) (NLP):** Mainly English
- **License:** Apache 2.0 (same as Zephyr 7B SFT and Mistral 7B v0.1)
- **Finetuned from model:** [`alignment-handbook/zephyr-7b-sft-full`](https://huggingface.co/alignment-handbook/zephyr-7b-sft-full)

### Model Sources

- **Repository:** https://github.com/argilla-io/notus-7b
- **Paper:** N/A
- **Demo:** https://argilla-notus-chat-ui.hf.space/

## Performance

### Chat benchmarks

This is an updated version of Zephyr-7b-β's original table for the [MT-Bench](https://huggingface.co/spaces/lmsys/mt-bench) and [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/) benchmarks:

| Model | Size | Alignment | MT-Bench (score) | AlpacaEval (win rate %) |
|-------|------|-----------|------------------|-------------------------|
| StableLM-Tuned-α | 7B | dSFT | 2.75 | - |
| MPT-Chat | 7B | dSFT | 5.42 | - |
| Xwin-LM v0.1 | 7B | dPPO | 6.19 | 87.83 |
| Mistral-Instruct v0.1 | 7B | - | 6.84 | - |
| Zephyr-7b-α | 7B | dDPO | 6.88 | - |
| Zephyr-7b-β 🪁 | **7B** | **dDPO** | **7.34** | 90.60 |
| **Notus-7b-v1** 🪁 | **7B** | **dDPO** | 7.30 | **91.42** |
| Falcon-Instruct | 40B | dSFT | 5.17 | 45.71 |
| Guanaco | 65B | SFT | 6.41 | 71.80 |
| Llama2-Chat | 70B | RLHF | 6.86 | 92.66 |
| Vicuna v1.3 | 33B | dSFT | 7.12 | 88.99 |
| WizardLM v1.0 | 70B | dSFT | 7.71 | - |
| Xwin-LM v0.1 | 70B | dPPO | - | 95.57 |
| GPT-3.5-turbo | - | RLHF | 7.94 | 89.37 |
| Claude 2 | - | RLHF | 8.06 | 91.36 |
| GPT-4 | - | RLHF | 8.99 | 95.28 |

### Academic benchmarks

| Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K | DROP |
|-------|---------|-----|-----------|------|------------|------------|-------|------|
| Zephyr 7B dDPO (HuggingFaceH4/zephyr-7b-beta) | 52.15 | 62.03 | 84.36 | 61.07 | **57.45** | 77.74 | 12.74 | **9.66** |
| argilla/notus-7b-v1 | **52.89** | **64.59** | **84.78** | **63.03** | 54.37 | **79.4** | **15.16** | 8.91 |

## Training Details

### Training Hardware

We used a VM with 8 x A100 40GB hosted on Lambda Labs.

### Training Data

We used a new curated version of [`openbmb/UltraFeedback`](https://huggingface.co/datasets/openbmb/UltraFeedback), named [`argilla/ultrafeedback-binarized-avg-rating-for-dpo`](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-avg-rating-for-dpo), where responses are binarized by their average preference rating, as sketched below.
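To make the curation step concrete, here is a minimal sketch of rating-based binarization. It assumes the raw UltraFeedback schema exposes, per record, an `instruction` and a list of `completions` whose per-aspect `annotations` hold numeric `Rating` values; those field names are assumptions for illustration, not our exact curation code:

```python
# Sketch of rating-based binarization (field names assumed for illustration).
from statistics import mean

from datasets import load_dataset

raw = load_dataset("openbmb/UltraFeedback", split="train")

def binarize(example):
    # Score each completion by the mean of its per-aspect preference ratings,
    # instead of relying on the critique's `overall_score`.
    def avg_rating(completion):
        return mean(
            float(aspect["Rating"])  # assumes numeric ratings
            for aspect in completion["annotations"].values()
        )

    ranked = sorted(example["completions"], key=avg_rating, reverse=True)
    return {
        "prompt": example["instruction"],
        "chosen": ranked[0]["response"],     # highest average rating
        "rejected": ranked[-1]["response"],  # lowest average rating
    }

dpo_dataset = raw.map(binarize, remove_columns=raw.column_names)
```

Re-ranking by average rating rather than `overall_score` is what flips the chosen response in roughly half of the examples, as noted above.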
### Training hyperparameters

The following hyperparameters were used during training (see the sketch after this list for how they could map onto a trainer run):

- learning_rate: 5e-07
- train_batch_size: 8
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- total_train_batch_size: 64
- total_eval_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 3
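For orientation only, this is a hedged sketch of how the hyperparameters above could map onto a TRL `DPOTrainer` run (TRL ~0.7, contemporary with the framework versions listed below). The actual training used the Alignment Handbook's zephyr-7b-beta recipe; in particular, `beta=0.1` and the `bf16` flag are assumptions, since they are not listed in this card:

```python
# Hedged sketch of the DPO run implied by the hyperparameters above.
# Launch with `accelerate launch` (or torchrun) across 8 GPUs so that
# 8 per-device x 8 devices = 64 total train batch size.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "alignment-handbook/zephyr-7b-sft-full"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Assumes the dataset exposes string `prompt`/`chosen`/`rejected` columns.
dataset = load_dataset("argilla/ultrafeedback-binarized-avg-rating-for-dpo")

args = TrainingArguments(
    output_dir="notus-7b-v1",
    learning_rate=5e-07,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    seed=42,
    bf16=True,  # assumption: precision is not listed in this card
    # AdamW defaults already match the betas/epsilon listed above.
)

trainer = DPOTrainer(
    model,
    ref_model=None,  # TRL builds a frozen reference copy of `model` when None
    args=args,
    beta=0.1,        # assumption: the usual zephyr-7b-beta value
    train_dataset=dataset["train"],
    tokenizer=tokenizer,
)
trainer.train()
```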
### Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.5051 | 0.1 | 100 | 0.5180 | 0.1475 | -0.3954 | 0.7183 | 0.5429 | -246.6286 | -297.5412 | -2.7438 | -3.0431 |
| 0.4321 | 0.21 | 200 | 0.4375 | 0.1353 | -0.9529 | 0.7540 | 1.0882 | -252.2036 | -297.6632 | -2.7578 | -3.0543 |
| 0.3848 | 0.31 | 300 | 0.4301 | -0.4813 | -1.8921 | 0.7302 | 1.4107 | -261.5956 | -303.8301 | -2.7592 | -3.0508 |
| 0.3777 | 0.42 | 400 | 0.4091 | -0.8597 | -2.5306 | 0.7698 | 1.6709 | -267.9805 | -307.6138 | -2.7476 | -3.0474 |
| 0.3559 | 0.52 | 500 | 0.4332 | -1.0424 | -2.6019 | 0.7619 | 1.5595 | -268.6939 | -309.4406 | -2.2960 | -2.6106 |
| 0.4178 | 0.62 | 600 | 0.3934 | -0.6434 | -2.4837 | 0.7659 | 1.8404 | -267.5121 | -305.4503 | -2.5487 | -2.8508 |
| 0.4206 | 0.73 | 700 | 0.4058 | -1.4700 | -3.5113 | 0.7857 | 2.0413 | -277.7877 | -313.7168 | -2.5679 | -2.8727 |
| 0.4323 | 0.83 | 800 | 0.3929 | -0.9025 | -2.6935 | 0.7897 | 1.7910 | -269.6095 | -308.0414 | -2.6213 | -2.9202 |
| 0.3706 | 0.93 | 900 | 0.3903 | -1.1122 | -3.0257 | 0.8056 | 1.9135 | -272.9316 | -310.1388 | -2.5428 | -2.8416 |
| 0.0496 | 1.04 | 1000 | 0.3991 | -1.4248 | -4.1245 | 0.8016 | 2.6997 | -283.9196 | -313.2651 | -2.5093 | -2.8150 |
| 0.0723 | 1.14 | 1100 | 0.3999 | -1.8789 | -4.5317 | 0.7897 | 2.6528 | -287.9914 | -317.8056 | -2.5170 | -2.8242 |
| 0.0481 | 1.25 | 1200 | 0.4191 | -2.6211 | -5.5294 | 0.7817 | 2.9083 | -297.9687 | -325.2281 | -2.5139 | -2.8109 |
| 0.0432 | 1.35 | 1300 | 0.4070 | -2.0605 | -5.0460 | 0.8056 | 2.9855 | -293.1345 | -319.6214 | -2.5153 | -2.8121 |
| 0.0402 | 1.45 | 1400 | 0.4001 | -2.2445 | -5.0942 | 0.7937 | 2.8497 | -293.6164 | -321.4614 | -2.4383 | -2.7388 |
| 0.0529 | 1.56 | 1500 | 0.4066 | -2.3499 | -5.2468 | 0.8016 | 2.8969 | -295.1426 | -322.5153 | -2.3906 | -2.6963 |
| 0.0651 | 1.66 | 1600 | 0.3962 | -2.0597 | -4.8915 | 0.8016 | 2.8318 | -291.5901 | -319.6136 | -2.3390 | -2.6469 |
| 0.0738 | 1.77 | 1700 | 0.3942 | -1.8893 | -4.6107 | 0.8135 | 2.7214 | -288.7817 | -317.9099 | -2.3532 | -2.6607 |
| 0.0597 | 1.87 | 1800 | 0.3990 | -1.8774 | -4.7221 | 0.8175 | 2.8448 | -289.8961 | -317.7905 | -2.2728 | -2.5908 |
| 0.0686 | 1.97 | 1900 | 0.3924 | -1.8745 | -4.6807 | 0.8056 | 2.8062 | -289.4821 | -317.7617 | -2.2554 | -2.5658 |
| 0.0116 | 2.08 | 2000 | 0.4260 | -2.4687 | -5.7190 | 0.7937 | 3.2503 | -299.8647 | -323.7037 | -2.2297 | -2.5347 |
| 0.0114 | 2.18 | 2100 | 0.4519 | -2.8266 | -6.3706 | 0.7976 | 3.5440 | -306.3802 | -327.2823 | -2.2185 | -2.5219 |
| 0.0073 | 2.28 | 2200 | 0.4563 | -2.9422 | -6.5564 | 0.8016 | 3.6142 | -308.2384 | -328.4384 | -2.2103 | -2.5126 |
| 0.0094 | 2.39 | 2300 | 0.4636 | -3.3246 | -7.0542 | 0.8016 | 3.7296 | -313.2165 | -332.2628 | -2.2059 | -2.5081 |
| 0.0056 | 2.49 | 2400 | 0.4745 | -3.3599 | -7.1652 | 0.7976 | 3.8053 | -314.3266 | -332.6161 | -2.1945 | -2.4943 |
| 0.0052 | 2.6 | 2500 | 0.4812 | -3.4916 | -7.3391 | 0.7976 | 3.8475 | -316.0656 | -333.9322 | -2.1888 | -2.4881 |
| 0.0065 | 2.7 | 2600 | 0.4678 | -3.2226 | -6.9887 | 0.7976 | 3.7661 | -312.5613 | -331.2425 | -2.1644 | -2.4560 |
| 0.0059 | 2.8 | 2700 | 0.4694 | -3.4307 | -7.2484 | 0.7976 | 3.8177 | -315.1584 | -333.3234 | -2.1572 | -2.4483 |
| 0.0054 | 2.91 | 2800 | 0.4707 | -3.4959 | -7.3283 | 0.8056 | 3.8324 | -315.9576 | -333.9758 | -2.1575 | -2.4491 |

### Framework versions

- Transformers 4.35.0
- Pytorch 2.1.1+cu121
- Datasets 2.14.6
- Tokenizers 0.14.1

### Evaluation during Training

- Loss: 0.4730
- Rewards/chosen: -3.5289
- Rewards/rejected: -7.3700
- Rewards/accuracies: 0.8016
- Rewards/margins: 3.8412
- Logps/rejected: -316.3751
- Logps/chosen: -334.3053
- Logits/rejected: -2.1644
- Logits/chosen: -2.4556
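## Usage example

Finally, a minimal inference sketch for using the model as a chat assistant with `transformers`; the sampling parameters here are illustrative defaults, not tuned recommendations:

```python
# Minimal chat-style inference sketch; sampling settings are illustrative.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="argilla/notus-7b-v1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful, honest assistant."},
    {"role": "user", "content": "What is Direct Preference Optimization?"},
]
# Assumes the tokenizer ships a chat template (following the Zephyr recipe).
prompt = generator.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

outputs = generator(
    prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.95
)
print(outputs[0]["generated_text"])
```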