---
language:
- en
license: apache-2.0
tags:
- distilabel
- dpo
- rlaif
- rlhf
datasets:
- argilla/distilabel-intel-orca-dpo-pairs
model-index:
- name: distilabeled-Hermes-2.5-Mistral-7B
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 66.3
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=argilla/distilabeled-Hermes-2.5-Mistral-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 85.15
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=argilla/distilabeled-Hermes-2.5-Mistral-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 63.5
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=argilla/distilabeled-Hermes-2.5-Mistral-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 55.75
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=argilla/distilabeled-Hermes-2.5-Mistral-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 78.93
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=argilla/distilabeled-Hermes-2.5-Mistral-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 60.88
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=argilla/distilabeled-Hermes-2.5-Mistral-7B
      name: Open LLM Leaderboard
---
# ⚗️ distilabeled OpenHermes 2.5 Mistral 7B

> A DPO fine-tune of OpenHermes 2.5: high-quality data matters for DPO!

<div>
    <img src="https://cdn-uploads.huggingface.co/production/uploads/60420dccc15e823a685f2b03/yWdvBtKKfJdpdnPiSlNb9.png">
</div>

<p align="center">
  <a href="https://github.com/argilla-io/distilabel">
    <img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>
  </a>
</p>

## Introduction
This model is the virtual launching partner of our new open dataset [argilla/distilabel-intel-orca-dpo-pairs](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs). It's a DPO fine-tune of [OpenHermes-2.5-Mistral-7B](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B). It outperforms the awesome `mlabonne/NeuralHermes-2.5-Mistral-7B` with the **exact same DPO recipe, but using our new orca-pairs dataset**.

The dataset is a "distilabeled" version of the widely used [Intel/orca_dpo_pairs](https://huggingface.co/datasets/Intel/orca_dpo_pairs). The original dataset has been used by hundreds of open-source practitioners and models. We knew from fixing UltraFeedback (and before that, Alpacas and Dollys) that this dataset could be substantially improved.

Continuing with our mission to build the best alignment datasets for open source LLMs and the community, we spent a few hours to improve it with [distilabel](https://github.com/argilla-io/distilabel). 

The main intuition was this: the original dataset simply assumes that the gpt-4/gpt-3.5-turbo responses are always the best. We know from UltraFeedback that's not always the case. Moreover, DPO fine-tuning benefits from diversity of preference pairs.

This is what it took to build a real preference dataset with distilabel:

```python
import random

from distilabel.llm import OpenAILLM
from distilabel.tasks import JudgeLMTask
from distilabel.pipeline import Pipeline

from datasets import load_dataset

dataset = load_dataset("Intel/orca_dpo_pairs", split="train")

# shuffle the chosen/rejected order (and remember which was which)
# to mitigate the positional bias of the judge
# (illustrative definition; shuffle_and_track was used but not defined
# in the original snippet)
def shuffle_and_track(chosen, rejected):
    pair = [chosen, rejected]
    random.shuffle(pair)
    order = ["chosen" if x == chosen else "rejected" for x in pair]
    return {"generations": pair, "order": order}

dataset = dataset.map(lambda x: shuffle_and_track(x["chosen"], x["rejected"]))

# we use our JudgeLM implementation to rate the original pairs
labeler = OpenAILLM(
    task=JudgeLMTask(),
    model="gpt-4-1106-preview",
    num_threads=16,
    max_new_tokens=512,
)

dataset = dataset.rename_columns({"question": "input"})

distipipe = Pipeline(
    labeller=labeler
)

# this computes ratings and natural language critiques for each pair
ds = distipipe.generate(dataset=dataset, num_generations=2)
```
The resulting dataset is now much more useful: we know which response is preferred (by gpt-4-turbo), which ones have low scores, and we even have natural language explanations. But what did we find? Was our intuition confirmed?

![image/png](https://cdn-uploads.huggingface.co/production/uploads/60420dccc15e823a685f2b03/-V8wY1DYzrtwM9LbGrBXq.png)

The above chart shows the following: 

* ~4,000 pairs were given the same rating (a tie).
* ~7,000 pairs were correct according to our AI judge (`unchanged`).
* and ~2,000 times the rejected response was preferred (`swapped`).
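The bucketing behind the chart can be sketched in a few lines (made-up scores here; this assumes a pair counts as a `tie` when both ratings match and as `swapped` when the rejected response scores higher, which is how the dataset's `status` column reads):

```python
from collections import Counter

# hypothetical judge ratings for a few pairs: (chosen_score, rejected_score)
ratings = [(9, 9), (8, 5), (4, 7), (10, 10), (6, 9)]

def status(chosen_score, rejected_score):
    # bucket each pair the way the chart does
    if chosen_score == rejected_score:
        return "tie"
    return "unchanged" if chosen_score > rejected_score else "swapped"

counts = Counter(status(c, r) for c, r in ratings)
print(dict(counts))  # {'tie': 2, 'unchanged': 1, 'swapped': 2}
```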

Now the next question is: can we build better models with this new knowledge? The answer is "distilabeled Hermes" so let's get back to the model!

> If you love datasets as much as we do, check the [dataset](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs) and share it with your friends and colleagues.

## Training details
As we did with [Notus](https://argilla.io/blog/notus7b/), we wanted a reproducible recipe to test the impact of data quality. 

And we're lucky to have so many amazing folks in the open community contributing reproducible, easy-to-use training scripts and recipes. This time, [Maxime Labonne](https://twitter.com/maximelabonne) had shared a [Colab](https://colab.research.google.com/drive/15iFBr1xWgztXvhrj5I9fBv20c7CFOPBE?usp=sharing) to fine-tune OpenHermes with DPO and the original Intel dataset. Perfect! (Funnily enough, this exact recipe was recently used to fine-tune the [top-ranked 7B model](https://huggingface.co/CultriX/MistralTrix-v1).)

And that's all for the model part: we reused a good, reproducible recipe. 
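For intuition, the per-pair objective that DPO optimizes can be sketched as follows. This is an illustrative calculation, not the actual training code; `beta=0.1` is a commonly used default, not necessarily the value in the Colab:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * implicit reward margin))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# the more the policy prefers the chosen response (relative to the
# reference model), the smaller the loss
print(round(dpo_loss(-10.0, -20.0, -12.0, -15.0), 4))  # 0.4032
```

This is why pair quality matters: if the "chosen" response is actually worse (a swapped pair), the gradient pushes the model in the wrong direction.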

Once we had created the dataset, the training-data part is also kind of boring: we simply filtered the samples based on our intuition, with the goal of reducing the dataset size:

* Ties probably won't help DPO tuning learn anything meaningful: both responses are similarly good or bad (filter out `ties`).
* Very good chosen responses should steer the model toward good generations (keep only pairs where the chosen response scores >= 8).

Additionally, we did some "decontamination" of gsm8k prompts (very few that were present in the train split of gsm8k).

In code, using our new dataset this translates into:

```python
from datasets import load_dataset

# Instead of this:
# dataset = load_dataset("Intel/orca_dpo_pairs", split="train")

# we did this
dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")

dataset = dataset.filter(
    lambda r: 
        r["status"] != "tie" and 
        r["chosen_score"] >= 8 and 
        not r["in_gsm8k_train"]
)
```
This resulted in `5,922` samples instead of `12,859` (a 54% reduction), and we ran training for 200 steps (covering around ~3.2K samples).
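The numbers above check out with some quick arithmetic (the effective batch size of 16 is an assumption consistent with 200 steps covering ~3.2K samples, not a documented hyperparameter):

```python
original, filtered = 12_859, 5_922
print(f"{1 - filtered / original:.0%} reduction")  # 54% reduction

steps, effective_batch_size = 200, 16  # assumed effective batch size
print(steps * effective_batch_size)  # 3200 samples seen
```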

## Benchmark results
For benchmarking we used the famous "Nous" or "Teknium" benchmark suite. Below you can find an overview, including our first experiment with a less ambitious dataset filtering (removing ties and keeping pairs with `chosen_score > 5`).

For running the benchmark we used another awesome contribution from Maxime: [LLM AutoEval](https://github.com/mlabonne/llm-autoeval), check it out!


|                                                      Model                                                        | AGIEval | GPT4All | TruthfulQA | Bigbench | Average | 
|-------------------------------------------------------------------------------------------------------------------|--------:|--------:|-----------:|---------:|--------:|
| [argilla/distilabeled-Hermes-2.5-Mistral-7B](https://huggingface.co/argilla/distilabeled-Hermes-2.5-Mistral-7B)   |   **44.64** |   **73.35** |      55.96 |    42.21 |   **54.04** |   
| [dvilasuero/NeuralHermes-2.5-Mistral-7B-distilabel](https://huggingface.co/dvilasuero/NeuralHermes-2.5-Mistral-7B-distilabel) (first experiment) |   44.27 |    73.3 |      **56.26** |    **42.25** |   54.02 |   
| mlabonne/NeuralHermes-2.5-Mistral-7B (original recipe)                                                                   |   43.67 |   73.24 |      55.37 |    41.76 |   53.51 |   
| teknium/OpenHermes-2.5-Mistral-7B                                                                                 |   42.75  |   72.99 |    52.99  |   40.94  |    52.42| 

> Update: we now include lm-evaluation-harness results too!
 
| Model | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K | 
|------------------------------------------------------|-------|-----------|------|-----------:|------------|-------|
| [argilla/distilabeled-Hermes-2.5-Mistral-7B](https://huggingface.co/argilla/distilabeled-Hermes-2.5-Mistral-7B) | 66.04 | **85.07** | Pending | 55.96 | **79.56** | **66.34** |
| [dvilasuero/NeuralHermes-2.5-Mistral-7B-distilabel](https://huggingface.co/dvilasuero/NeuralHermes-2.5-Mistral-7B-distilabel) | 65.36 | 84.74 | Pending | **56.26** | 79.24 | 65.13 |
| [mlabonne/NeuralHermes-2.5-Mistral-7B](https://huggingface.co/mlabonne/NeuralHermes-2.5-Mistral-7B) | **66.55** | 84.90 | **63.32** | 54.93 | 78.30 | 61.30 | 
| [teknium/OpenHermes-2.5-Mistral-7B](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B) | 64.93 | 84.18 | 63.64 | 52.24 | 78.06 | 26.08 | 

### Training Hardware

We used 1 x A100 40GB on RunPod for less than 1 hour.

## Acknowledgements

We'd like to thank the amazing open community and in particular:

* The Intel team for publishing a great open dataset and showing how well it worked in the first place.
* Teknium and NousResearch for their awesome work and models.
* Maxime for sharing such great resources.


# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_argilla__distilabeled-Hermes-2.5-Mistral-7B)

|             Metric              |Value|
|---------------------------------|----:|
|Avg.                             |68.42|
|AI2 Reasoning Challenge (25-Shot)|66.30|
|HellaSwag (10-Shot)              |85.15|
|MMLU (5-Shot)                    |63.50|
|TruthfulQA (0-shot)              |55.75|
|Winogrande (5-shot)              |78.93|
|GSM8k (5-shot)                   |60.88|
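As a quick sanity check, the reported average is indeed the mean of the six task scores:

```python
scores = [66.30, 85.15, 63.50, 55.75, 78.93, 60.88]
print(round(sum(scores) / len(scores), 2))  # 68.42
```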