---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- orpo
- qlora
- trl
datasets:
- alvarobartt/dpo-mix-7k-simplified
- argilla/dpo-mix-7k
base_model: mistralai/Mistral-7B-v0.1
pipeline_tag: text-generation
inference: false
---

## ORPO fine-tune of Mistral 7B v0.1 with DPO Mix 7K

![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/60f0608166e5701b80ed3f02/hRyhnTySt-KQ0gnnoclSm.jpeg)

> Stable Diffusion XL "A capybara, a killer whale, and a robot named Ultra being friends"

This is an ORPO fine-tune of [`mistralai/Mistral-7B-v0.1`](https://huggingface.co/mistralai/Mistral-7B-v0.1) with
[`alvarobartt/dpo-mix-7k-simplified`](https://huggingface.co/datasets/alvarobartt/dpo-mix-7k-simplified).

โš ๏ธ Note that the code is still experimental, as the `ORPOTrainer` PR is still not merged, follow its progress
at [๐Ÿค—`trl` - `ORPOTrainer` PR](https://github.com/huggingface/trl/pull/1435).

## About the fine-tuning

To fine-tune [`mistralai/Mistral-7B-v0.1`](https://huggingface.co/mistralai/Mistral-7B-v0.1) with ORPO, the `orpo` branch
of [🤗`trl`](https://github.com/huggingface/trl) has been used, thanks to the invaluable and quick contribution of
@kashif.

ORPO stands for Odds Ratio Preference Optimization, and it defines a new paradigm for fine-tuning LLMs, combining both the SFT
and the PPO/DPO stages into a single one, thanks to a proposed loss function that works directly on a preference dataset, i.e.
chosen-rejected pairs.
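
For reference, the objective introduced in the paper (in slightly simplified notation) adds a log odds-ratio penalty on top of the usual SFT loss over the chosen completion:

$$
\mathcal{L}_{ORPO} = \mathbb{E}_{(x, y_w, y_l)} \left[ \mathcal{L}_{SFT} + \lambda \cdot \mathcal{L}_{OR} \right], \quad
\mathcal{L}_{OR} = -\log \sigma \left( \log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)} \right)
$$

where $\text{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}$, $y_w$ is the chosen completion, $y_l$ the rejected one, and $\lambda$ weights the odds-ratio term.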

Some key features of ORPO:
- ⚡️ Faster to train, as it's now a single-stage fine-tuning
- 👨🏻‍🏫 Requires preference data, i.e. (prompt, chosen, rejected)-like datasets
- ⬇️ Uses less memory than PPO/DPO, as it doesn't need a reference model
- 🏆 SOTA results for Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) when fine-tuned using single-turn UltraFeedback

Some notes on the experiments mentioned in the paper:
- 📌 LLMs of up to 7B parameters were fine-tuned, achieving better performance than other 7B counterparts and even some 13B LLMs
- 📌 Not yet trained on multi-turn datasets such as Capybara (which may be an interesting experiment to run)
- 📌 The OPT models were fine-tuned on Anthropic's HH-RLHF, truncated and padded to 1024 tokens, filtering out prompts with more than 1024 tokens
- 📌 Phi-2, Mistral (7B), and Llama 2 (7B) were fine-tuned on UltraFeedback from OpenBMB, truncated and padded to 2048 tokens, filtering out prompts with more than 1024 tokens
- 📌 Fine-tuned for 10 epochs, using the evaluation loss as the metric for selecting the best models

For more information about ORPO, I highly recommend reading their paper titled [`ORPO: Monolithic Preference Optimization without Reference Model`](https://huggingface.co/papers/2403.07691),
as it contains a lot of details not only on the ORPO method itself, but also on the experiments they ran, the results they got, and much more.

📅 Fine-tuning code will be shared soon, stay tuned!
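
In the meantime, here is a minimal, hypothetical sketch of what an ORPO + QLoRA run could look like once the `ORPOTrainer` PR lands; the trainer interface, the `ORPOConfig` arguments, and all hyperparameters below are assumptions for illustration, not the exact setup used for this model:

```python
# Hypothetical sketch: ORPO + QLoRA fine-tuning of Mistral 7B v0.1.
# The `ORPOTrainer` / `ORPOConfig` API is taken from the (unmerged) trl PR and may
# change; hyperparameters are illustrative placeholders, not the ones used here.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import ORPOConfig, ORPOTrainer

model_id = "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # the base model ships without a pad token

# Load the base model in 4-bit (QLoRA) and attach LoRA adapters
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32))

# Preference dataset with `prompt`, `chosen`, and `rejected` columns
# (the chat-template pre-processing discussed below is assumed to be done already)
dataset = load_dataset("alvarobartt/dpo-mix-7k-simplified", split="train")

trainer = ORPOTrainer(
    model=model,
    args=ORPOConfig(
        output_dir="mistral-7b-orpo",
        beta=0.1,                      # weight of the odds-ratio term (lambda)
        max_length=2048,
        max_prompt_length=1024,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=5e-6,
        bf16=True,
    ),
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```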

## About the dataset

The dataset used for this fine-tune is [`alvarobartt/dpo-mix-7k-simplified`](https://huggingface.co/datasets/alvarobartt/dpo-mix-7k-simplified),
which is a simplified version of [`argilla/dpo-mix-7k`](https://huggingface.co/datasets/argilla/dpo-mix-7k).

The simplification comes from the fact that the `prompt` column is detached from both the `chosen` and `rejected`
columns, so that no extra pre-processing is needed when applying the chat template to the dataset before the
fine-tuning. Other than that, the dataset remains as is, with an additional column for the `prompt`.
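
As an illustration, a possible pre-processing step could look like the sketch below. It assumes that `chosen` and `rejected` hold lists of chat messages and that a ChatML template is used (Mistral 7B v0.1 is a base model without a chat template of its own); both are assumptions, not necessarily what was done for this fine-tune:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
# Assumption: use a ChatML template, since the base model does not define one
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>\n' }}"
    "{% endfor %}"
)

dataset = load_dataset("alvarobartt/dpo-mix-7k-simplified", split="train")

def apply_template(example):
    # `prompt` is already a standalone string column, so only `chosen` and
    # `rejected` (assumed to be lists of messages) need to be formatted
    return {
        "chosen": tokenizer.apply_chat_template(example["chosen"], tokenize=False),
        "rejected": tokenizer.apply_chat_template(example["rejected"], tokenize=False),
    }

dataset = dataset.map(apply_template)
```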

The dataset is a small cocktail combining Argilla's latest efforts on DPO datasets, mixing the following datasets:

* [`argilla/distilabel-capybara-dpo-7k-binarized`](https://huggingface.co/datasets/argilla/distilabel-capybara-dpo-7k-binarized)
* [`argilla/distilabel-intel-orca-dpo-pairs`](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs)
* [`argilla/ultrafeedback-binarized-preferences-cleaned`](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned)

The samples have been randomly selected from the original datasets with a proportion of 0.33 each, as can be seen via the `dataset` column of the dataset.
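
For example, those proportions can be checked directly from the `dataset` column (a quick, hypothetical snippet; the split name is assumed to be `train`):

```python
from collections import Counter

from datasets import load_dataset

mix = load_dataset("argilla/dpo-mix-7k", split="train")
counts = Counter(mix["dataset"])  # source dataset of each sample
total = sum(counts.values())
for source, count in counts.items():
    print(f"{source}: {count} samples ({count / total:.2%})")
```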

For more information about the original dataset, check [the `README.md` file of `argilla/dpo-mix-7k`](https://huggingface.co/datasets/argilla/dpo-mix-7k/blob/main/README.md).