cstr committed on
Commit
d803e15
1 Parent(s): d012cb1

Update README.md

Files changed (1):
  1. README.md (+183, -63)
README.md CHANGED
@@ -1,63 +1,183 @@

**Removed:**

# **ORPO**

### **`Updates (24.03.25)`**
- [X] A sample script for ORPOTrainer in 🤗<a class="link" href="https://github.com/huggingface/trl">TRL</a> has been added as `trl/test_orpo_trainer_demo.py`
- [X] A new model, 🤗<a class="link" href="https://huggingface.co/kaist-ai/mistral-orpo-capybara-7k">kaist-ai/mistral-orpo-capybara-7k</a>, has been added to the 🤗<a class="link" href="https://huggingface.co/collections/kaist-ai/orpo-65efef87544ba100aef30013">ORPO Collection</a>
- [X] You can now try ORPO in 🤗<a class="link" href="https://github.com/huggingface/trl">TRL</a> and <a class="link" href="https://github.com/OpenAccess-AI-Collective/axolotl">Axolotl</a>🔥
- [X] We are preparing a general guideline for training LLMs with ORPO; stay tuned🔥
- [X] **Mistral-ORPO-β** achieved a 14.7% length-controlled (LC) win rate on the <a class="link" href="https://tatsu-lab.github.io/alpaca_eval/">official AlpacaEval Leaderboard</a>🔥

&nbsp;

This is the official repository for <a class="link" href="https://arxiv.org/abs/2403.07691">**ORPO: Monolithic Preference Optimization without Reference Model**</a>. The detailed results in the paper can be found in:
- [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=kaist-ai%2Fmistral-orpo-beta)
- [AlpacaEval](#alpacaeval)
- [MT-Bench](#mt-bench)
- [IFEval](#ifeval)

### **`Model Checkpoints`**

Our models trained with ORPO can be found at:

- [X] **Mistral-ORPO-Capybara-7k**: 🤗 <a class="link" href="https://huggingface.co/kaist-ai/mistral-orpo-capybara-7k">kaist-ai/mistral-orpo-capybara-7k</a>
- [X] **Mistral-ORPO-⍺**: 🤗 <a class="link" href="https://huggingface.co/kaist-ai/mistral-orpo-alpha">kaist-ai/mistral-orpo-alpha</a>
- [X] **Mistral-ORPO-β**: 🤗 <a class="link" href="https://huggingface.co/kaist-ai/mistral-orpo-beta">kaist-ai/mistral-orpo-beta</a>

The corresponding logs for the average log probabilities of chosen/rejected responses during training are reported in:

- [X] **Mistral-ORPO-Capybara-7k**: TBU
- [X] **Mistral-ORPO-⍺**: <a class="link" href="https://wandb.ai/jiwooya1000/PREF/reports/Mistral-ORPO-7B-Training-Log--Vmlldzo3MTE1NzE0?accessToken=rms6o4mg5vo3feu1bvbpk632m4cspe19l0u1p4he3othx5bgean82chn9neiile6">Wandb Report for Mistral-ORPO-⍺</a>
- [X] **Mistral-ORPO-β**: <a class="link" href="https://wandb.ai/jiwooya1000/PREF/reports/Mistral-ORPO-7B-Training-Log--Vmlldzo3MTE3MzMy?accessToken=dij4qbp6dcrofsanzbgobjsne9el8a2zkly2u5z82rxisd4wiwv1rhp0s2dub11e">Wandb Report for Mistral-ORPO-β</a>

&nbsp;

### **`AlpacaEval`**

<figure>
  <img class="png" src="/assets/img/alpaca_blog.png" alt="AlpacaEval 2.0 scores for models trained with different alignment methods">
  <figcaption><b>Figure 1.</b> AlpacaEval 2.0 scores for the models trained with different alignment methods.</figcaption>
</figure>

&nbsp;

### **`MT-Bench`**

<figure>
  <img class="png" src="/assets/img/mtbench_hf.png" alt="MT-Bench results by category">
  <figcaption><b>Figure 2.</b> MT-Bench results by category.</figcaption>
</figure>

&nbsp;

### **`IFEval`**

IFEval scores are measured with <a class="link" href="https://github.com/EleutherAI/lm-evaluation-harness">EleutherAI/lm-evaluation-harness</a> by applying the chat template. The scores for Llama-2-Chat (70B), Zephyr-β (7B), and Mixtral-8X7B-Instruct-v0.1 were originally reported in <a class="link" href="https://twitter.com/wiskojo/status/1739767758462877823">this tweet</a>.

| **Model Type**                 | **Prompt-Strict** | **Prompt-Loose** | **Inst-Strict** | **Inst-Loose** |
|--------------------------------|:-----------------:|:----------------:|:---------------:|:--------------:|
| **Llama-2-Chat (70B)**         | 0.4436            | 0.5342           | 0.5468          | 0.6319         |
| **Zephyr-β (7B)**              | 0.4233            | 0.4547           | 0.5492          | 0.5767         |
| **Mixtral-8X7B-Instruct-v0.1** | 0.5213            | **0.5712**       | 0.6343          | **0.6823**     |
| **Mistral-ORPO-⍺ (7B)**        | 0.5009            | 0.5083           | 0.5995          | 0.6163         |
| **Mistral-ORPO-β (7B)**        | **0.5287**        | 0.5564           | **0.6355**      | 0.6619         |

**Added:**

---
tags:
- merge
- mergekit
- lazymergekit
- flemmingmiguel/NeuDist-Ro-7B
- johannhartmann/Brezn3
- ResplendentAI/Flora_DPO_7B
base_model:
- flemmingmiguel/NeuDist-Ro-7B
- johannhartmann/Brezn3
- ResplendentAI/Flora_DPO_7B
language:
- de
- en
---
# Spaetzle-v8-7b

This model is intended to perform adequately in German and English across a range of tasks while behaving well, that is, without rambling on, intermixing tokens from the different templates seen during training and adaptation, and so on.

It is mostly a quick experiment and considerably weaker in German grammar and orthography than, for example, DiscoLM; but for use cases where that matters less than instruction following, reasoning, and the like, it might actually be slightly preferable.

It is a merge of the following models using [LazyMergekit](https://colab.research.google.com/drive/1obulZ1ROXHjYLn6PPZJwRR6GzgQogxxb?usp=sharing), with the full configuration and a merge sketch given below:
* [flemmingmiguel/NeuDist-Ro-7B](https://huggingface.co/flemmingmiguel/NeuDist-Ro-7B)
* [johannhartmann/Brezn3](https://huggingface.co/johannhartmann/Brezn3)
* [ResplendentAI/Flora_DPO_7B](https://huggingface.co/ResplendentAI/Flora_DPO_7B)
* merged on the basis of [mayflowergmbh/Wiedervereinigung-7b-dpo-laser](https://huggingface.co/mayflowergmbh/Wiedervereinigung-7b-dpo-laser)

All credit is due to the creators of the original models and the training datasets involved.

For a suitable quantized version, try [cstr/Spaetzle-v8-7b-GGUF](https://huggingface.co/cstr/Spaetzle-v8-7b-GGUF), for example as sketched below.
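
A minimal sketch of how such a quantization could be used with `llama-cpp-python`; the quantization filename pattern here is an assumption for illustration, so check the repo's file list first:

```python
# Sketch: chat with a GGUF quantization via llama-cpp-python (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="cstr/Spaetzle-v8-7b-GGUF",
    filename="*Q4_K_M.gguf",  # hypothetical pattern; pick a file that actually exists in the repo
    n_ctx=4096,               # context window to allocate
)

# llama.cpp applies the model's chat template (ChatML here) for us.
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Was ist ein Sprachmodell?"}],  # German prompt to exercise the bilingual claim
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```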

## Evaluation

[Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard); detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_cstr__Spaetzle-v8-7b).

| Metric                            | Value |
|-----------------------------------|------:|
| Avg.                              | 72.27 |
| AI2 Reasoning Challenge (25-shot) | 68.69 |
| HellaSwag (10-shot)               | 86.68 |
| MMLU (5-shot)                     | 64.60 |
| TruthfulQA (0-shot)               | 64.05 |
| Winogrande (5-shot)               | 81.45 |
| GSM8k (5-shot)                    | 68.16 |

EQ-Bench (v2_de): 61.04 / English (v2): 78.3

| Model                                                        |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
|--------------------------------------------------------------|------:|------:|---------:|-------:|------:|
| [Spaetzle-v8-7b](https://huggingface.co/cstr/Spaetzle-v8-7b) |  45.31|  75.69|     63.94|   45.57|  57.63|

### AGIEval
| Task                          |Version| Metric |Value|   |Stderr|
|-------------------------------|------:|--------|----:|---|-----:|
|agieval_aqua_rat               |      0|acc     |25.59|±  |  2.74|
|                               |       |acc_norm|24.80|±  |  2.72|
|agieval_logiqa_en              |      0|acc     |39.63|±  |  1.92|
|                               |       |acc_norm|39.78|±  |  1.92|
|agieval_lsat_ar                |      0|acc     |23.48|±  |  2.80|
|                               |       |acc_norm|24.35|±  |  2.84|
|agieval_lsat_lr                |      0|acc     |50.98|±  |  2.22|
|                               |       |acc_norm|51.96|±  |  2.21|
|agieval_lsat_rc                |      0|acc     |62.08|±  |  2.96|
|                               |       |acc_norm|62.83|±  |  2.95|
|agieval_sat_en                 |      0|acc     |78.64|±  |  2.86|
|                               |       |acc_norm|79.13|±  |  2.84|
|agieval_sat_en_without_passage |      0|acc     |44.66|±  |  3.47|
|                               |       |acc_norm|44.66|±  |  3.47|
|agieval_sat_math               |      0|acc     |37.27|±  |  3.27|
|                               |       |acc_norm|35.00|±  |  3.22|

Average: 45.31%

### GPT4All
| Task         |Version| Metric |Value|   |Stderr|
|--------------|------:|--------|----:|---|-----:|
|arc_challenge |      0|acc     |63.14|±  |  1.41|
|              |       |acc_norm|64.51|±  |  1.40|
|arc_easy      |      0|acc     |85.98|±  |  0.71|
|              |       |acc_norm|82.49|±  |  0.78|
|boolq         |      1|acc     |88.10|±  |  0.57|
|hellaswag     |      0|acc     |66.31|±  |  0.47|
|              |       |acc_norm|85.17|±  |  0.35|
|openbookqa    |      0|acc     |38.00|±  |  2.17|
|              |       |acc_norm|47.20|±  |  2.23|
|piqa          |      0|acc     |83.35|±  |  0.87|
|              |       |acc_norm|84.17|±  |  0.85|
|winogrande    |      0|acc     |78.22|±  |  1.16|

Average: 75.69%

### TruthfulQA
| Task         |Version|Metric|Value|   |Stderr|
|--------------|------:|------|----:|---|-----:|
|truthfulqa_mc |      1|mc1   |47.74|±  |  1.75|
|              |       |mc2   |63.94|±  |  1.53|

Average: 63.94%

### Bigbench
| Task                                             |Version|       Metric        |Value|   |Stderr|
|--------------------------------------------------|------:|---------------------|----:|---|-----:|
|bigbench_causal_judgement                         |      0|multiple_choice_grade|56.84|±  |  3.60|
|bigbench_date_understanding                       |      0|multiple_choice_grade|66.12|±  |  2.47|
|bigbench_disambiguation_qa                        |      0|multiple_choice_grade|41.47|±  |  3.07|
|bigbench_geometric_shapes                         |      0|multiple_choice_grade|22.01|±  |  2.19|
|                                                  |       |exact_str_match      | 0.00|±  |  0.00|
|bigbench_logical_deduction_five_objects           |      0|multiple_choice_grade|31.40|±  |  2.08|
|bigbench_logical_deduction_seven_objects          |      0|multiple_choice_grade|23.14|±  |  1.60|
|bigbench_logical_deduction_three_objects          |      0|multiple_choice_grade|56.00|±  |  2.87|
|bigbench_movie_recommendation                     |      0|multiple_choice_grade|45.00|±  |  2.23|
|bigbench_navigate                                 |      0|multiple_choice_grade|50.70|±  |  1.58|
|bigbench_reasoning_about_colored_objects          |      0|multiple_choice_grade|70.05|±  |  1.02|
|bigbench_ruin_names                               |      0|multiple_choice_grade|45.54|±  |  2.36|
|bigbench_salient_translation_error_detection      |      0|multiple_choice_grade|26.05|±  |  1.39|
|bigbench_snarks                                   |      0|multiple_choice_grade|71.82|±  |  3.35|
|bigbench_sports_understanding                     |      0|multiple_choice_grade|72.92|±  |  1.42|
|bigbench_temporal_sequences                       |      0|multiple_choice_grade|44.20|±  |  1.57|
|bigbench_tracking_shuffled_objects_five_objects   |      0|multiple_choice_grade|22.80|±  |  1.19|
|bigbench_tracking_shuffled_objects_seven_objects  |      0|multiple_choice_grade|18.23|±  |  0.92|
|bigbench_tracking_shuffled_objects_three_objects  |      0|multiple_choice_grade|56.00|±  |  2.87|

Average: 45.57%

Average score: 57.63%
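
The per-task tables above are in [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) output format. A minimal sketch for re-running a single benchmark locally, assuming the harness's v0.4 Python API and the usual leaderboard few-shot settings (not necessarily the exact setup used for the numbers above):

```python
# Sketch: re-running one benchmark with lm-evaluation-harness (pip install lm-eval).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=cstr/Spaetzle-v8-7b,dtype=float16",
    tasks=["arc_challenge"],  # 25-shot ARC, as on the Open LLM Leaderboard
    num_fewshot=25,
    batch_size=8,
)
print(results["results"]["arc_challenge"])  # dict of metrics for the task
```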

## 💻 Usage

```python
# pip install -qU transformers accelerate
from transformers import AutoTokenizer
import transformers
import torch

model = "cstr/Spaetzle-v8-7b"
messages = [{"role": "user", "content": "What is a large language model?"}]

# Build a ChatML prompt from the chat template shipped with the tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```

## 🧩 Configuration

The model uses ChatML and should work well with it, as it is merged from models that (mostly) saw ChatML templates in training; the expected prompt format is shown below.
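
For reference, assuming the tokenizer ships a standard ChatML chat template, `apply_chat_template(..., add_generation_prompt=True)` yields prompts of this shape:

```
<|im_start|>user
What is a large language model?<|im_end|>
<|im_start|>assistant
```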

```yaml
models:
  - model: mayflowergmbh/Wiedervereinigung-7b-dpo-laser
    # no parameters necessary for base model
  - model: flemmingmiguel/NeuDist-Ro-7B
    parameters:
      density: 0.60
      weight: 0.30
  - model: johannhartmann/Brezn3
    parameters:
      density: 0.65
      weight: 0.40
  - model: ResplendentAI/Flora_DPO_7B
    parameters:
      density: 0.6
      weight: 0.3
merge_method: dare_ties
base_model: mayflowergmbh/Wiedervereinigung-7b-dpo-laser
parameters:
  int8_mask: true
dtype: bfloat16
random_seed: 0
tokenizer_source: base
```
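
The merge could be reproduced from this configuration with mergekit; a minimal sketch, assuming the Python API shown in the mergekit README and placeholder file paths:

```python
# Sketch: run the above DARE-TIES merge with mergekit (pip install mergekit).
import yaml
from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

with open("spaetzle-v8-7b.yaml", "r", encoding="utf-8") as fp:  # the config above, saved to disk
    config = MergeConfiguration.model_validate(yaml.safe_load(fp))

run_merge(
    config,
    out_path="./Spaetzle-v8-7b",  # output directory (placeholder)
    options=MergeOptions(
        cuda=False,            # set True to merge on GPU
        copy_tokenizer=True,   # write a tokenizer (here taken from the base, per tokenizer_source)
        lazy_unpickle=True,    # lower memory use while loading shards
    ),
)
```

Equivalently, the `mergekit-yaml` command-line entry point can be pointed at the same configuration file.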