cstr committed on
Commit
d803e15
1 Parent(s): d012cb1

Update README.md

Files changed (1):
  1. README.md (+183, -63)
README.md CHANGED
@@ -1,63 +1,183 @@

**Removed:**

# **ORPO**

### **`Updates (24.03.25)`**
- [X] A sample script for ORPOTrainer in 🤗<a class="link" href="https://github.com/huggingface/trl">TRL</a> has been added as `trl/test_orpo_trainer_demo.py`
- [X] A new model, 🤗<a class="link" href="https://huggingface.co/kaist-ai/mistral-orpo-capybara-7k">kaist-ai/mistral-orpo-capybara-7k</a>, has been added to the 🤗<a class="link" href="https://huggingface.co/collections/kaist-ai/orpo-65efef87544ba100aef30013">ORPO Collection</a>
- [X] You can now try ORPO in 🤗<a class="link" href="https://github.com/huggingface/trl">TRL</a> and <a class="link" href="https://github.com/OpenAccess-AI-Collective/axolotl">Axolotl</a>🔥
- [X] We are preparing a general guideline for training LLMs with ORPO; stay tuned🔥
- [X] **Mistral-ORPO-β** achieved a 14.7% length-controlled (LC) win rate on the <a class="link" href="https://tatsu-lab.github.io/alpaca_eval/">official AlpacaEval Leaderboard</a>🔥

&nbsp;

This is the official repository for <a class="link" href="https://arxiv.org/abs/2403.07691">**ORPO: Monolithic Preference Optimization without Reference Model**</a>. The detailed results in the paper can be found in:
- [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=kaist-ai%2Fmistral-orpo-beta)
- [AlpacaEval](#alpacaeval)
- [MT-Bench](#mt-bench)
- [IFEval](#ifeval)

### **`Model Checkpoints`**

Our models trained with ORPO can be found at:

- [X] **Mistral-ORPO-Capybara-7k**: 🤗 <a class="link" href="https://huggingface.co/kaist-ai/mistral-orpo-capybara-7k">kaist-ai/mistral-orpo-capybara-7k</a>
- [X] **Mistral-ORPO-⍺**: 🤗 <a class="link" href="https://huggingface.co/kaist-ai/mistral-orpo-alpha">kaist-ai/mistral-orpo-alpha</a>
- [X] **Mistral-ORPO-β**: 🤗 <a class="link" href="https://huggingface.co/kaist-ai/mistral-orpo-beta">kaist-ai/mistral-orpo-beta</a>

The corresponding logs for the average log probabilities of chosen/rejected responses during training are reported in:

- [X] **Mistral-ORPO-Capybara-7k**: TBU
- [X] **Mistral-ORPO-⍺**: <a class="link" href="https://wandb.ai/jiwooya1000/PREF/reports/Mistral-ORPO-7B-Training-Log--Vmlldzo3MTE1NzE0?accessToken=rms6o4mg5vo3feu1bvbpk632m4cspe19l0u1p4he3othx5bgean82chn9neiile6">Wandb Report for Mistral-ORPO-⍺</a>
- [X] **Mistral-ORPO-β**: <a class="link" href="https://wandb.ai/jiwooya1000/PREF/reports/Mistral-ORPO-7B-Training-Log--Vmlldzo3MTE3MzMy?accessToken=dij4qbp6dcrofsanzbgobjsne9el8a2zkly2u5z82rxisd4wiwv1rhp0s2dub11e">Wandb Report for Mistral-ORPO-β</a>

&nbsp;

### **`AlpacaEval`**

<figure>
  <img class="png" src="/assets/img/alpaca_blog.png" alt="AlpacaEval 2.0 scores for models trained with different alignment methods">
  <figcaption><b>Figure 1.</b> AlpacaEval 2.0 scores for the models trained with different alignment methods.</figcaption>
</figure>

&nbsp;

### **`MT-Bench`**

<figure>
  <img class="png" src="/assets/img/mtbench_hf.png" alt="MT-Bench results by category">
  <figcaption><b>Figure 2.</b> MT-Bench results by category.</figcaption>
</figure>

&nbsp;

### **`IFEval`**

IFEval scores are measured with <a class="link" href="https://github.com/EleutherAI/lm-evaluation-harness">EleutherAI/lm-evaluation-harness</a> by applying the chat template. The scores for Llama-2-Chat (70B), Zephyr-β (7B), and Mixtral-8X7B-Instruct-v0.1 were originally reported in <a class="link" href="https://twitter.com/wiskojo/status/1739767758462877823">this tweet</a>.

| **Model Type**                 | **Prompt-Strict** | **Prompt-Loose** | **Inst-Strict** | **Inst-Loose** |
|--------------------------------|:-----------------:|:----------------:|:---------------:|:--------------:|
| **Llama-2-Chat (70B)**         | 0.4436            | 0.5342           | 0.5468          | 0.6319         |
| **Zephyr-β (7B)**              | 0.4233            | 0.4547           | 0.5492          | 0.5767         |
| **Mixtral-8X7B-Instruct-v0.1** | 0.5213            | **0.5712**       | 0.6343          | **0.6823**     |
| **Mistral-ORPO-⍺ (7B)**        | 0.5009            | 0.5083           | 0.5995          | 0.6163         |
| **Mistral-ORPO-β (7B)**        | **0.5287**        | 0.5564           | **0.6355**      | 0.6619         |

**Added:**

---
tags:
- merge
- mergekit
- lazymergekit
- flemmingmiguel/NeuDist-Ro-7B
- johannhartmann/Brezn3
- ResplendentAI/Flora_DPO_7B
base_model:
- flemmingmiguel/NeuDist-Ro-7B
- johannhartmann/Brezn3
- ResplendentAI/Flora_DPO_7B
language:
- de
- en
---
# Spaetzle-v8-7b

This model is intended to perform adequately in German and English across a range of tasks while behaving well, that is, without rambling on, intermixing tokens from the different templates seen during training and adaptation, and so on.

It is mostly a quick experiment and considerably weaker in German grammar and orthography than, for example, DiscoLM; but for use cases where that matters less than instruction following, reasoning, and the like, it might actually be slightly preferable.

It is a merge of the following models using [LazyMergekit](https://colab.research.google.com/drive/1obulZ1ROXHjYLn6PPZJwRR6GzgQogxxb?usp=sharing), with the full configuration and a merge sketch given below:
* [flemmingmiguel/NeuDist-Ro-7B](https://huggingface.co/flemmingmiguel/NeuDist-Ro-7B)
* [johannhartmann/Brezn3](https://huggingface.co/johannhartmann/Brezn3)
* [ResplendentAI/Flora_DPO_7B](https://huggingface.co/ResplendentAI/Flora_DPO_7B)
* merged on the basis of [mayflowergmbh/Wiedervereinigung-7b-dpo-laser](https://huggingface.co/mayflowergmbh/Wiedervereinigung-7b-dpo-laser)

All credit is due to the creators of the original models and the training datasets involved.

For a suitable quantized version, try [cstr/Spaetzle-v8-7b-GGUF](https://huggingface.co/cstr/Spaetzle-v8-7b-GGUF), for example as sketched below.
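
A minimal sketch of how such a quantization could be used with `llama-cpp-python`; the quantization filename pattern here is an assumption for illustration, so check the repo's file list first:

```python
# Sketch: chat with a GGUF quantization via llama-cpp-python (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="cstr/Spaetzle-v8-7b-GGUF",
    filename="*Q4_K_M.gguf",  # hypothetical pattern; pick a file that actually exists in the repo
    n_ctx=4096,               # context window to allocate
)

# llama.cpp applies the model's chat template (ChatML here) for us.
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Was ist ein Sprachmodell?"}],  # German prompt to exercise the bilingual claim
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```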

## Evaluation

[Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard); detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_cstr__Spaetzle-v8-7b).

| Metric                            | Value |
|-----------------------------------|------:|
| Avg.                              | 72.27 |
| AI2 Reasoning Challenge (25-shot) | 68.69 |
| HellaSwag (10-shot)               | 86.68 |
| MMLU (5-shot)                     | 64.60 |
| TruthfulQA (0-shot)               | 64.05 |
| Winogrande (5-shot)               | 81.45 |
| GSM8k (5-shot)                    | 68.16 |

EQ-Bench (v2_de): 61.04 / English (v2): 78.3

| Model                                                        |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
|--------------------------------------------------------------|------:|------:|---------:|-------:|------:|
| [Spaetzle-v8-7b](https://huggingface.co/cstr/Spaetzle-v8-7b) |  45.31|  75.69|     63.94|   45.57|  57.63|

### AGIEval
| Task                          |Version| Metric |Value|   |Stderr|
|-------------------------------|------:|--------|----:|---|-----:|
|agieval_aqua_rat               |      0|acc     |25.59|±  |  2.74|
|                               |       |acc_norm|24.80|±  |  2.72|
|agieval_logiqa_en              |      0|acc     |39.63|±  |  1.92|
|                               |       |acc_norm|39.78|±  |  1.92|
|agieval_lsat_ar                |      0|acc     |23.48|±  |  2.80|
|                               |       |acc_norm|24.35|±  |  2.84|
|agieval_lsat_lr                |      0|acc     |50.98|±  |  2.22|
|                               |       |acc_norm|51.96|±  |  2.21|
|agieval_lsat_rc                |      0|acc     |62.08|±  |  2.96|
|                               |       |acc_norm|62.83|±  |  2.95|
|agieval_sat_en                 |      0|acc     |78.64|±  |  2.86|
|                               |       |acc_norm|79.13|±  |  2.84|
|agieval_sat_en_without_passage |      0|acc     |44.66|±  |  3.47|
|                               |       |acc_norm|44.66|±  |  3.47|
|agieval_sat_math               |      0|acc     |37.27|±  |  3.27|
|                               |       |acc_norm|35.00|±  |  3.22|

Average: 45.31%

### GPT4All
| Task         |Version| Metric |Value|   |Stderr|
|--------------|------:|--------|----:|---|-----:|
|arc_challenge |      0|acc     |63.14|±  |  1.41|
|              |       |acc_norm|64.51|±  |  1.40|
|arc_easy      |      0|acc     |85.98|±  |  0.71|
|              |       |acc_norm|82.49|±  |  0.78|
|boolq         |      1|acc     |88.10|±  |  0.57|
|hellaswag     |      0|acc     |66.31|±  |  0.47|
|              |       |acc_norm|85.17|±  |  0.35|
|openbookqa    |      0|acc     |38.00|±  |  2.17|
|              |       |acc_norm|47.20|±  |  2.23|
|piqa          |      0|acc     |83.35|±  |  0.87|
|              |       |acc_norm|84.17|±  |  0.85|
|winogrande    |      0|acc     |78.22|±  |  1.16|

Average: 75.69%

### TruthfulQA
| Task         |Version|Metric|Value|   |Stderr|
|--------------|------:|------|----:|---|-----:|
|truthfulqa_mc |      1|mc1   |47.74|±  |  1.75|
|              |       |mc2   |63.94|±  |  1.53|

Average: 63.94%

### Bigbench
| Task                                             |Version|       Metric        |Value|   |Stderr|
|--------------------------------------------------|------:|---------------------|----:|---|-----:|
|bigbench_causal_judgement                         |      0|multiple_choice_grade|56.84|±  |  3.60|
|bigbench_date_understanding                       |      0|multiple_choice_grade|66.12|±  |  2.47|
|bigbench_disambiguation_qa                        |      0|multiple_choice_grade|41.47|±  |  3.07|
|bigbench_geometric_shapes                         |      0|multiple_choice_grade|22.01|±  |  2.19|
|                                                  |       |exact_str_match      | 0.00|±  |  0.00|
|bigbench_logical_deduction_five_objects           |      0|multiple_choice_grade|31.40|±  |  2.08|
|bigbench_logical_deduction_seven_objects          |      0|multiple_choice_grade|23.14|±  |  1.60|
|bigbench_logical_deduction_three_objects          |      0|multiple_choice_grade|56.00|±  |  2.87|
|bigbench_movie_recommendation                     |      0|multiple_choice_grade|45.00|±  |  2.23|
|bigbench_navigate                                 |      0|multiple_choice_grade|50.70|±  |  1.58|
|bigbench_reasoning_about_colored_objects          |      0|multiple_choice_grade|70.05|±  |  1.02|
|bigbench_ruin_names                               |      0|multiple_choice_grade|45.54|±  |  2.36|
|bigbench_salient_translation_error_detection      |      0|multiple_choice_grade|26.05|±  |  1.39|
|bigbench_snarks                                   |      0|multiple_choice_grade|71.82|±  |  3.35|
|bigbench_sports_understanding                     |      0|multiple_choice_grade|72.92|±  |  1.42|
|bigbench_temporal_sequences                       |      0|multiple_choice_grade|44.20|±  |  1.57|
|bigbench_tracking_shuffled_objects_five_objects   |      0|multiple_choice_grade|22.80|±  |  1.19|
|bigbench_tracking_shuffled_objects_seven_objects  |      0|multiple_choice_grade|18.23|±  |  0.92|
|bigbench_tracking_shuffled_objects_three_objects  |      0|multiple_choice_grade|56.00|±  |  2.87|

Average: 45.57%

Average score: 57.63%
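
The per-task tables above are in [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) output format. A minimal sketch for re-running a single benchmark locally, assuming the harness's v0.4 Python API and the usual leaderboard few-shot settings (not necessarily the exact setup used for the numbers above):

```python
# Sketch: re-running one benchmark with lm-evaluation-harness (pip install lm-eval).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=cstr/Spaetzle-v8-7b,dtype=float16",
    tasks=["arc_challenge"],  # 25-shot ARC, as on the Open LLM Leaderboard
    num_fewshot=25,
    batch_size=8,
)
print(results["results"]["arc_challenge"])  # dict of metrics for the task
```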

## 💻 Usage

```python
# pip install -qU transformers accelerate
from transformers import AutoTokenizer
import transformers
import torch

model = "cstr/Spaetzle-v8-7b"
messages = [{"role": "user", "content": "What is a large language model?"}]

# Build a ChatML prompt from the chat template shipped with the tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```

## 🧩 Configuration

The model uses ChatML and should work well with it, as it is merged from models that (mostly) saw ChatML templates in training; the expected prompt format is shown below.
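
For reference, assuming the tokenizer ships a standard ChatML chat template, `apply_chat_template(..., add_generation_prompt=True)` yields prompts of this shape:

```
<|im_start|>user
What is a large language model?<|im_end|>
<|im_start|>assistant
```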

```yaml
models:
  - model: mayflowergmbh/Wiedervereinigung-7b-dpo-laser
    # no parameters necessary for base model
  - model: flemmingmiguel/NeuDist-Ro-7B
    parameters:
      density: 0.60
      weight: 0.30
  - model: johannhartmann/Brezn3
    parameters:
      density: 0.65
      weight: 0.40
  - model: ResplendentAI/Flora_DPO_7B
    parameters:
      density: 0.6
      weight: 0.3
merge_method: dare_ties
base_model: mayflowergmbh/Wiedervereinigung-7b-dpo-laser
parameters:
  int8_mask: true
dtype: bfloat16
random_seed: 0
tokenizer_source: base
```
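
The merge could be reproduced from this configuration with mergekit; a minimal sketch, assuming the Python API shown in the mergekit README and placeholder file paths:

```python
# Sketch: run the above DARE-TIES merge with mergekit (pip install mergekit).
import yaml
from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

with open("spaetzle-v8-7b.yaml", "r", encoding="utf-8") as fp:  # the config above, saved to disk
    config = MergeConfiguration.model_validate(yaml.safe_load(fp))

run_merge(
    config,
    out_path="./Spaetzle-v8-7b",  # output directory (placeholder)
    options=MergeOptions(
        cuda=False,            # set True to merge on GPU
        copy_tokenizer=True,   # write a tokenizer (here taken from the base, per tokenizer_source)
        lazy_unpickle=True,    # lower memory use while loading shards
    ),
)
```

Equivalently, the `mergekit-yaml` command-line entry point can be pointed at the same configuration file.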