---
license: apache-2.0
datasets:
- argilla/distilabel-intel-orca-dpo-pairs
language:
- en
tags:
- distilabel
- dpo
- rlaif
- rlhf
---

# ⚗️ distilabeled OpenHermes 2.5 Mistral 7B

> A Neural DPO of OpenHermes 2.5, high quality matters for DPO!

Built with Distilabel

## Introduction

This model is the virtual launching partner of our new open dataset [argilla/distilabel-intel-orca-dpo-pairs](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs). It's a DPO fine-tune of [OpenHermes-2.5-Mistral-7B](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B). It outperforms the awesome `mlabonne/NeuralHermes-2.5-Mistral-7B` with the **exact same DPO recipe but using our new orca-pairs dataset**.

The dataset is a "distilabeled" version of the widely used dataset: [Intel/orca_dpo_pairs](https://huggingface.co/datasets/Intel/orca_dpo_pairs). The original dataset has been used by hundreds of open source practitioners and models. We knew from fixing UltraFeedback (and before that, Alpacas and Dollys) that this dataset could be improved considerably.

Continuing with our mission to build the best alignment datasets for open source LLMs and the community, we spent a few hours improving it with [distilabel](https://github.com/argilla-io/distilabel).

The main intuition was: the original dataset just assumes the gpt4/3.5-turbo response is always the best one. We know from UltraFeedback that's not always the case. Moreover, DPO fine-tuning benefits from diversity of preference pairs.

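
The labelling code in this section calls a `shuffle_and_track` helper that this card does not define. As a rough sketch only, and not necessarily the helper the authors used, it could look like the following. The output column names `generations` and `order` and the optional `rng` parameter are assumptions made for illustration.

```python
import random


def shuffle_and_track(chosen, rejected, rng=random):
    """Randomly order a (chosen, rejected) pair and remember which is which.

    Hypothetical helper: the returned keys are assumed column names,
    not confirmed by this card.
    """
    pair = [("chosen", chosen), ("rejected", rejected)]
    rng.shuffle(pair)  # random order so the judge cannot rely on position
    return {
        "generations": [text for _, text in pair],  # responses in shuffled order
        "order": [label for label, _ in pair],      # original role of each response
    }


# Example on a single toy pair
example = shuffle_and_track("a careful answer", "a sloppy answer")
print(example["generations"], example["order"])
```

Keeping the `order` list next to the shuffled texts is what allows the judge's ratings to be mapped back to the original chosen and rejected responses afterwards.
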
This is what it took to build a real preference dataset with distilabel:

```python
from distilabel.llm import OpenAILLM
from distilabel.tasks import JudgeLMTask
from distilabel.pipeline import Pipeline

from datasets import load_dataset

dataset = load_dataset("Intel/orca_dpo_pairs", split="train")

# this shuffles the pairs to mitigate positional bias
# (shuffle_and_track is a helper that is not shown in this snippet)
dataset = dataset.map(lambda x: shuffle_and_track(x["chosen"], x["rejected"]))

# we use our JudgeLM implementation to rate the original pairs
labeler = OpenAILLM(
    task=JudgeLMTask(),
    model="gpt-4-1106-preview",
    num_threads=16,
    max_new_tokens=512,
)

dataset = dataset.rename_columns({"question": "input"})

distipipe = Pipeline(
    labeller=labeler
)

# this computes ratings and natural language critiques for each pair
ds = distipipe.generate(dataset=dataset, num_generations=2)
```

The resulting dataset is now much more useful: we know which response is preferred (by gpt-4-turbo), which ones have low scores, and we even have natural language explanations. But what did we find? Was our intuition confirmed?

![image/png](https://cdn-uploads.huggingface.co/production/uploads/60420dccc15e823a685f2b03/-V8wY1DYzrtwM9LbGrBXq.png)

The above chart shows the following:

* ~4,000 pairs were given the same rating (a tie).
* ~7,000 pairs were correct according to our AI judge (`unchanged`).
* In ~2,000 pairs the rejected response was preferred (`swapped`).

Now the next question is: can we build better models with this new knowledge? The answer is "distilabeled Hermes", so let's get back to the model!

> If you love datasets as much as we do, check the [dataset](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs) and share it with your friends and colleagues.

## Training details

As we did with [Notus](https://argilla.io/blog/notus7b/), we wanted a reproducible recipe to test the impact of data quality.
And we're lucky to have so many amazing folks in the open community contributing reproducible, easy-to-use training scripts and recipes. This time, [Maxime Labonne](https://twitter.com/maximelabonne) shared a [Colab](https://colab.research.google.com/drive/15iFBr1xWgztXvhrj5I9fBv20c7CFOPBE?usp=sharing) to fine-tune OpenHermes with DPO and Intel's original dataset, perfect! (Funnily enough, this exact recipe has been used recently to fine-tune the [top ranked 7B model](https://huggingface.co/CultriX/MistralTrix-v1).)

And that's all for the model part: we reused a good, reproducible recipe.

Once we had created the dataset, the training data part is also kind of boring: we just filtered the samples based on our intuition and with the goal of reducing the dataset size:

* Ties probably won't help the DPO tuning to learn something meaningful: both responses are similarly good or bad (filter out `ties`)
* Very good chosen responses will steer the model to generate good responses (score of chosen response >=8)

Additionally, we did some "decontamination" of gsm8k prompts (a very small number of them were present in the train split of gsm8k).

In code, using our new dataset, this translates into:

```python
from datasets import load_dataset

# Instead of this:
# dataset = load_dataset("Intel/orca_dpo_pairs", split="train")

# we did this
dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")

dataset = dataset.filter(
    lambda r:
        r["status"] != "tie" and
        r["chosen_score"] >= 8 and
        not r["in_gsm8k_train"]
)
```

This resulted in `5,922` instead of `12,859` samples (54% reduction), and we ran it for 200 steps (using around 3.2K samples).

## Benchmark results

For benchmarking we used the famous "Nous" or "Teknium" benchmark. You can find below an overview, including our first experiment with a less ambitious dataset filtering (removing ties and `score>5`).
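
To make the two filtering settings concrete, here is a small sketch that applies both to a handful of toy records. The field names come from the filter shown earlier in this card. Reading `score>5` as `chosen_score > 5` is an assumption, it is not stated whether the first experiment also applied the gsm8k decontamination (the sketch assumes it did not), and the toy records are made up for illustration.

```python
def final_filter(r):
    # Final recipe from this card: no ties, chosen_score >= 8, not in gsm8k train.
    return r["status"] != "tie" and r["chosen_score"] >= 8 and not r["in_gsm8k_train"]


def first_experiment_filter(r):
    # ASSUMPTION: "removing ties and score>5" is read as chosen_score > 5,
    # without the gsm8k decontamination step.
    return r["status"] != "tie" and r["chosen_score"] > 5


# Hypothetical toy records, for illustration only.
toy_records = [
    {"status": "unchanged", "chosen_score": 9, "in_gsm8k_train": False},
    {"status": "swapped", "chosen_score": 7, "in_gsm8k_train": False},
    {"status": "tie", "chosen_score": 8, "in_gsm8k_train": False},
    {"status": "unchanged", "chosen_score": 8, "in_gsm8k_train": True},
    {"status": "unchanged", "chosen_score": 4, "in_gsm8k_train": False},
]

kept_final = [r for r in toy_records if final_filter(r)]
kept_first = [r for r in toy_records if first_experiment_filter(r)]

print(f"final recipe keeps {len(kept_final)} of {len(toy_records)} records")
print(f"first experiment keeps {len(kept_first)} of {len(toy_records)} records")
```

The stricter threshold keeps fewer pairs, which matches the stated goal of shrinking the training set while keeping only pairs with a clearly good chosen response.
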
For running the benchmark we used another awesome contribution from Maxime: [LLM AutoEval](https://github.com/mlabonne/llm-autoeval), check it out!

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|-------|--------:|--------:|-----------:|---------:|--------:|
| [argilla/distilabeled-Hermes-2.5-Mistral-7B](https://huggingface.co/argilla/distilabeled-Hermes-2.5-Mistral-7B) | **44.64** | **73.35** | 55.96 | 42.21 | **54.04** |
| [dvilasuero/NeuralHermes-2.5-Mistral-7B-distilabel](https://huggingface.co/dvilasuero/NeuralHermes-2.5-Mistral-7B-distilabel) (first experiment) | 44.27 | 73.3 | **56.26** | **42.25** | 54.02 |
| mlabonne/NeuralHermes-2.5-Mistral-7B (original recipe) | 43.67 | 73.24 | 55.37 | 41.76 | 53.51 |
| teknium/OpenHermes-2.5-Mistral-7B | 42.75 | 72.99 | 52.99 | 40.94 | 52.42 |

> Update: we now include llm-harness results too!

| Model | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
|-------|-------|-----------|------|-----------:|------------|-------|
| [argilla/distilabeled-Hermes-2.5-Mistral-7B](https://huggingface.co/argilla/distilabeled-Hermes-2.5-Mistral-7B) | 66.04 | **85.07** | Pending | 55.96 | **79.56** | **66.34** |
| [dvilasuero/NeuralHermes-2.5-Mistral-7B-distilabel](https://huggingface.co/dvilasuero/NeuralHermes-2.5-Mistral-7B-distilabel) | 65.36 | 84.74 | Pending | **56.26** | 79.24 | 65.13 |
| [mlabonne/NeuralHermes-2.5-Mistral-7B](https://huggingface.co/mlabonne/NeuralHermes-2.5-Mistral-7B) | **66.55** | 84.90 | 63.32 | 54.93 | 78.30 | 61.30 |
| [teknium/OpenHermes-2.5-Mistral-7B](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B) | 64.93 | 84.18 | **63.64** | 52.24 | 78.06 | 26.08 |

### Training Hardware

We used 1 x A100 40GB on RunPod for less than 1 hour.

## Acknowledgements

We'd like to thank the amazing open community and in particular:

* The Intel team for publishing a great open dataset and showing how well it worked in the first place.
* Teknium and NousResearch for their awesome work and models.
* Maxime for sharing such great resources.