viethoangtranduong committed
Commit: df53ac2
Parent(s): e6e8d18

Update README.md

README.md CHANGED
@@ -6,17 +6,17 @@ datasets:
 
 Original post: [Snorkel link]
 
-#### Dataset
-
+#### Dataset:
+ONLY the prompts from [UltraFeedback](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized); **no external LLM responses used**.
 
-
+#### Methodology:
 1. Generate five response variations for each prompt in a subset of 20,000 prompts using the current LLM; to start, we used [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2).
 2. Apply [PairRM](https://huggingface.co/llm-blender/PairRM) to rerank the five responses.
 3. Update the LLM by applying Direct Preference Optimization (DPO) to the top-ranked (chosen) and bottom-ranked (rejected) responses.
 4. Use the updated LLM as the base model for the next iteration, repeating the process three times in total (minimal sketches of the generation/reranking and DPO steps follow the diff).
 
 This overview provides a high-level summary of our approach.
-We plan to release more detailed results and findings in the coming weeks on
+We plan to release more detailed results and findings in the coming weeks on the [Snorkel blog](https://snorkel.ai/blog/).
 
 #### Key Premises:
 - **Specialization Requirement**: Enterprises have very specific, advanced alignment axes that off-the-shelf LLMs are not yet aware of.
@@ -38,8 +38,8 @@ We use the [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7
 With this demonstration, we focus on the general approach of programmatic alignment.
 
 If you are interested in building your **specialized internal reward models
-that reflect your enterprises' needs**, please contact the Snorkel team or consider attending
-**
+that reflect your enterprise's needs**, please contact the Snorkel team or consider attending our
+[**Enterprise LLM Summit: Building GenAI with Your Data on January 25, 2024**](https://snorkel.ai/event/enterprise-llm-summit/)
 to learn more about "Programmatically scale human preferences and alignment in GenAI".
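For readers who want to see what steps 1-2 of the methodology look like in practice, below is a minimal sketch of sampling five responses per prompt and reranking them with PairRM to form a chosen/rejected pair. It is not the released training script: the decoding settings (temperature, max_new_tokens) and helper names (`generate_candidates`, `build_preference_pair`) are placeholders, and the `llm_blender.Blender` / `loadranker` / `rank` calls follow the usage shown on the PairRM model card.

```python
# Sketch of steps 1-2: sample five candidate responses per prompt, then rank them
# with PairRM and keep the best (chosen) and worst (rejected) for DPO.
import torch
import llm_blender
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # base model for the first iteration

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # pairwise reward model used for reranking


def generate_candidates(prompt: str, n: int = 5) -> list[str]:
    """Sample n response variations for a single prompt (placeholder decoding settings)."""
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(
        input_ids,
        do_sample=True,            # sampling produces diverse variations
        temperature=0.8,           # placeholder value, not the released setting
        max_new_tokens=512,
        num_return_sequences=n,
        pad_token_id=tokenizer.eos_token_id,
    )
    return [
        tokenizer.decode(out[input_ids.shape[1]:], skip_special_tokens=True)
        for out in outputs
    ]


def build_preference_pair(prompt: str) -> dict:
    """Rerank the candidates with PairRM; rank 1 is best, rank n is worst."""
    candidates = generate_candidates(prompt)
    ranks = blender.rank([prompt], [candidates])[0]
    order = sorted(range(len(candidates)), key=lambda i: ranks[i])
    return {
        "prompt": prompt,
        "chosen": candidates[order[0]],
        "rejected": candidates[order[-1]],
    }
```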
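And a similarly hedged sketch of steps 3-4: a DPO update on the preference pairs, repeated three times with each round seeded by the previous round's checkpoint. It assumes a recent version of `trl` (where `DPOConfig` carries the DPO options and `DPOTrainer` accepts `processing_class`); the hyperparameters (`beta`, learning rate, batch size) and `prompts_subset` are placeholders rather than the values used for the released model, and `build_preference_pair` comes from the sketch above.

```python
# Sketch of steps 3-4: fine-tune with DPO on the chosen/rejected pairs, then reuse
# the resulting checkpoint as the base model for the next round (three rounds total).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer


def dpo_round(base_model_name: str, pairs: list[dict], output_dir: str) -> str:
    """One DPO iteration over prompt/chosen/rejected pairs; returns the new checkpoint path."""
    model = AutoModelForCausalLM.from_pretrained(base_model_name)
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)

    # DPOTrainer expects a dataset with "prompt", "chosen", and "rejected" columns.
    train_dataset = Dataset.from_list(pairs)

    args = DPOConfig(
        output_dir=output_dir,
        beta=0.1,                         # placeholder DPO temperature
        per_device_train_batch_size=2,    # placeholder
        num_train_epochs=1,               # placeholder
        learning_rate=5e-7,               # placeholder
    )
    trainer = DPOTrainer(
        model=model,                      # reference model is created automatically when not given
        args=args,
        train_dataset=train_dataset,
        processing_class=tokenizer,
    )
    trainer.train()
    trainer.save_model(output_dir)
    return output_dir


# Outer loop (step 4): each round is seeded by the previous round's checkpoint.
prompts_subset = ["..."]  # placeholder for the 20,000 UltraFeedback prompts
base = "mistralai/Mistral-7B-Instruct-v0.2"
for i in range(3):
    pairs = [build_preference_pair(p) for p in prompts_subset]  # from the sketch above
    base = dpo_round(base, pairs, output_dir=f"dpo-iter-{i + 1}")
```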