snorkelai/Snorkel-Mistral-PairRM-DPO

@@ -16,18 +16,15 @@ ONLY the prompts from [UltraFeedback](https://huggingface.co/datasets/HuggingFac
   4. Use this LLM as the base model for the next iteration, repeating three times in total.
 This overview provides a high-level summary of our approach.
-We plan to release more detailed results and findings in the coming weeks on [Snorkel blogs](https://snorkel.ai/blog/).
 #### Key Premises:
-- **Specialization Requirement**: In enterprises, you will have very specific advanced alignment axes, where your LLMs currently do not have such awareness yet.
 - **Ease of Model Building**: Creating ranking/scoring/classification models is simpler than developing high-quality, manually annotated datasets for long-form responses.
-- **Programmatic Alignment**: Using smaller but specialized teacher models (reward models) can incrementally align LLMs towards specific axes. We call this **Programmatic Alignment** - using programmatic, weak signals to guide your LLM improvement. Multiple reward models can be scaled to different axes as required.
 #### Contemporary Work and Acknowledgements:
-We would also like to acknowledge contemporary work published a few days ago by Meta & NYU in a paper called [Self-Rewarding Language Models](https://arxiv.org/abs/2401.10020),
-which proposes a similar approach for creating alignment pairs from a larger set of candidate responses but using their LLM as the reward model.
-While this may work for general-purpose models, our experience has shown that task-specific reward models guided by SMEs are necessary for
-most enterprise applications of LLMs to specific use cases.
 #### Applications:
 Unlike our customers, who have very specific use cases to align LLMs to,
@@ -38,10 +35,9 @@ We use the [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7
 With this demonstration, we focus on the general approach of programmatic alignment.
 For interest in building your **specialized internal reward models
-that reflect your enterprises' needs**, please contact the Snorkel team or consider attending our
 [**Enterprise LLM Summit: Building GenAI with Your Data on January 25, 2024**](https://snorkel.ai/event/enterprise-llm-summit/)
-to learn more about "Programmatically scale human preferences and alignment in GenAI".
 #### Result:
 On [**Alpaca-Eval 2.0**](https://tatsu-lab.github.io/alpaca_eval/):
@@ -49,7 +45,7 @@ On [**Alpaca-Eval 2.0**](https://tatsu-lab.github.io/alpaca_eval/):
 After applying the above methodology:
 - This model scored **30.2** - ranked 3rd and the highest for an open-source base model at the time of publication.
 - With PairRM applied at inference time (generating 16 responses and submitting the one PairRM scores highest), the model scored **34.86** - ranked 2nd.
-The best model on the leaderboard is "gpt-4-turbo".
 We recognize that the Alpaca-Eval 2.0 benchmark does not entirely capture the full range of capabilities and performances of LLMs.
 However, in our current work, where the goal is to align with general "human preferences," Alpaca-Eval 2.0 serves as a suitable and representative benchmark.
@@ -57,14 +53,19 @@ Moving forward, we anticipate further contributions from the community regarding
 ## Limitations:
 The model is a quick demonstration that LLMs can be programmatically aligned using smaller specialized reward models.
-It does not have any moderation mechanisms. We're looking forward to engaging with the community on ways to
-make the model finely respect guardrails, allowing for deployment in environments requiring moderated outputs.
-## Acknowledgments
 - The Mistral AI Team for developing and releasing the advanced Mistral-7B-Instruct-v0.2 model.
 - The authors of the [Direct Preference Optimization paper](https://arxiv.org/abs/2305.18290) for the innovative approach
 - The authors of the [Pairwise Reward Model for LLMs paper](https://arxiv.org/abs/2306.02561) for the powerful general-purpose reward model
 - The HuggingFace team for the DPO implementation under [The Alignment Handbook](https://github.com/huggingface/alignment-handbook)
 ## The Snorkel AI Team
-Hoang Tran, Chris Glaze, Braden Hancock, Alex Ratner

   4. Use this LLM as the base model for the next iteration, repeating three times in total.
 This overview provides a high-level summary of our approach.
+We plan to release more detailed results and findings in the coming weeks on the [Snorkel blog](https://snorkel.ai/blog/).
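The iterative loop summarized above can be sketched in a few lines. This is a minimal illustration, not the team's training code: `generate`, `score`, and `train` are hypothetical callables standing in for response sampling, the reward model, and a DPO update, and pairing the top-ranked response (chosen) against the bottom-ranked one (rejected) is an assumption made for the example.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

Policy = TypeVar("Policy")


def build_preference_pair(
    prompt: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
) -> Dict[str, str]:
    """Rank candidates with a reward function and keep the two extremes.

    Assumption: the best-scored response becomes "chosen" and the
    worst-scored response becomes "rejected".
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(candidates, key=lambda c: score(prompt, c), reverse=True)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}


def run_iterations(
    policy: Policy,
    prompts: Sequence[str],
    generate: Callable[[Policy, str], List[str]],
    score: Callable[[str, str], float],
    train: Callable[[Policy, List[Dict[str, str]]], Policy],
    rounds: int = 3,
) -> Policy:
    """Repeat generate -> rank -> train, feeding each new policy into the next round."""
    for _ in range(rounds):
        pairs = [
            build_preference_pair(p, generate(policy, p), score) for p in prompts
        ]
        policy = train(policy, pairs)  # hypothetical DPO update step
    return policy


# Toy stand-ins (hypothetical): the "policy" is just a version counter.
def toy_generate(policy: int, prompt: str) -> List[str]:
    return [f"{prompt}-v{policy}-" + "x" * k for k in range(1, 4)]


def toy_score(prompt: str, response: str) -> float:
    return float(len(response))  # pretend longer responses are preferred


def toy_train(policy: int, pairs: List[Dict[str, str]]) -> int:
    return policy + 1
```

With these stand-ins, `run_iterations(0, ["q"], toy_generate, toy_score, toy_train)` runs three rounds and returns the third-generation policy. In a real setup, `train` would wrap a DPO trainer and `score` would wrap the reward model.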
 #### Key Premises:
+- **Specialization Requirement**: For most enterprise use cases, using LLMs "off-the-shelf" falls short of production quality, necessitating additional fine-tuning and alignment.
 - **Ease of Model Building**: Creating ranking/scoring/classification models is simpler than developing high-quality, manually annotated datasets for long-form responses.
+- **Programmatic Alignment**: Using smaller but specialized teacher models (reward models) can incrementally align LLMs towards specific axes. We call this **Programmatic Alignment** - capturing domain knowledge in programmatic forms that can be used to guide LLM improvement.
 #### Contemporary Work and Acknowledgements:
 #### Applications:
 Unlike our customers, who have very specific use cases to align LLMs to,
 With this demonstration, we focus on the general approach of programmatic alignment.
 For interest in building your **specialized internal reward models
+that reflect your enterprise's needs**, please contact the Snorkel AI team or consider attending our
 [**Enterprise LLM Summit: Building GenAI with Your Data on January 25, 2024**](https://snorkel.ai/event/enterprise-llm-summit/)
+to learn more about "Programmatically scaling human preferences and alignment in GenAI".
 #### Result:
 On [**Alpaca-Eval 2.0**](https://tatsu-lab.github.io/alpaca_eval/):
 After applying the above methodology:
 - This model scored **30.2** - ranked 3rd and the highest for an open-source base model at the time of publication.
 - With PairRM applied at inference time (generating 16 responses and submitting the one PairRM scores highest), the model scored **34.86** - ranked 2nd.
+The best model on the leaderboard is "gpt-4-turbo", which is also the judge of optimal responses.
 We recognize that the Alpaca-Eval 2.0 benchmark does not entirely capture the full range of capabilities and performances of LLMs.
 However, in our current work, where the goal is to align with general "human preferences," Alpaca-Eval 2.0 serves as a suitable and representative benchmark.
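The second result, where 16 responses are generated and the one PairRM scores highest is submitted, is a form of best-of-n selection. The sketch below shows one way to pick a winner with a pairwise reward model by counting pairwise wins. It is an illustration only: `prefer` is a hypothetical callable returning the probability that the first response beats the second, and the win-counting aggregation is an assumption, not necessarily how PairRM aggregates comparisons.

```python
from typing import Callable, List, Sequence


def best_of_n(
    prompt: str,
    candidates: Sequence[str],
    prefer: Callable[[str, str, str], float],
) -> str:
    """Return the candidate that wins the most pairwise comparisons.

    `prefer(prompt, a, b)` is assumed to return P(a is better than b).
    Ties in win count are broken in favor of the earliest candidate.
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    wins: List[int] = [0] * len(candidates)
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            if prefer(prompt, candidates[i], candidates[j]) >= 0.5:
                wins[i] += 1
            else:
                wins[j] += 1
    best_index = max(range(len(candidates)), key=wins.__getitem__)
    return candidates[best_index]


# Toy comparator (hypothetical): pretend the longer response is always better.
def toy_prefer(prompt: str, a: str, b: str) -> float:
    return 1.0 if len(a) > len(b) else 0.0
```

For n = 16 this makes 120 pairwise comparisons per prompt, so the quality gain comes at a noticeable inference cost on top of generating 16 responses.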
 ## Limitations:
 The model is a quick demonstration that LLMs can be programmatically aligned using smaller specialized reward models.
+It does not have any moderation mechanisms.
+We look forward to continuing to engage with the research community and our customers to explore optimal methods for getting models to respect guardrails,
+allowing for deployment in environments requiring moderated outputs.
+## Acknowledgments:
 - The Mistral AI Team for developing and releasing the advanced Mistral-7B-Instruct-v0.2 model.
 - The authors of the [Direct Preference Optimization paper](https://arxiv.org/abs/2305.18290) for the innovative approach
 - The authors of the [Pairwise Reward Model for LLMs paper](https://arxiv.org/abs/2306.02561) for the powerful general-purpose reward model
 - The HuggingFace team for the DPO implementation under [The Alignment Handbook](https://github.com/huggingface/alignment-handbook)
+- We would also like to acknowledge contemporary work published on arXiv a few days ago by Meta & NYU (Yuan et al.) in a paper called [Self-Rewarding Language Models](https://arxiv.org/abs/2401.10020),
+which proposes a similar approach for creating alignment pairs from a larger set of candidate responses, but using the LLM as the reward model.
+While this may work for general-purpose models, our experience has shown that task-specific reward models guided by SMEs are necessary for most
+enterprise applications of LLMs for specific use cases, which is why we focus on the use of external reward models.
 ## The Snorkel AI Team
+Hoang Tran, Chris Glaze, Braden Hancock