Commit ccbadf0 by viethoangtranduong (parent: b6c497a): Update README.md

---
license: apache-2.0
datasets:
- openbmb/UltraFeedback
---

Original post: [Snorkel link]

#### Dataset and Process:
- **Dataset**: ONLY the prompts from [UltraFeedback](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized); **no external LLM responses used**.
- **Process** (a minimal sketch of one iteration follows the list):
1. Generate several candidate responses for each prompt with the current LLM.
2. Apply [PairRM](https://huggingface.co/llm-blender/PairRM) for response reranking.
3. Update the LLM by applying Direct Preference Optimization (DPO) on the top (chosen) and bottom (rejected) responses.
4. Use this LLM as the base model for the next iteration, repeating three times in total.
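
To make the loop concrete, here is a minimal, illustrative sketch of one iteration using off-the-shelf tooling: candidate generation with `transformers`, reranking with `llm-blender`'s PairRM ranker, and preference tuning with `trl`'s `DPOTrainer`. The prompt subset, number of candidates, sampling settings, and trainer hyperparameters below are placeholder assumptions, not the exact recipe used for this model.

```python
import torch
import llm_blender
from datasets import Dataset, load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

BASE = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")

# 1. Take prompts only (no responses) from UltraFeedback and sample several
#    candidate completions per prompt from the current model.
prompts = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")["prompt"][:100]

def generate_candidates(prompt, n=5):
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True, return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(
        inputs, do_sample=True, temperature=0.7, top_p=0.9,
        max_new_tokens=512, num_return_sequences=n,
        pad_token_id=tokenizer.eos_token_id,
    )
    return [tokenizer.decode(o[inputs.shape[1]:], skip_special_tokens=True) for o in outputs]

candidates = [generate_candidates(p) for p in prompts]

# 2. Rerank each prompt's candidates with PairRM (rank 1 = best).
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")
ranks = blender.rank(prompts, candidates)

# 3. Pair the top-ranked (chosen) and bottom-ranked (rejected) responses for DPO.
pairs = Dataset.from_dict({
    "prompt":   prompts,
    "chosen":   [c[list(r).index(min(r))] for c, r in zip(candidates, ranks)],
    "rejected": [c[list(r).index(max(r))] for c, r in zip(candidates, ranks)],
})

# 4. One round of DPO; the tuned model becomes the base for the next iteration.
#    (`processing_class` is named `tokenizer` in older trl releases.)
config = DPOConfig(output_dir="dpo-iter-1", beta=0.1, num_train_epochs=1,
                   per_device_train_batch_size=1, gradient_accumulation_steps=8)
DPOTrainer(model=model, args=config, train_dataset=pairs, processing_class=tokenizer).train()
```

In the full procedure this loop runs three times, with each round generating candidates from the model produced by the previous round.
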
This overview provides a high-level summary of our approach.
We plan to release more detailed results and findings in the coming weeks on the [Snorkel blogs](https://snorkel.ai/blog/).

#### Key Premises:
- **Specialization Requirement**: In enterprises, you will have very specific, advanced alignment axes that your LLMs are not yet aware of.

#### Applications:
Unlike our customers, who have very specific use cases to align LLMs to,
the AlpacaEval 2.0 leaderboard measures the ability of LLMs to follow general user instructions.
Thus, for this demonstration, we use a general-purpose reward model - the performant [PairRM model](https://huggingface.co/llm-blender/PairRM).
We use the [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model as our base LLM.

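Since PairRM serves here as a general-purpose reward model, a quick way to see what it provides is its pairwise comparison interface. The sketch below follows `llm-blender`'s documented usage; the example instruction and candidate answers are made up.

```python
import llm_blender

# Load the PairRM ranker used throughout this card.
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")

instructions = ["Summarize what Direct Preference Optimization does in one sentence."]
candidates_a = ["DPO fine-tunes a model directly on chosen/rejected response pairs, "
                "raising the likelihood of preferred answers without a separate reward model."]
candidates_b = ["It is a way to train models."]

# compare() returns, per instruction, whether candidate A is preferred over candidate B.
print(blender.compare(instructions, candidates_a, candidates_b))
```
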
With this demonstration, we focus on the general approach of programmatic alignment.