Commit ccbadf0 by viethoangtranduong (parent: b6c497a): Update README.md

---
license: apache-2.0
datasets:
- openbmb/UltraFeedback
---

Original post: [Snorkel link]

#### Dataset and Process:
- **Dataset**: ONLY the prompts from [UltraFeedback](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized); **no external LLM responses used**.
- **Process** (a minimal sketch of one iteration follows the list):
1. Generate several candidate responses for each prompt with the current LLM.
2. Apply [PairRM](https://huggingface.co/llm-blender/PairRM) for response reranking.
3. Update the LLM by applying Direct Preference Optimization (DPO) on the top (chosen) and bottom (rejected) responses.
4. Use this LLM as the base model for the next iteration, repeating three times in total.
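
To make the loop concrete, here is a minimal, illustrative sketch of one iteration using off-the-shelf tooling: candidate generation with `transformers`, reranking with `llm-blender`'s PairRM ranker, and preference tuning with `trl`'s `DPOTrainer`. The prompt subset, number of candidates, sampling settings, and trainer hyperparameters below are placeholder assumptions, not the exact recipe used for this model.

```python
import torch
import llm_blender
from datasets import Dataset, load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

BASE = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")

# 1. Take prompts only (no responses) from UltraFeedback and sample several
#    candidate completions per prompt from the current model.
prompts = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")["prompt"][:100]

def generate_candidates(prompt, n=5):
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True, return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(
        inputs, do_sample=True, temperature=0.7, top_p=0.9,
        max_new_tokens=512, num_return_sequences=n,
        pad_token_id=tokenizer.eos_token_id,
    )
    return [tokenizer.decode(o[inputs.shape[1]:], skip_special_tokens=True) for o in outputs]

candidates = [generate_candidates(p) for p in prompts]

# 2. Rerank each prompt's candidates with PairRM (rank 1 = best).
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")
ranks = blender.rank(prompts, candidates)

# 3. Pair the top-ranked (chosen) and bottom-ranked (rejected) responses for DPO.
pairs = Dataset.from_dict({
    "prompt":   prompts,
    "chosen":   [c[list(r).index(min(r))] for c, r in zip(candidates, ranks)],
    "rejected": [c[list(r).index(max(r))] for c, r in zip(candidates, ranks)],
})

# 4. One round of DPO; the tuned model becomes the base for the next iteration.
#    (`processing_class` is named `tokenizer` in older trl releases.)
config = DPOConfig(output_dir="dpo-iter-1", beta=0.1, num_train_epochs=1,
                   per_device_train_batch_size=1, gradient_accumulation_steps=8)
DPOTrainer(model=model, args=config, train_dataset=pairs, processing_class=tokenizer).train()
```

In the full procedure this loop runs three times, with each round generating candidates from the model produced by the previous round.
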
This overview provides a high-level summary of our approach.
We plan to release more detailed results and findings in the coming weeks on the [Snorkel blogs](https://snorkel.ai/blog/).

#### Key Premises:
- **Specialization Requirement**: In enterprises, you will have very specific, advanced alignment axes that your LLMs are not yet aware of.

#### Applications:
Unlike our customers, who have very specific use cases to align LLMs to,
the AlpacaEval 2.0 leaderboard measures the ability of LLMs to follow general user instructions.
Thus, for this demonstration, we use a general-purpose reward model - the performant [PairRM model](https://huggingface.co/llm-blender/PairRM).
We use the [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model as our base LLM.

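Since PairRM serves here as a general-purpose reward model, a quick way to see what it provides is its pairwise comparison interface. The sketch below follows `llm-blender`'s documented usage; the example instruction and candidate answers are made up.

```python
import llm_blender

# Load the PairRM ranker used throughout this card.
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")

instructions = ["Summarize what Direct Preference Optimization does in one sentence."]
candidates_a = ["DPO fine-tunes a model directly on chosen/rejected response pairs, "
                "raising the likelihood of preferred answers without a separate reward model."]
candidates_b = ["It is a way to train models."]

# compare() returns, per instruction, whether candidate A is preferred over candidate B.
print(blender.compare(instructions, candidates_a, candidates_b))
```
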
With this demonstration, we focus on the general approach of programmatic alignment.