viethoangtranduong committed
Commit: df53ac2
Parent(s): e6e8d18

Update README.md

README.md CHANGED
@@ -6,17 +6,17 @@ datasets:
 
 Original post: [Snorkel link]
 
-#### Dataset
-
+#### Dataset:
+ONLY the prompts from [UltraFeedback](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized); **no external LLM responses used**.
 
-
+#### Methodology:
 1. Generate five response variations for each prompt in a subset of 20,000 prompts using the current LLM; to start, we used [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2).
 2. Apply [PairRM](https://huggingface.co/llm-blender/PairRM) to rerank the five responses.
 3. Update the LLM by applying Direct Preference Optimization (DPO) to the top-ranked (chosen) and bottom-ranked (rejected) responses.
 4. Use the updated LLM as the base model for the next iteration, repeating the process three times in total (minimal sketches of the generation/reranking and DPO steps follow the diff).
 
 This overview provides a high-level summary of our approach.
-We plan to release more detailed results and findings in the coming weeks on
+We plan to release more detailed results and findings in the coming weeks on the [Snorkel blog](https://snorkel.ai/blog/).
 
 #### Key Premises:
 - **Specialization Requirement**: Enterprises have very specific, advanced alignment axes that off-the-shelf LLMs are not yet aware of.
@@ -38,8 +38,8 @@ We use the [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7
 With this demonstration, we focus on the general approach of programmatic alignment.
 
 If you are interested in building your **specialized internal reward models
-that reflect your enterprises' needs**, please contact the Snorkel team or consider attending
-**
+that reflect your enterprise's needs**, please contact the Snorkel team or consider attending our
+[**Enterprise LLM Summit: Building GenAI with Your Data on January 25, 2024**](https://snorkel.ai/event/enterprise-llm-summit/)
 to learn more about "Programmatically scale human preferences and alignment in GenAI".
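For readers who want to see what steps 1-2 of the methodology look like in practice, below is a minimal sketch of sampling five responses per prompt and reranking them with PairRM to form a chosen/rejected pair. It is not the released training script: the decoding settings (temperature, max_new_tokens) and helper names (`generate_candidates`, `build_preference_pair`) are placeholders, and the `llm_blender.Blender` / `loadranker` / `rank` calls follow the usage shown on the PairRM model card.

```python
# Sketch of steps 1-2: sample five candidate responses per prompt, then rank them
# with PairRM and keep the best (chosen) and worst (rejected) for DPO.
import torch
import llm_blender
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # base model for the first iteration

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # pairwise reward model used for reranking


def generate_candidates(prompt: str, n: int = 5) -> list[str]:
    """Sample n response variations for a single prompt (placeholder decoding settings)."""
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(
        input_ids,
        do_sample=True,            # sampling produces diverse variations
        temperature=0.8,           # placeholder value, not the released setting
        max_new_tokens=512,
        num_return_sequences=n,
        pad_token_id=tokenizer.eos_token_id,
    )
    return [
        tokenizer.decode(out[input_ids.shape[1]:], skip_special_tokens=True)
        for out in outputs
    ]


def build_preference_pair(prompt: str) -> dict:
    """Rerank the candidates with PairRM; rank 1 is best, rank n is worst."""
    candidates = generate_candidates(prompt)
    ranks = blender.rank([prompt], [candidates])[0]
    order = sorted(range(len(candidates)), key=lambda i: ranks[i])
    return {
        "prompt": prompt,
        "chosen": candidates[order[0]],
        "rejected": candidates[order[-1]],
    }
```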
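And a similarly hedged sketch of steps 3-4: a DPO update on the preference pairs, repeated three times with each round seeded by the previous round's checkpoint. It assumes a recent version of `trl` (where `DPOConfig` carries the DPO options and `DPOTrainer` accepts `processing_class`); the hyperparameters (`beta`, learning rate, batch size) and `prompts_subset` are placeholders rather than the values used for the released model, and `build_preference_pair` comes from the sketch above.

```python
# Sketch of steps 3-4: fine-tune with DPO on the chosen/rejected pairs, then reuse
# the resulting checkpoint as the base model for the next round (three rounds total).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer


def dpo_round(base_model_name: str, pairs: list[dict], output_dir: str) -> str:
    """One DPO iteration over prompt/chosen/rejected pairs; returns the new checkpoint path."""
    model = AutoModelForCausalLM.from_pretrained(base_model_name)
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)

    # DPOTrainer expects a dataset with "prompt", "chosen", and "rejected" columns.
    train_dataset = Dataset.from_list(pairs)

    args = DPOConfig(
        output_dir=output_dir,
        beta=0.1,                         # placeholder DPO temperature
        per_device_train_batch_size=2,    # placeholder
        num_train_epochs=1,               # placeholder
        learning_rate=5e-7,               # placeholder
    )
    trainer = DPOTrainer(
        model=model,                      # reference model is created automatically when not given
        args=args,
        train_dataset=train_dataset,
        processing_class=tokenizer,
    )
    trainer.train()
    trainer.save_model(output_dir)
    return output_dir


# Outer loop (step 4): each round is seeded by the previous round's checkpoint.
prompts_subset = ["..."]  # placeholder for the 20,000 UltraFeedback prompts
base = "mistralai/Mistral-7B-Instruct-v0.2"
for i in range(3):
    pairs = [build_preference_pair(p) for p in prompts_subset]  # from the sketch above
    base = dpo_round(base, pairs, output_dir=f"dpo-iter-{i + 1}")
```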