viethoangtranduong committed
Commit 2fdf351
1 Parent(s): 9602d48
Update README.md
README.md CHANGED
@@ -110,6 +110,8 @@ allowing for deployment in environments requiring moderated outputs.
 which proposes a similar general approach for creating alignment pairs from a larger set of candidate responses, but using the LLM as the reward model.
 While this may work for general-purpose models, our experience has shown that task-specific reward models guided by SMEs are necessary for most
 enterprise applications of LLMs for specific use cases, which is why we focus on the use of external reward models.
+- Also, we would like to acknowledge another concurrent work that has a similar approach but focuses more on the theoretical aspect of the iterative DPO process: [Iterative Preference Learning from Human Feedback: Bridging Theory and
+Practice for RLHF under KL-Constraint](https://arxiv.org/pdf/2312.11456.pdf) on 2024-01-28 (Xiong, et al).
 
 ### GGUF version
 Snorkel-Mistral-PairRM-DPO GGUF model version: from [andrew-cartwheel](https://huggingface.co/andrew-cartwheel/snorkel-mistral-pairRM-DPO-q8_0.gguf) or [brittlewis12](https://huggingface.co/brittlewis12/Snorkel-Mistral-PairRM-DPO-GGUF).
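As a rough illustration of the external-reward-model approach discussed in the context lines above (ranking several candidate responses and keeping the best and worst as an alignment pair), here is a minimal sketch using PairRM through the llm-blender package; the prompt, candidate texts, and output dictionary layout are placeholders for illustration, not the exact pipeline behind this model.

```python
# Minimal sketch: build one preference pair by ranking candidate responses
# with an external reward model (PairRM via llm-blender). Illustrative only.
import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")

prompt = "Summarize the benefits of iterative DPO in two sentences."  # placeholder
candidates = [
    "Iterative DPO refines the policy over several rounds of ranking and training.",
    "DPO is a loss function.",
    "It re-ranks fresh generations each round so later rounds train on better pairs.",
]

# rank() returns one rank per candidate for each input; lower rank = preferred.
ranks = blender.rank([prompt], [candidates])[0]
chosen = candidates[ranks.argmin()]
rejected = candidates[ranks.argmax()]

# A DPO training example in the usual prompt/chosen/rejected layout (assumed schema).
pair = {"prompt": prompt, "chosen": chosen, "rejected": rejected}
print(pair)
```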
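For the GGUF files linked in the section above, a quantized checkpoint can be run with a llama.cpp-compatible runtime. The snippet below is a small sketch using llama-cpp-python; the local file name, context size, prompt template, and sampling settings are assumptions rather than recommended values.

```python
# Minimal sketch: run the q8_0 GGUF quantization with llama-cpp-python.
# Assumes the .gguf file was downloaded locally from one of the repos above.
from llama_cpp import Llama

llm = Llama(
    model_path="snorkel-mistral-pairRM-DPO-q8_0.gguf",  # local path (assumed name)
    n_ctx=4096,  # context window; adjust to available memory
)

# Mistral-style instruction format; the exact chat template is an assumption here.
prompt = "[INST] Explain what a reward model does in one paragraph. [/INST]"

out = llm(prompt, max_tokens=256, temperature=0.7)
print(out["choices"][0]["text"])
```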