viethoangtranduong committed
Commit 2fdf351
Parent(s): 9602d48

Update README.md

Files changed (1)
  1. README.md +2 -0
README.md CHANGED
@@ -110,6 +110,8 @@ allowing for deployment in environments requiring moderated outputs.
  which proposes a similar general approach for creating alignment pairs from a larger set of candidate responses, but using the LLM as the reward model.
  While this may work for general-purpose models, our experience has shown that task-specific reward models guided by SMEs are necessary for most
  enterprise applications of LLMs for specific use cases, which is why we focus on the use of external reward models.
+ - Also, we would like to acknowledge another concurrent work that takes a similar approach but focuses more on the theoretical aspects of the iterative DPO process: [Iterative Preference Learning from Human Feedback: Bridging Theory and
+ Practice for RLHF under KL-Constraint](https://arxiv.org/pdf/2312.11456.pdf) (Xiong et al., 2024-01-28).

  ### GGUF version
  Snorkel-Mistral-PairRM-DPO GGUF model version: from [andrew-cartwheel](https://huggingface.co/andrew-cartwheel/snorkel-mistral-pairRM-DPO-q8_0.gguf) or [brittlewis12](https://huggingface.co/brittlewis12/Snorkel-Mistral-PairRM-DPO-GGUF).
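For readers unfamiliar with GGUF, below is a minimal sketch of loading one of the linked quantized files with llama-cpp-python. The local filename, context size, and prompt template are assumptions (not part of this commit); check the linked repositories and the main model card before relying on them.

```python
# Minimal sketch, assuming llama-cpp-python is installed and the q8_0 GGUF file
# from one of the repositories linked above has been downloaded locally.
from llama_cpp import Llama

llm = Llama(
    model_path="snorkel-mistral-pairRM-DPO-q8_0.gguf",  # hypothetical local path
    n_ctx=4096,       # context window; adjust to available memory
    n_gpu_layers=-1,  # offload all layers if llama.cpp was built with GPU support
)

# Mistral-style instruct prompt; verify the expected template in the model card.
prompt = "[INST] Give me a short introduction to DPO alignment. [/INST]"
out = llm(prompt, max_tokens=256, temperature=0.7)
print(out["choices"][0]["text"])
```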