bradenjh committed
Commit
5db967d
1 Parent(s): 30b5d17

Update README.md

Files changed (1): README.md +2 -2
README.md CHANGED
@@ -62,8 +62,8 @@ allowing for deployment in environments requiring moderated outputs.
  - The author of the [Direct Preference Optimization paper](https://arxiv.org/abs/2305.18290) for the innovative approach
  - The author of the [Pairwise Reward Model for LLMs paper](https://arxiv.org/abs/2306.02561) for the powerful general-purpose reward model
  - The HuggingFace team for the DPO implementation under [The Alignment Handbook](https://github.com/huggingface/alignment-handbook)
- - We would also like to acknowledge contemporary work published on arXiv a few days ago by Meta & NYU (Yuan, et al) in a paper called [Self-Rewarding Language Models](https://arxiv.org/abs/2401.10020),
- which proposes a similar approach for creating alignment pairs from a larger set of candidate responses, but using the LLM as the reward model.
+ - We would also like to acknowledge contemporary work published independently on arXiv on 2024-01-18 by Meta & NYU (Yuan, et al) in a paper called [Self-Rewarding Language Models](https://arxiv.org/abs/2401.10020),
+ which proposes a similar general approach for creating alignment pairs from a larger set of candidate responses, but using the LLM as the reward model.
  While this may work for general-purpose models, our experience has shown that task-specific reward models guided by SMEs are necessary for most
  enterprise applications of LLMs for specific use cases, which is why we focus on the use of external reward models.
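
For context on the pattern the edited paragraph contrasts, here is a minimal, hypothetical Python sketch of creating a DPO-style alignment pair from a set of candidate responses by ranking them with an external reward model. The `score` callable is an assumed stand-in for whatever reward model is used (a pairwise reward model such as the one cited would compare responses head-to-head rather than emit scalar scores); this is an illustration of the general approach, not this repository's actual code.

```python
# Hypothetical sketch: build a preference pair for DPO by ranking N candidate
# responses with an external reward model, rather than using the LLM itself
# as the judge (the Self-Rewarding Language Models approach). All names here
# are illustrative placeholders.
from typing import Callable, Dict, List


def build_preference_pair(
    prompt: str,
    candidates: List[str],
    score: Callable[[str, str], float],  # external reward model: (prompt, response) -> scalar
) -> Dict[str, str]:
    """Rank candidates by reward and keep the best/worst as chosen/rejected."""
    ranked = sorted(candidates, key=lambda r: score(prompt, r), reverse=True)
    return {
        "prompt": prompt,
        "chosen": ranked[0],     # highest-scoring response
        "rejected": ranked[-1],  # lowest-scoring response
    }
```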