Update README.md
README.md CHANGED

@@ -62,8 +62,8 @@ allowing for deployment in environments requiring moderated outputs.
 - The author of the [Direct Preference Optimization paper](https://arxiv.org/abs/2305.18290) for the innovative approach
 - The author of the [Pairwise Reward Model for LLMs paper](https://arxiv.org/abs/2306.02561) for the powerful general-purpose reward model
 - The HuggingFace team for the DPO implementation under [The Alignment Handbook](https://github.com/huggingface/alignment-handbook)
-- We would also like to acknowledge contemporary work published on arXiv
-which proposes a similar approach for creating alignment pairs from a larger set of candidate responses, but using the LLM as the reward model.
+- We would also like to acknowledge contemporary work published independently on arXiv on 2024-01-18 by Meta & NYU (Yuan, et al) in a paper called [Self-Rewarding Language Models](https://arxiv.org/abs/2401.10020),
+which proposes a similar general approach for creating alignment pairs from a larger set of candidate responses, but using the LLM as the reward model.
 While this may work for general-purpose models, our experience has shown that task-specific reward models guided by SMEs are necessary for most
 enterprise applications of LLMs for specific use cases, which is why we focus on the use of external reward models.
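For context on the distinction the updated acknowledgment draws, the following is a minimal, hypothetical sketch of the approach the README describes: scoring sampled candidate responses with an external reward model and keeping the highest- and lowest-scored completions as the chosen/rejected pair for DPO. The checkpoint name, helper functions, and pointwise scoring scheme are illustrative assumptions, not this repository's actual code; the Pairwise Reward Model cited above compares two responses directly rather than scoring each one independently.

```python
# Illustrative sketch only: build DPO alignment pairs by ranking candidate
# completions with an external (task-agnostic here, task-specific in practice)
# reward model, instead of letting the generating LLM judge its own outputs.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Example public reward-model checkpoint; a task-specific, SME-guided reward
# model would be substituted here in an enterprise setting.
RM_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(RM_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(RM_NAME)
reward_model.eval()

def score(prompt: str, response: str) -> float:
    """Return the reward model's scalar score for a (prompt, response) pair."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits[0].item()

def build_dpo_pair(prompt: str, candidates: list[str]) -> dict:
    """Rank candidates and keep the best/worst as the chosen/rejected pair."""
    ranked = sorted(candidates, key=lambda r: score(prompt, r), reverse=True)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}

# Usage: one preference record per prompt, ready for a DPO trainer dataset.
pair = build_dpo_pair(
    "Summarize the refund policy in one sentence.",
    ["Candidate summary A ...", "Candidate summary B ...", "Candidate summary C ..."],
)
```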