weqweasdas committed
Commit
e8adab3
1 Parent(s): 102b5ac

Update README.md

Files changed (1)
  1. README.md +10 -1
README.md CHANGED
@@ -86,7 +86,7 @@ The reward model ranks 2nd in the [RewardBench](https://huggingface.co/spaces/al
 
 <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 
-To be added. The reward model may be readily used for rejection sampling finetuning (
+The repo was part of the iterative rejection sampling fine-tuning and iterative DPO. If you find the content of this repo useful in your work, please consider citing it as follows:
 
 
 ```
@@ -96,6 +96,15 @@ To be added. The reward model may be readily used for rejection sampling finetun
   journal={arXiv preprint arXiv:2304.06767},
   year={2023}
 }
+
+@misc{xiong2024iterative,
+  title={Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint},
+  author={Wei Xiong and Hanze Dong and Chenlu Ye and Ziqi Wang and Han Zhong and Heng Ji and Nan Jiang and Tong Zhang},
+  year={2024},
+  eprint={2312.11456},
+  archivePrefix={arXiv},
+  primaryClass={cs.LG}
+}
 ```
 