weqweasdas committed
Commit
e8adab3
1 Parent(s): 102b5ac

Update README.md

Files changed (1)
  1. README.md +10 -1
README.md CHANGED
@@ -86,7 +86,7 @@ The reward model ranks 2nd in the [RewardBench](https://huggingface.co/spaces/al
 
 <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 
-To be added. The reward model may be readily used for rejection sampling finetuning (
+The repo was part of the iterative rejection sampling fine-tuning and iterative DPO. If you find the content of this repo useful in your work, please consider citing it as follows:
 
 
 ```
@@ -96,6 +96,15 @@ To be added. The reward model may be readily used for rejection sampling finetun
   journal={arXiv preprint arXiv:2304.06767},
   year={2023}
 }
+
+@misc{xiong2024iterative,
+  title={Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint},
+  author={Wei Xiong and Hanze Dong and Chenlu Ye and Ziqi Wang and Han Zhong and Heng Ji and Nan Jiang and Tong Zhang},
+  year={2024},
+  eprint={2312.11456},
+  archivePrefix={arXiv},
+  primaryClass={cs.LG}
+}
 ```
 