hamishivi committed on
Commit
cdb43da
1 Parent(s): 83c767e

Update README.md

Files changed (1)
  1. README.md +2 -1
README.md CHANGED
@@ -21,7 +21,7 @@ Tulu V2.5 is a series of models trained using DPO and PPO starting from the [Tul
  To train this model, we used a 70B RM trained on the UltraFeedback data, and then used a mixture of prompts during PPO training.
  
  For more details, read the paper:
- [Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://link.todo).
+ [Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://arxiv.org/abs/2406.09279).
  
  
  ## .Model description
@@ -82,6 +82,7 @@ If you find Tulu 2.5 is useful in your work, please cite it with:
  title={{Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback}},
  author={{Hamish Ivison and Yizhong Wang and Jiacheng Liu and Ellen Wu and Valentina Pyatkin and Nathan Lambert and Yejin Choi and Noah A. Smith and Hannaneh Hajishirzi}}
  year={2024},
+ eprint={2406.09279},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
  }