---
license: mit
language:
- en
---

# LM Loss OPT RM

This is a fine-tuned OPT-13B model for reward modelling. Fine-tuning was performed on the full [SLF5K](https://huggingface.co/datasets/JeremyAlain/SLF5K) dataset, following the method presented in the paper [Training Language Models with Language Feedback at Scale](https://arxiv.org/abs/2303.16755).

The main results are summarised in the following table:

| Model       | # Params | Validation Accuracy (%) |
|-------------|----------|-------------------------|
| OPT LM Loss | 13B      | **73.4 +/- 1.9**        |
| OPT LM Loss | 1.3B     | 69.6 +/- 2.0            |
| OPT RM Loss | 13B      | 71.8 +/- 2.0            |

If you use this model, please cite the following paper:

```
@article{scheurer2023training,
  title={Training Language Models with Language Feedback at Scale},
  author={Scheurer, J{\'e}r{\'e}my and Campos, Jon Ander and Korbak, Tomasz and Chan, Jun Shern and Chen, Angelica and Cho, Kyunghyun and Perez, Ethan},
  journal={arXiv preprint arXiv:2303.16755},
  year={2023}
}
```
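
## Usage (sketch)

Below is a minimal sketch of how a checkpoint like this might be loaded with Hugging Face `transformers` and used to score candidate summaries by their log-likelihood under the language model. The repository id, prompt template, and scoring scheme are illustrative assumptions, not details taken from this card or the paper; adapt them to the actual inference setup.

```python
# Hedged sketch: load the checkpoint and rank candidate summaries by
# log-likelihood. Repo id, prompt format, and scoring are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/or/repo-id-of-this-model"  # placeholder, substitute the real id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def sequence_logprob(prompt: str, continuation: str) -> float:
    """Sum of token log-probabilities of `continuation` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probabilities of each token given its preceding context.
    logprobs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the continuation tokens when summing.
    cont_start = prompt_ids.shape[1] - 1
    return token_logprobs[:, cont_start:].sum().item()

post = "TEXT OF A POST TO SUMMARISE ..."  # illustrative input
summaries = ["Candidate summary A.", "Candidate summary B."]
scores = [sequence_logprob(f"Post: {post}\nSummary:", " " + s) for s in summaries]
best = summaries[max(range(len(summaries)), key=lambda i: scores[i])]
print(best)
```

This treats the model as a likelihood-based scorer; if the released checkpoint expects a different prompt format or a dedicated scoring head, the snippet should be adjusted accordingly.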