A larger model fine-tuned with RLHF, using the reward model from [annaovesnaatatt/reward-model](https://huggingface.co/annaovesnaatatt/reward-model). The reward model was trained on 3000 examples from the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset.
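
This card does not say which objective the reward model was trained with. As an illustration only, here is a minimal sketch of the pairwise logistic (Bradley-Terry) loss that is commonly used for chosen/rejected preference pairs such as those in hh-rlhf. The function name `pairwise_preference_loss` and the example scores are hypothetical and are not taken from the author's code.

```python
import math


def pairwise_preference_loss(chosen_rewards, rejected_rewards):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over preference pairs.

    Hypothetical helper: the inputs are scalar reward scores that a reward
    model assigned to the preferred and the rejected response of each pair.
    """
    if len(chosen_rewards) != len(rejected_rewards):
        raise ValueError("need one rejected score per chosen score")
    if not chosen_rewards:
        raise ValueError("need at least one preference pair")

    total = 0.0
    for r_chosen, r_rejected in zip(chosen_rewards, rejected_rewards):
        margin = r_chosen - r_rejected
        # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(chosen_rewards)


# Made-up scores: the loss is small when the chosen response scores higher.
good_ranking = pairwise_preference_loss([2.0, 1.5], [0.0, -0.5])
bad_ranking = pairwise_preference_loss([0.0, -0.5], [2.0, 1.5])
print(good_ranking < bad_ranking)
```

The loss depends only on the difference between the two scores in each pair, so the reward model is trained to rank responses rather than to hit an absolute score.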