---
license: mit
datasets:
- unalignment/toxic-dpo-v0.2
- Anthropic/hh-rlhf
- stanfordnlp/imdb
language:
- en
---

We train a collection of models with RLHF-style preference optimization on the datasets above: DPO on Anthropic/hh-rlhf and unalignment/toxic-dpo-v0.2, and PPO to train a model that completes IMDB prefixes with positive sentiment.
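As a rough illustration of the DPO setup, here is a minimal sketch using TRL's `DPOTrainer` on Anthropic/hh-rlhf. The base model (`gpt2`), the hyperparameters, and the prompt-splitting helper are placeholders for illustration, not the actual training recipe; exact argument names also vary across TRL versions.

```python
# Minimal DPO sketch with TRL (argument names differ across TRL versions;
# e.g. newer releases rename `tokenizer` to `processing_class`).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "gpt2"  # placeholder base model, not the one used here
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# hh-rlhf rows carry full "chosen"/"rejected" transcripts, while DPOTrainer
# expects explicit "prompt"/"chosen"/"rejected" fields. This helper splits off
# the shared dialogue prefix, assuming both transcripts agree up to the final
# "Assistant:" turn (which holds for hh-rlhf).
def to_dpo_format(row):
    chosen, rejected = row["chosen"], row["rejected"]
    prompt = chosen[: chosen.rfind("Assistant:") + len("Assistant:")]
    return {
        "prompt": prompt,
        "chosen": chosen[len(prompt):],
        "rejected": rejected[len(prompt):],
    }

dataset = load_dataset("Anthropic/hh-rlhf", split="train").map(to_dpo_format)

config = DPOConfig(output_dir="dpo-hh", beta=0.1, per_device_train_batch_size=2)
trainer = DPOTrainer(model=model, args=config, train_dataset=dataset, tokenizer=tokenizer)
trainer.train()
```

For the PPO run, the reward signal is the positive-sentiment score of each completion. A sketch of one way to compute it, using the `lvwerra/distilbert-imdb` classifier from the TRL sentiment examples (an assumption, not necessarily the reward model used here):

```python
# Score generated completions with an IMDB sentiment classifier; the
# positive-class probability serves as the scalar PPO reward.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb", top_k=None)

def reward_fn(texts):
    rewards = []
    for scores in sentiment(texts):
        positive = next(s["score"] for s in scores if s["label"] == "POSITIVE")
        rewards.append(positive)
    return rewards
```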