amirabdullah19852020/interpreting_reward_models


---
license: mit
datasets:
- unalignment/toxic-dpo-v0.2
- Anthropic/hh-rlhf
- stanfordnlp/imdb
language:
- en
---

We train a collection of models with RLHF on the datasets listed above. We use DPO for Anthropic/hh-rlhf and unalignment/toxic-dpo-v0.2, and we use PPO to train a model that completes IMDB review prefixes with positive sentiment.
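For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-pair DPO loss, written in plain Python. It is not the training code used for these models. The function names and the `beta=0.1` default are assumptions chosen for illustration, and the inputs are assumed to be summed token log-probabilities of the chosen and rejected responses under the trained policy and a frozen reference model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, not taken from this model card
) -> float:
    """DPO loss for a single (chosen, rejected) preference pair."""
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # The loss shrinks as the policy favors the chosen response
    # more strongly than the reference model does.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
    # Policy shifted toward the chosen response: loss is lower.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

When the policy matches the reference, the margin is zero and the loss is log(2). Raising the chosen response's log-probability relative to the rejected one lowers the loss.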