arun-shankar/GPT2-RLHF-covid

GPT-2 fine-tuned on COVID-19 question-and-answer pairs using Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO).

Training uses PPO from the TRL library and scores each generated response with BERTScore against the expected response, so that the model's answers are aligned towards the expected ones.
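The training code is not shown here, so the following is only a minimal sketch of how a BERTScore-based score per response might be computed. It assumes the `bert-score` package and uses its F1 value as the score; the names `compute_rewards` and `scorer` are hypothetical and chosen for illustration, and the choice of F1 (rather than precision or recall) is an assumption.

```python
from typing import Callable, List, Optional, Sequence

Scorer = Callable[[Sequence[str], Sequence[str]], List[float]]


def _bertscore_f1(candidates: Sequence[str], references: Sequence[str]) -> List[float]:
    """Score candidates against references with BERTScore F1 (assumed metric)."""
    # Imported lazily so the module loads even if bert-score is not installed.
    from bert_score import score

    _, _, f1 = score(list(candidates), list(references), lang="en")
    return [float(x) for x in f1]


def compute_rewards(
    responses: Sequence[str],
    references: Sequence[str],
    scorer: Optional[Scorer] = None,  # hypothetical hook, useful for testing
) -> List[float]:
    """Return one similarity score per (response, reference) pair."""
    if len(responses) != len(references):
        raise ValueError("responses and references must have the same length")
    if scorer is None:
        scorer = _bertscore_f1
    return scorer(responses, references)
```

Passing a custom `scorer` lets the function be exercised without downloading a BERT model; with the default, each score is higher when the generated answer is semantically closer to the expected one.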

You can ask the model any question related to COVID-19 in this format:

question: should i wear a mask at home?\nanswer:
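As a rough sketch, the prompt format above could be built and sent to the model as follows. This assumes that `\n` stands for an actual newline character, that the model is published on the Hugging Face Hub under the ID `arun-shankar/GPT2-RLHF-covid`, and that the `transformers` library is installed; `build_prompt` and `ask` are hypothetical helper names.

```python
def build_prompt(question: str) -> str:
    """Format a question the way the model expects (newline assumed for \\n)."""
    return f"question: {question.strip()}\nanswer:"


def ask(question: str, max_new_tokens: int = 64) -> str:
    """Generate an answer; the model ID below is an assumption."""
    # Imported lazily so build_prompt can be used without transformers.
    from transformers import pipeline

    generator = pipeline("text-generation", model="arun-shankar/GPT2-RLHF-covid")
    prompt = build_prompt(question)
    output = generator(prompt, max_new_tokens=max_new_tokens)
    # The pipeline returns the prompt plus the continuation; keep the continuation.
    return output[0]["generated_text"][len(prompt):].strip()
```

Calling `ask("should i wear a mask at home?")` would download the model on first use and return only the text generated after `answer:`.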

You can also add a control (CTRL) token as a special prefix in front of your prompt to steer the response towards your preference. For example, you can add either a [good] or a [bad] token as a prefix.

[good]question: should i wear a mask at school?\nanswer:

Here, good and bad refer to the quality of the response: a response that is closer to the expected (ground truth) response is good, and one that is not is bad.
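A small sketch of adding the control prefix programmatically is shown below. The helper name `add_control_token` is hypothetical; the only behavior taken from this description is that `[good]` or `[bad]` is placed directly in front of the prompt with no separating space.

```python
ALLOWED_QUALITIES = ("good", "bad")


def add_control_token(prompt: str, quality: str) -> str:
    """Prefix a prompt with the [good] or [bad] control token."""
    if quality not in ALLOWED_QUALITIES:
        raise ValueError(f"quality must be one of {ALLOWED_QUALITIES}, got {quality!r}")
    # No space between the token and the prompt, matching the example format.
    return f"[{quality}]{prompt}"


prompt = add_control_token("question: should i wear a mask at school?\nanswer:", "good")
```

Rejecting any other value avoids silently sending a prefix the model was not trained on.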