ARahul2003 committed on
Commit
df86f11
1 Parent(s): 2d50cb3

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -21,7 +21,7 @@ This is a [TRL language model](https://github.com/huggingface/trl). It has been
  is a method to align models with a particular kind of data. RLHF creates a latent reward model using human feedback and finetunes
  a model using Proximal Policy Optimization. RLAIF on the other hand replaces human feedback with a high-performance AI agent. The model
  has been fine-tuned on the [Social Reasoning Dataset](https://huggingface.co/datasets/ProlificAI/social-reasoning-rlhf/viewer/default/train?p=38&row=3816) by
- ProlificAI for 191 steps and 1 epoch using the Proximal Policy Optimisation (PPO) algorithm. The [Roberta hate speech recognition](https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target)
+ ProlificAI for 191 steps and 1 epoch using the Proximal Policy Optimisation (PPO) algorithm. The [Roberta hate text detection](https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target)
  model was used as the Proximal Policy Optimisation (PPO) reward model.
 
  The power of this model lies in its size; it is barely 500 MBs in size and performs well given its size. The intended use of this model should be conversation, text generation, or context-based Q&A.
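The README text above describes using a hate-speech classifier as the PPO reward model. A minimal sketch of how such a classifier's two logits can be collapsed into the scalar reward PPO needs, assuming (as the linked model card suggests, but not confirmed by this commit) that the first logit corresponds to the "not hate" label:

```python
import math

def reward_from_logits(logits):
    """Map a [not-hate, hate] logit pair to a scalar reward in (0, 1).

    Softmax over the two logits; the reward is the probability mass
    assigned to the "not hate" class, so benign generations are
    rewarded and hateful ones penalized during PPO fine-tuning.
    """
    exps = [math.exp(x) for x in logits]
    return exps[0] / sum(exps)

# A response the classifier scores as clearly benign gets a reward near 1:
print(reward_from_logits([2.0, -2.0]))  # ~0.982
```

In a real TRL training loop this value would be computed per generated response and passed to the PPO step; the function name and the label ordering here are illustrative assumptions, not part of the committed README.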