ARahul2003 committed
Commit df86f11 (Parent(s): 2d50cb3)
Update README.md
README.md CHANGED
@@ -21,7 +21,7 @@ This is a [TRL language model](https://github.com/huggingface/trl). It has been
 is a method to align models with a particular kind of data. RLHF creates a latent reward model using human feedback and fine-tunes
 a model using Proximal Policy Optimization. RLAIF, on the other hand, replaces human feedback with a high-performance AI agent. The model
 has been fine-tuned on the [Social Reasoning Dataset](https://huggingface.co/datasets/ProlificAI/social-reasoning-rlhf/viewer/default/train?p=38&row=3816) by
-ProlificAI for 191 steps and 1 epoch using the Proximal Policy Optimisation (PPO) algorithm. The [Roberta hate
+ProlificAI for 191 steps and 1 epoch using the Proximal Policy Optimisation (PPO) algorithm. The [RoBERTa hate text detection](https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target)
 model was used as the Proximal Policy Optimisation (PPO) reward model.
 
 The power of this model lies in its size: it is barely 500 MB and performs well for its size. The intended use of this model is conversation, text generation, or context-based Q&A.
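
For context on the training recipe the updated README describes, below is a minimal sketch of that kind of PPO loop, written against TRL's legacy `PPOTrainer` API (trl < 0.8), with rewards taken from the `facebook/roberta-hate-speech-dynabench-r4-target` classifier linked in the diff. The base model (`gpt2`), the dataset column name, the generation settings, and the hyperparameters are illustrative assumptions, not values recovered from this commit.

```python
# Hedged sketch: PPO fine-tuning with a RoBERTa hate-speech classifier as the reward model.
# Assumes the legacy trl PPOTrainer API (trl < 0.8); model name, dataset column names,
# and hyperparameters are placeholders, not the values used for this repository.
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(model_name="gpt2", learning_rate=1.41e-5, batch_size=16, mini_batch_size=4)

tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)

# Prompts from the social-reasoning dataset; using the "question" column is an assumption.
dataset = load_dataset("ProlificAI/social-reasoning-rlhf", split="train")

def tokenize(sample):
    sample["input_ids"] = tokenizer.encode(sample["question"])[:64]
    sample["query"] = tokenizer.decode(sample["input_ids"])
    return sample

dataset = dataset.map(tokenize)
dataset.set_format(type="torch")

ppo_trainer = PPOTrainer(
    config, model, ref_model, tokenizer, dataset=dataset,
    data_collator=lambda data: {k: [d[k] for d in data] for k in data[0]},
)

# Reward model: the RoBERTa hate-speech detector used as the PPO reward signal.
reward_name = "facebook/roberta-hate-speech-dynabench-r4-target"
reward_tokenizer = AutoTokenizer.from_pretrained(reward_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_name)

generation_kwargs = {"do_sample": True, "top_k": 0.0, "top_p": 1.0,
                     "pad_token_id": tokenizer.eos_token_id, "max_new_tokens": 32}

for batch in ppo_trainer.dataloader:
    query_tensors = batch["input_ids"]

    # Generate a response for each prompt and keep only the newly generated tokens.
    response_tensors = []
    for query in query_tensors:
        response = ppo_trainer.generate(query, **generation_kwargs)
        response_tensors.append(response.squeeze()[query.shape[0]:])
    batch["response"] = [tokenizer.decode(r) for r in response_tensors]

    # Score query + response with the classifier; a higher "nothate" logit means a higher reward.
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    inputs = reward_tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = reward_model(**inputs).logits
    rewards = [logit for logit in logits[:, 0].float()]

    # One PPO optimisation step against the frozen reference model.
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)
```

The reward here is the classifier's "nothate" logit, so responses the detector considers non-hateful receive higher reward; TRL's PPO step then updates the policy while a KL penalty against the frozen reference model keeps it close to the original language model.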