ARahul2003 committed
Commit df86f11 (Parent(s): 2d50cb3)
Update README.md
README.md CHANGED
@@ -21,7 +21,7 @@ This is a [TRL language model](https://github.com/huggingface/trl). It has been
 is a method to align models with a particular kind of data. RLHF creates a latent reward model using human feedback and fine-tunes
 a model using Proximal Policy Optimization. RLAIF, on the other hand, replaces human feedback with a high-performance AI agent. The model
 has been fine-tuned on the [Social Reasoning Dataset](https://huggingface.co/datasets/ProlificAI/social-reasoning-rlhf/viewer/default/train?p=38&row=3816) by
-ProlificAI for 191 steps and 1 epoch using the Proximal Policy Optimisation (PPO) algorithm. The [Roberta hate
+ProlificAI for 191 steps and 1 epoch using the Proximal Policy Optimisation (PPO) algorithm. The [RoBERTa hate text detection](https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target)
 model was used as the Proximal Policy Optimisation (PPO) reward model.
 
 The power of this model lies in its size: it is barely 500 MB and performs well for its size. The intended use of this model is conversation, text generation, or context-based Q&A.
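
For context on the training recipe the updated README describes, below is a minimal sketch of that kind of PPO loop, written against TRL's legacy `PPOTrainer` API (trl < 0.8), with rewards taken from the `facebook/roberta-hate-speech-dynabench-r4-target` classifier linked in the diff. The base model (`gpt2`), the dataset column name, the generation settings, and the hyperparameters are illustrative assumptions, not values recovered from this commit.

```python
# Hedged sketch: PPO fine-tuning with a RoBERTa hate-speech classifier as the reward model.
# Assumes the legacy trl PPOTrainer API (trl < 0.8); model name, dataset column names,
# and hyperparameters are placeholders, not the values used for this repository.
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(model_name="gpt2", learning_rate=1.41e-5, batch_size=16, mini_batch_size=4)

tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)

# Prompts from the social-reasoning dataset; using the "question" column is an assumption.
dataset = load_dataset("ProlificAI/social-reasoning-rlhf", split="train")

def tokenize(sample):
    sample["input_ids"] = tokenizer.encode(sample["question"])[:64]
    sample["query"] = tokenizer.decode(sample["input_ids"])
    return sample

dataset = dataset.map(tokenize)
dataset.set_format(type="torch")

ppo_trainer = PPOTrainer(
    config, model, ref_model, tokenizer, dataset=dataset,
    data_collator=lambda data: {k: [d[k] for d in data] for k in data[0]},
)

# Reward model: the RoBERTa hate-speech detector used as the PPO reward signal.
reward_name = "facebook/roberta-hate-speech-dynabench-r4-target"
reward_tokenizer = AutoTokenizer.from_pretrained(reward_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_name)

generation_kwargs = {"do_sample": True, "top_k": 0.0, "top_p": 1.0,
                     "pad_token_id": tokenizer.eos_token_id, "max_new_tokens": 32}

for batch in ppo_trainer.dataloader:
    query_tensors = batch["input_ids"]

    # Generate a response for each prompt and keep only the newly generated tokens.
    response_tensors = []
    for query in query_tensors:
        response = ppo_trainer.generate(query, **generation_kwargs)
        response_tensors.append(response.squeeze()[query.shape[0]:])
    batch["response"] = [tokenizer.decode(r) for r in response_tensors]

    # Score query + response with the classifier; a higher "nothate" logit means a higher reward.
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    inputs = reward_tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = reward_model(**inputs).logits
    rewards = [logit for logit in logits[:, 0].float()]

    # One PPO optimisation step against the frozen reference model.
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)
```

The reward here is the classifier's "nothate" logit, so responses the detector considers non-hateful receive higher reward; TRL's PPO step then updates the policy while a KL penalty against the frozen reference model keeps it close to the original language model.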