ARahul2003 committed on
Commit 2d50cb3
1 Parent(s): 2946e2f

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -20,14 +20,14 @@ This is a [TRL language model](https://github.com/huggingface/trl). It has been
  Reinforcement Learning with Artificial Intelligence Feedback technique (RLAIF). Reinforcement Learning with Human Feedback (RLHF)
  is a method to align models with a particular kind of data. RLHF creates a latent reward model using human feedback and finetunes
  a model using Proximal Policy Optimization. RLAIF on the other hand replaces human feedback with a high-performance AI agent. The model
- has been fine-tuned on the [Social Reasoning Dataset] (https://huggingface.co/datasets/ProlificAI/social-reasoning-rlhf/viewer/default/train?p=38&row=3816) by
- ProlificAI for 191 steps and 1 epoch using the Proximal Policy Optimisation (PPO) algorithm. The [Roberta hate speech recognition] (https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target)
+ has been fine-tuned on the [Social Reasoning Dataset](https://huggingface.co/datasets/ProlificAI/social-reasoning-rlhf/viewer/default/train?p=38&row=3816) by
+ ProlificAI for 191 steps and 1 epoch using the Proximal Policy Optimisation (PPO) algorithm. The [Roberta hate speech recognition](https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target)
  model was used as the Proximal Policy Optimisation (PPO) reward model.

  The power of this model lies in its size; it is barely 500 MBs in size and performs well given its size. The intended use of this model should be conversation, text generation, or context-based Q&A.
  This model might not perform well on tasks like mathematics, sciences, coding, etc. It might hallucinate on such tasks. After quantization, this model could be easily run on edge devices like smartphones and microprocessors.

- The training log of the model can be found in this [weights and biases] (https://wandb.ai/tnarahul/trl/runs/nk30wukt/overview?workspace=user-tnarahul) page.
+ The training log of the model can be found in this [weights and biases](https://wandb.ai/tnarahul/trl/runs/nk30wukt/overview?workspace=user-tnarahul) page.

  Note: This model is a fine-tuned version of [LaMini Flan T5 248M](https://huggingface.co/MBZUAI/LaMini-Flan-T5-248M), which in turn is a fine-tuned version of the Flan T5 model released by Google. The Flan T5 follows the encoder-decoder architecture, unlike other GPT-like models that are decoder-only.
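
For readers who want to see what the training setup described in the README looks like in code, here is a minimal sketch of RLAIF-style PPO fine-tuning with TRL, assuming the classic `PPOTrainer` API (trl releases before 0.12). The prompt, batch size, learning rate, and the mapping from the hate-speech classifier's output to a scalar reward are illustrative assumptions, not the exact settings behind this commit.

```python
# Sketch of RLAIF-style PPO fine-tuning with TRL's classic PPOTrainer API (trl < 0.12).
# Hyperparameters and the reward mapping below are illustrative assumptions.
import torch
from transformers import AutoTokenizer, pipeline
from trl import AutoModelForSeq2SeqLMWithValueHead, PPOConfig, PPOTrainer

base = "MBZUAI/LaMini-Flan-T5-248M"  # encoder-decoder base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(base)
ref_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(base)  # frozen reference for the KL penalty

# AI feedback: the hate-speech classifier stands in for a human preference labeller.
reward_pipe = pipeline(
    "text-classification",
    model="facebook/roberta-hate-speech-dynabench-r4-target",
)

config = PPOConfig(model_name=base, learning_rate=1.41e-5,
                   batch_size=1, mini_batch_size=1)  # assumed values
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# One PPO step on a single toy prompt; in practice you would loop over
# prompts from ProlificAI/social-reasoning-rlhf for the full 191 steps.
prompt = "Why might someone apologise even when they are not at fault?"
query = tokenizer(prompt, return_tensors="pt").input_ids[0]
response = ppo_trainer.generate(query, max_new_tokens=64, do_sample=True)[0]
text = tokenizer.decode(response, skip_special_tokens=True)

# Reward: the classifier's probability of the "nothate" label (an assumption).
scores = reward_pipe(text, top_k=None)
reward = torch.tensor(next(s["score"] for s in scores if s["label"] == "nothate"))

stats = ppo_trainer.step([query], [response], [reward])
```

The frozen reference model supplies the KL penalty that keeps the fine-tuned policy close to the original LaMini checkpoint, the usual PPO safeguard against reward hacking.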
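
The README also mentions quantised deployment on edge devices. One way to try that on CPU is PyTorch's post-training dynamic quantisation, sketched below; the checkpoint id shown is the base model so the snippet is self-contained, so substitute the repo id of this fine-tuned model, and expect the size and latency gains to vary by device.

```python
# Sketch: int8 dynamic quantisation of the seq2seq model for CPU/edge inference.
# The checkpoint id is the base model; substitute the fine-tuned repo id.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "MBZUAI/LaMini-Flan-T5-248M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id).eval()

# Replace the Linear layers with int8 dynamically-quantised versions.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

prompt = "In one sentence, how should I respond to a friend who missed my birthday?"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = quantized.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```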