AdamG012 committed
Commit e79fa60
1 Parent(s): 75aa9a0

Update README.md

Files changed (1):
  1. README.md +1 -1
README.md CHANGED
@@ -40,7 +40,7 @@ This pipeline can be broken up into three key steps:
 
 2. **Reward Model (RM) fine-tuning:** See [here](https://huggingface.co/FSALab/fsalab-chat-opt-350m-reward-deepspeed)
 
-3. **Reinforcement-learning from Human feedback (RLHF) fine-tuning:** At the completion of the prior two steps, the final RLHF fine-tuning can be initiated. This involves the collection of both the *fine-tuned model* from step 1 and the *reward model** from step 2 and train them on the data-set with comparisons. This generates both an [actor](https://huggingface.co/FSALab/fsalab-chat-opt-1.3b-rlhf-actor-deepspeed) and [critic](https://huggingface.co/FSALab/fsalab-chat-opt-1.3b-rlhf-actor-deepspeed).
+3. **Reinforcement Learning from Human Feedback (RLHF) fine-tuning:** Once the prior two steps are complete, the final RLHF fine-tuning can begin. It takes the *fine-tuned model* from step 1 and the *reward model* from step 2 and trains them on the comparison dataset, producing both an [actor](https://huggingface.co/FSALab/fsalab-chat-opt-1.3b-rlhf-actor-deepspeed) and a [critic](https://huggingface.co/FSALab/fsalab-chat-opt-1.3b-rlhf-actor-deepspeed) model. I also generate an actor model with an exponential moving average (EMA) of the weights, which is known to improve conversational response quality.
 
 To view the details behind each step, head into their respective links and view the model card there.
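
The EMA actor mentioned in the updated step 3 above maintains a smoothed shadow copy of the actor's weights alongside PPO training. Below is a minimal sketch of that idea in PyTorch; the `update_ema` helper, the `beta` value, and the commented training loop are illustrative assumptions, not the repository's actual implementation.

```python
import copy

import torch


@torch.no_grad()
def update_ema(actor: torch.nn.Module, ema_actor: torch.nn.Module, beta: float = 0.992) -> None:
    """EMA update, parameter by parameter: ema <- beta * ema + (1 - beta) * actor."""
    for ema_p, p in zip(ema_actor.parameters(), actor.parameters()):
        ema_p.mul_(beta).add_(p, alpha=1.0 - beta)


# Hypothetical usage: keep a frozen shadow copy of the actor and refresh it
# after every PPO update; the EMA copy is the one served at inference time.
#
# actor = ...                              # the RLHF actor being trained
# ema_actor = copy.deepcopy(actor).eval()  # shadow copy, never trained directly
# for batch in ppo_batches:
#     ...                                  # PPO update on `actor`
#     update_ema(actor, ema_actor)
```

Because the EMA weights average over many PPO updates, they damp step-to-step noise in the policy, which is the usual rationale for the response-quality improvement the commit message alludes to.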