metadata
language:
  - en
tags:
  - deepspeed
  - chatgpt
  - pytorch
  - opt
  - reward-model
license: apache-2.0
datasets:
  - Dahoas/full-hh-rlhf
  - Dahoas/synthetic-instruct-gptj-pairwise
  - yitingxie/rlhf-reward-datasets
  - openai/webgpt_comparisons
  - stanfordnlp/SHP

ChatGPT OPT 350M DeepSpeed Reward Model

chat-opt-350m-reward-deepspeed

This model is the output of the second step of a pipeline modified from the traditional ChatGPT training process. That process has three steps: supervised fine-tuning, reward model fine-tuning, and reinforcement learning from human feedback (RLHF), the last of which produces the actor, actor EMA and critic models.

The main goal of this project was to make proper use of existing frameworks that minimise training costs, and so improve both the feasibility and the usability of ChatGPT-like models. The framework selected here is DeepSpeed, which was instrumental in developing this model: it made it possible to train the ChatGPT-like model on much larger datasets with a reasonable number of GPUs, and consequently to achieve significantly better performance.

This model follows the ChatGPT blog post, the InstructGPT paper and, especially, the Microsoft DeepSpeed-Chat blog.

Our Training Methodology and Speedup Recipes

Training involves a single Python run of DeepSpeed-Chat, which launches the whole three-step pipeline and saves all models along the way:

python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_node

This pipeline can be broken up into three key steps:

  1. Supervised fine-tuning (SFT): See here.
  2. Reward Model (RM) fine-tuning: In parallel with, or after, supervised fine-tuning, the RM fine-tuning step takes the pre-trained model (or the model trained in step 1, if you so choose) and trains it with small learning rates on the dataset of comparisons (accepted and rejected responses).
  3. Reinforcement learning from human feedback (RLHF) fine-tuning: Once the prior two steps are complete, the final RLHF fine-tuning can begin. It takes the fine-tuned model from step 1 and the reward model from step 2 and trains them on the dataset of comparisons, producing both an actor and a critic. I also generate an actor model with an exponential moving average (EMA), which is known to improve conversational response quality.
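The EMA actor mentioned in step 3 can be pictured with a minimal sketch. It assumes the usual exponential-moving-average rule (new EMA = beta * old EMA + (1 - beta) * current weight). The function name `ema_update`, the dictionary-of-scalars stand-in for model weights and the `beta` values are all hypothetical and chosen for illustration; they are not taken from the DeepSpeed-Chat code.

```python
def ema_update(ema_params, params, beta=0.99):
    """Blend the current actor weights into the running EMA copy, in place.

    Both arguments map parameter names to floats (a toy stand-in for tensors).
    beta is a hypothetical decay factor; a value closer to 1 gives a smoother,
    slower-moving EMA copy.
    """
    for name, value in params.items():
        ema_params[name] = beta * ema_params[name] + (1.0 - beta) * value
    return ema_params


# Toy example: one scalar "weight".
actor = {"w": 1.0}
actor_ema = dict(actor)      # the EMA copy starts equal to the actor
actor["w"] = 2.0             # pretend an RLHF update moved the weight
ema_update(actor_ema, actor, beta=0.9)
print(actor_ema["w"])        # lies between the old value 1.0 and the new value 2.0
```

The EMA copy trails the actor rather than following every update, which is the property that smooths out noisy RLHF steps.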

For the details behind each step, follow the respective link and see the model card there.
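As a rough illustration of what training on accepted and rejected responses means in step 2, the sketch below implements the pairwise ranking loss described in the InstructGPT paper, -log(sigmoid(r_accepted - r_rejected)), averaged over pairs. It is an assumption that this matches the loss used to train this particular model; the function name `pairwise_reward_loss` and the example scores are hypothetical.

```python
import math


def pairwise_reward_loss(accepted_scores, rejected_scores):
    """Mean of -log(sigmoid(accepted - rejected)) over comparison pairs.

    Each list holds one scalar reward per response; position i of both lists
    refers to the same prompt.
    """
    if len(accepted_scores) != len(rejected_scores) or not accepted_scores:
        raise ValueError("need two non-empty lists of equal length")
    total = 0.0
    for accepted, rejected in zip(accepted_scores, rejected_scores):
        margin = accepted - rejected
        # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow
        # for large negative margins.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(accepted_scores)


# Hypothetical scores: the first call ranks the accepted responses higher.
well_ranked = pairwise_reward_loss([2.0, 1.5], [0.0, -0.5])
badly_ranked = pairwise_reward_loss([0.0, -0.5], [2.0, 1.5])
print(well_ranked < badly_ranked)
```

The loss shrinks as the reward model scores the accepted response further above the rejected one, so minimising it teaches the model to reproduce the human preference ordering.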

Reward Model Configurations

Model Configurations:

| Parameter | Value |
| --- | --- |
| Parameters | 350M |
| Model type | OPT |
| FFN dimensions | 4096 |
| Hidden size | 1024 |
| Max position embeddings | 2048 |
| Attention heads | 16 |
| Hidden layers | 24 |
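To make the numbers in the model configuration concrete, here is a small sketch that derives two quantities from them: the per-head dimension and a rough count of the weights in the transformer blocks. The class name `RewardModelShape` is hypothetical, and the count deliberately ignores biases, layer norms, embeddings and the reward head, so it is an approximation rather than the exact parameter count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardModelShape:
    """Values copied from the model configuration listed above."""
    hidden_size: int = 1024
    ffn_dim: int = 4096
    num_attention_heads: int = 16
    num_hidden_layers: int = 24
    max_position_embeddings: int = 2048

    @property
    def head_dim(self) -> int:
        # The hidden size is split evenly across the attention heads.
        return self.hidden_size // self.num_attention_heads

    @property
    def block_weights(self) -> int:
        # Four square projections (query, key, value, output) per layer.
        attention = 4 * self.hidden_size * self.hidden_size
        # Two feed-forward matrices: hidden -> FFN and FFN -> hidden.
        feed_forward = 2 * self.hidden_size * self.ffn_dim
        return self.num_hidden_layers * (attention + feed_forward)


shape = RewardModelShape()
print(shape.head_dim)       # 64
print(shape.block_weights)  # 301989888
```

Roughly 302M weights sit in the attention and feed-forward matrices; the remainder of the nominal 350M comes from the parts this estimate leaves out.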

Training Configurations:

| Parameter | Value |
| --- | --- |
| Train batch size | 64 |
| Train micro batch size | 8 |
| ZeRO stage | 0 |
| FP16 | True |
| Gradient clipping | 1.0 |
| Dropout | 0.1 |
| Prescale gradients | True |
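The training configuration above maps naturally onto a DeepSpeed config dictionary. The sketch below is one plausible rendering, not the exact file used for this model; dropout is omitted because it is set on the model rather than in the DeepSpeed config. DeepSpeed requires the train batch size to equal micro batch size per GPU times gradient accumulation steps times number of GPUs, so the helper `valid_splits` (a hypothetical name) lists the GPU and accumulation combinations consistent with 64 and 8. The GPU count actually used is not stated in this card.

```python
import json

# Assumed mapping of the listed values onto DeepSpeed config keys.
ds_config = {
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 8,
    "zero_optimization": {"stage": 0},
    "fp16": {"enabled": True},
    "gradient_clipping": 1.0,
    "prescale_gradients": True,
}


def valid_splits(train_batch_size, micro_batch_size):
    """Return (num_gpus, gradient_accumulation_steps) pairs satisfying
    train_batch_size == micro_batch_size * accumulation_steps * num_gpus."""
    if train_batch_size % micro_batch_size != 0:
        return []
    product = train_batch_size // micro_batch_size
    return [(gpus, product // gpus)
            for gpus in range(1, product + 1)
            if product % gpus == 0]


splits = valid_splits(ds_config["train_batch_size"],
                      ds_config["train_micro_batch_size_per_gpu"])
print(json.dumps(ds_config, indent=2))
print(splits)  # [(1, 8), (2, 4), (4, 2), (8, 1)]
```

For example, a single node with 8 GPUs would need no gradient accumulation, while a single GPU would accumulate over 8 micro batches to reach the same effective batch size.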

Acknowledgements

We thank the following papers and open-source repositories, and we especially thank DeepSpeed for its framework.