---
language:
  - en
tags:
  - deepspeed
  - chatgpt
  - pytorch
  - opt
  - reward-model
license: apache-2.0
---

FSA ChatGPT OPT 350M DeepSpeed Reward Model

fsalab-chat-opt-350m-reward-deepspeed

This model is the output of the second step of a modified version of the traditional ChatGPT training pipeline, a three-step procedure comprising supervised fine-tuning, reward model fine-tuning, and RLHF.

The main goal of this project was to make proper use of existing frameworks that minimise training costs, and thereby improve both the feasibility and usability of ChatGPT-like models. The framework selected here is DeepSpeed, which has been instrumental in the development of this model: through it, the ChatGPT-like model could be trained on much larger datasets with a reasonable number of GPUs, achieving significantly better performance.

This model follows the ChatGPT blog post, the InstructGPT paper, and especially the Microsoft DeepSpeed-Chat blog.

Our Training Methodology and Speedup Recipes

The training process is broken up into three key steps:

  1. Supervised fine-tuning (SFT): See here

  2. Reward Model (RM) fine-tuning: In parallel with, or after, supervised fine-tuning, the RM fine-tuning step takes a pre-trained model (or the model trained in step 1, if you choose) and fine-tunes it with a small learning rate on a dataset of comparisons (accepted vs. rejected responses); a minimal sketch of the pairwise loss typically used here is shown after this list.

  3. Reinforcement Learning from Human Feedback (RLHF) fine-tuning: See here

To view the details behind each step, head to the respective links and view the model cards there.
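
As a rough illustration of the RM fine-tuning objective in step 2, the sketch below shows the pairwise ranking loss commonly used when training on accept/reject comparisons. It is a minimal, hedged sketch rather than the training code of this repository; the scalar per-sequence rewards and the log-sigmoid formulation are assumptions based on standard reward-model practice.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen: torch.Tensor,
                         reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: push the reward of the accepted response
    above the reward of the rejected response for each comparison pair."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with scalar rewards for a batch of four comparison pairs.
chosen = torch.tensor([0.8, 0.1, 0.5, 0.3])
rejected = torch.tensor([0.2, 0.4, 0.1, 0.0])
print(pairwise_reward_loss(chosen, rejected).item())
```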

Reward Model Configurations

Model Configurations:

| Parameter | Value |
| --- | --- |
| Parameters | 350M |
| Model type | OPT |
| FFN dimensions | 4096 |
| Hidden size | 1024 |
| Max position embeddings | 2048 |
| Attention heads | 16 |
| Hidden layers | 24 |
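
For reference, these values match the standard Hugging Face OPT-350M configuration. The snippet below is a small sanity check; the checkpoint name facebook/opt-350m is an assumption, since the card does not state the exact base checkpoint.

```python
from transformers import AutoConfig

# Assumed base checkpoint; prints the fields that correspond to the table above.
config = AutoConfig.from_pretrained("facebook/opt-350m")
print("FFN dimensions:          ", config.ffn_dim)                  # 4096
print("Hidden size:             ", config.hidden_size)              # 1024
print("Max position embeddings: ", config.max_position_embeddings)  # 2048
print("Attention heads:         ", config.num_attention_heads)      # 16
print("Hidden layers:           ", config.num_hidden_layers)        # 24
```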

Training Configurations:

| Parameter | Value |
| --- | --- |
| Train batch size | 64 |
| Train micro batch size | 8 |
| ZeRO stage | 0 |
| FP16 | True |
| Gradient clipping | 1.0 |
| Dropout | 0.1 |
| Prescale gradients | True |

Why did we choose DeepSpeed?

DeepSpeed Training:

The main.py Python script takes the DeepSpeed config with the argument --deepspeed_config ./ds_config.json.

We read up on the DeepSpeed documentation and created a specific configuration based on their work. The JSON file ds_config.json here is set to use the ZeRO-2 stage and FP16, allowing much faster training and saving GPU memory. Note that ZeRO-2 is just one of the available options; you may also use ZeRO-1, ZeRO-3, ZeRO-Offload, and ZeRO-Infinity. For more information on the DeepSpeed ZeRO family, please see this tutorial for ZeRO-1/2/3 and this tutorial for ZeRO-Offload.
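
For illustration, a minimal sketch of what ds_config.json might contain is shown below as a Python dict (the JSON file would use the same keys). It combines the ZeRO-2 and FP16 settings described here with the batch-size, gradient-clipping, and prescale-gradients values from the training table above; it is a hedged reconstruction, not the exact file from the repository, and the ZeRO stage should be set to whichever stage you actually train with.

```python
# Hedged reconstruction of a DeepSpeed config; all keys are standard
# DeepSpeed configuration fields.
ds_config = {
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 8,
    "zero_optimization": {"stage": 2},   # ZeRO-2 as described above
    "fp16": {"enabled": True},
    "gradient_clipping": 1.0,
    "prescale_gradients": True,
}
```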

To enable DeepSpeed ZeRO-family training, we injected several lines of code, e.g.:

# deepspeed.initialize wraps the model, optimizer and LR scheduler in a
# DeepSpeed engine configured by ds_config.json.
model, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    args=args,
    lr_scheduler=lr_scheduler,
    dist_init_required=True)
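
With this in place, the script can be launched through the DeepSpeed launcher, for example `deepspeed main.py --deepspeed_config ./ds_config.json`; any further script-specific arguments are omitted here and depend on main.py itself.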

Acknowledgements

We thank the authors of the following papers and open-source repositories. We especially thank the DeepSpeed team for their framework.