Problem with bloom-1b7 finetuning

#1
by gcamposampie - opened

Hi! First of all, thanks for sharing this wonderful model. I'm trying to fine-tune bloom-1b7 using Megatron-Deepspeed on a custom dataset, but I'm stuck on the following error, which I can't really understand.

File "/home/deepspeed/deepspeed/runtime/state_dict_factory.py", line 187, in check_ckpt_list
      assert len(self.ckpt_list) == sd['mp_world_size'], f"checkpoint count {len(self.ckpt_list)} is different from saved mp_world_size {sd['mp_world_size']}"
  AssertionError: checkpoint count 1 is different from saved mp_world_size 4

I'm not really sure why mp_world_size is set to 4, nor whether that is an attribute computed dynamically from the parameters of the training script or one that is already saved with the checkpoint.
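
Something like the following should show what the checkpoint file itself stores (a minimal sketch; the path is just an example matching my checkpoint directory):

import torch

# Peek inside the DeepSpeed model-states file that the loader reads.
sd = torch.load("checkpoints/global_step340500/mp_rank_00_model_states.pt", map_location="cpu")
print(sd.get("mp_world_size"))  # the value the assertion compares against the checkpoint count
print(sorted(sd.keys()))        # other metadata saved next to the module weights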

The training script I'm using to fine-tune the model (adapted from bloomz) is the following.

#!/bin/bash

MODEL=$1 # MODEL

# define the first checkpoint from which to restart the training
# comment this out after the first successful run
if [ "$MODEL"="1b7" ]
then
    echo "global_step340500" > /path/checkpoints/latest # 1b7
else
    echo "global_step660750" > /path/checkpoints/latest # 1b3
fi

DATA_OUTPUT_PATH=/path
CHECKPOINT_PATH=$DATA_OUTPUT_PATH/checkpoints
REPO_PATH=$DATA_OUTPUT_PATH/tr13b-1B3
TENSORBOARD_PATH=$REPO_PATH/tensorboard
LOGS_PATH=$REPO_PATH/logs
mkdir -p $LOGS_PATH
mkdir -p $TENSORBOARD_PATH


MEGATRON_DEEPSPEED_REPO=/home/Megatron-DeepSpeed
cd $MEGATRON_DEEPSPEED_REPO

KILL_SWITCH_PATH=$MEGATRON_DEEPSPEED_REPO/kill-switch-tr13b-1B3-mtf

TRAIN_DATA_PATH=$MEGATRON_DEEPSPEED_REPO/data/train.txt
VALID_DATA_PATH=$MEGATRON_DEEPSPEED_REPO/data/validation.txt
TOKENIZER_NAME_OR_PATH=bigscience/tokenizer

# defining the right environment variables
export TRANSFORMERS_CACHE=$DATA_OUTPUT_PATH/models
export HF_DATASETS_CACHE=$DATA_OUTPUT_PATH/datasets
export HF_MODULES_CACHE=$DATA_OUTPUT_PATH/modules
export HF_METRICS_CACHE=$DATA_OUTPUT_PATH/metrics
export HF_DATASETS_OFFLINE=1
export TRANSFORMERS_OFFLINE=1


GPUS_PER_NODE=8
NNODES=1
PP_SIZE=2
TP_SIZE=2
MICRO_BATCH_SIZE=1
GLOBAL_BATCH_SIZE=8

NLAYERS=24
if [ "$MODEL"="1b7" ]
then
    NHIDDEN=2048
else
    NHIDDEN=1536
fi
NHEADS=16
SEQ_LEN=2048
SAVE_INTERVAL=10
TRAIN_SAMPLES=361

# Uncomment for the first step
# --no-load-optim \
# --reset-progress \
OPTIMIZER_ARGS=" \
    --optimizer adam \
    --adam-beta1 0.9 \
    --adam-beta2 0.95 \
    --adam-eps 1e-8 \
    --lr 2e-5 \
    --lr-decay-style constant \
    --lr-warmup-samples 0 \
    --clip-grad 1.0 \
    --weight-decay 1e-4 \
    --norm-target-loss \
    "

EXIT_OPTS=" \
    --exit-duration-in-mins 5990 \
    "

GPT_ARGS=" \
    --pp-partition-method 'type:transformer|embedding' \
    --num-layers $NLAYERS \
    --hidden-size $NHIDDEN \
    --num-attention-heads $NHEADS \
    --seq-length $SEQ_LEN \
    --max-position-embeddings $SEQ_LEN \
    --micro-batch-size $MICRO_BATCH_SIZE \
    --global-batch-size $GLOBAL_BATCH_SIZE \
    --train-samples $TRAIN_SAMPLES \
    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path $TOKENIZER_NAME_OR_PATH \
    --init-method-std 0.0048 \
    --embed-layernorm \
    --fp16 \
    --seed 42 \
    --position-embedding-type alibi \
    --checkpoint-activations \
    --abort-on-unmet-fused-kernel-constraints \
    --kill-switch-path $KILL_SWITCH_PATH \
    --pad-vocab-size-to 250880 \
    $OPTIMIZER_ARGS \
    $EXIT_OPTS \
    "

OUTPUT_ARGS=" \
    --log-interval 1 \
    --save-interval $SAVE_INTERVAL \
    --eval-interval 125 \
    --eval-iters 10 \
    --tensorboard-dir $TENSORBOARD_PATH \
    --tensorboard-queue-size 5 \
    --log-timers-to-tensorboard \
    --log-batch-size-to-tensorboard \
    --log-validation-ppl-to-tensorboard \
    "

ZERO_STAGE=1

config_json="./ds_config.$SLURM_JOBID.json"

# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size()
cat <<EOT > $config_json
{
  "train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
  "train_batch_size": $GLOBAL_BATCH_SIZE,
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": $ZERO_STAGE
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 500,
    "hysteresis": 2,
    "min_loss_scale": 1,
    "initial_scale_power": 12
  },
  "steps_per_print": 1000,
  "wall_clock_breakdown": false
}
EOT


DEEPSPEED_ARGS=" \
    --deepspeed \
    --deepspeed_config ${config_json} \
    --zero-stage ${ZERO_STAGE} \
    --deepspeed-activation-checkpointing \
    "

export LAUNCHER="python -u -m torch.distributed.run \
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --max_restarts 0 \
    --tee 3 \
    "

export CMD=" \
    `pwd`/finetune_t0.py \
    --tensor-model-parallel-size $TP_SIZE \
    --pipeline-model-parallel-size $PP_SIZE \
    $GPT_ARGS \
    $OUTPUT_ARGS \
    --save $CHECKPOINT_PATH \
    --load $CHECKPOINT_PATH \
    --train-weighted-split-paths-path $TRAIN_DATA_PATH \
    --valid-weighted-split-paths-path $VALID_DATA_PATH \
    --dataloader-type single \
    --data-impl mmap \
    --distributed-backend nccl \
     $DEEPSPEED_ARGS \
    "

echo $CMD

# do not remove or the training will hang and nodes will be lost w/o this workaround
export CUDA_LAUNCH_BLOCKING=1

# hide duplicated errors using this hack - will be properly fixed in pt-1.12
export TORCHELASTIC_ERROR_FILE=/tmp/torch-elastic-error.json

clear; srun --nodes=$NNODES --ntasks-per-node=1 --gres=gpu:$GPUS_PER_NODE bash -c "$LAUNCHER $CMD" 2>&1 | tee -a $LOGS_PATH/main_log.txt
BigScience Workshop org

Hi! The problem is likely that you're trying to finetune it with a different shape than the saved checkpoint.

The 1B7 script: https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/smaller_models/tr11b-1B3-ml.slurm

Comparing it with your script it seems like TP & PP are the same, but your DP is different. 1B7 was trained on 16 nodes (#SBATCH --nodes=16), so DP is 16 * 8 / (2 * 2) = 32, while your DP seems to be 1 * 8 / (2 * 2) = 2, as it looks like you're running on one node?
If you can't scale to the same DP by increasing your nodes, you can also drop the optimizer states with --no-load-optim & just start with a new optimizer from scratch. We also reset the optimizer for bloomz finetuning.
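
To make that arithmetic explicit, here is a small illustrative sketch (nothing in the training scripts computes it in this exact form):

# DP = total GPUs / (tensor-parallel size * pipeline-parallel size)
def dp_size(nnodes, gpus_per_node, tp, pp):
    return (nnodes * gpus_per_node) // (tp * pp)

print(dp_size(16, 8, 2, 2))  # 32 -> the original 1B7 training setup
print(dp_size(1, 8, 2, 2))   # 2  -> the single-node run in your script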

Thanks for your swift feedback! Unfortunately I don't have the hardware resources to start the training on 16 different nodes, and I can only run the fine-tuning on a single node with multiple gpus.

I've now tried to follow your suggestion and drop the optimizer states using the option --no-load-optim, but it doesn't really work. It still tries to load mp_rank_00_model_states.pt and fails because of the mismatch between the actual and expected number of checkpoint files.

The only solution I found to avoid loading the optimizer states is to remove latest from the checkpoint directory. This actually allows the training to start. However, is this equivalent to fine-tuning the pre-trained model, or am I, in this case, training the bloom architecture from scratch?

BigScience Workshop org

Yeah, without latest you are training from scratch. You may also need to add --reset-progress, and maybe some modification in DeepSpeed / MegDS is needed to skip the optimizer loading (it should be possible without one).

I think @razent made it work without loading the optimizer states, maybe he wants to share? 🤗

Yes, I also tried adding --reset-progress as in the SLURM script you linked above, but unfortunately it didn't help.

BigScience Workshop org

Hi,
You should add both --reset-progress and --no-load-optim to the finetuning script and create the latest file as well. The latest file contains the checkpoint's global step (e.g., global_step10000). It will then load the model checkpoint without loading the optimizer.
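
A quick sanity check along these lines (just a sketch; /path/checkpoints stands in for the CHECKPOINT_PATH from your script) can confirm that latest points at a step directory that actually exists:

from pathlib import Path

ckpt = Path("/path/checkpoints")              # your CHECKPOINT_PATH
step = (ckpt / "latest").read_text().strip()  # e.g. "global_step340500"
print(step, (ckpt / step).is_dir())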

Thanks @razent for your answer!

Unfortunately I'm already using this configuration (both arguments plus latest containing the checkpoint's global step), but the tuning still fails.
Is there any other way to fix the mismatch between the mp_world_size (4; I don't quite get what this parameter is supposed to represent) and the actual number of checkpoint files shared in your repository (only 1)?

I went through the code more carefully, and I got the following insights:

  • --no-load-optim and --reset-progress are working fine, as all the flags along the way are correctly set
  • the script gets stuck during the initialization of SDLoaderBase, more precisely in the check_ckpt_list() method
  • here, the checkpoint list for the model states is ['checkpoints/global_step340500/mp_rank_00_model_states.pt'], while the loaded state dict has an mp_world_size equal to 4 (from what I understood, it is expecting 4 files instead of just 1)
  • I tried to hard-code a different mp_world_size (setting it to len(self.ckpt_list)); however, it then fails right after, when loading the first layer file (layer 01), with the following error: "AssertionError: key: attention.dense.weight is not found in the checkpoint checkpoints/global_step340500/layer_01-model_00-model_states.pt" (see the sketch right after this list)
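
To see which keys that layer file actually contains, something like this sketch should work (the path is the one from my checkpoint directory):

import torch

# List the parameter keys stored in the layer_01 file mentioned in the error,
# to compare against the key the loader is asking for (attention.dense.weight).
layer_sd = torch.load("checkpoints/global_step340500/layer_01-model_00-model_states.pt", map_location="cpu")
print(sorted(layer_sd.keys()))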

At this point I'm wondering, is the mp_rank_00_model_states.pt pickle pushed to this repo the correct file? Am I missing something?

BigScience Workshop org
edited Dec 19, 2022

Pretty sure it's the correct file - You can also easily check the contents of the files btw:

import torch
x = torch.load("mp_rank_00_model_states.pt")
x.keys()

I think you just need to change some code in DeepSpeed where it tries to load the optimizer checkpoint despite being given the --no-load-optim instruction.

But the problem is not just with the optimizer state, right? The second exception I get (AssertionError: key: attention.dense.weight is not found in the checkpoint checkpoints/global_step340500/layer_01-model_00-model_states.pt) is still related to it?

BigScience Workshop org

But the problem is not just with the optimizer state, right? The second exception I get (AssertionError: key: attention.dense.weight is not found in the checkpoint checkpoints/global_step340500/layer_01-model_00-model_states.pt) is still related to it?

I think that error is because, when hardcoding a different mp_world_size, you may be messing with the PP & TP configuration, so it looks for attention.dense.weight in the wrong file.

Yes, you might be right. In the end I did not manage to find a solution for the issue, so I switched to bloom-3b. The fine-tuning script for the latter worked like a charm, so I'm quite convinced that it wasn't an issue on my side (but maybe I was just unlucky with the sizes and tuning settings). Thanks anyway for the support!

@gcamposampie could you share any edits you made to get bloom-3b working? I have been getting the same error you had for bloom-1b7, but when I switched to bloom-3b I am still erroring out.

@kschneier I just updated the following parameters according to bloom-3b specs:

NLAYERS=30
NHEADS=32
NHIDDEN=2560

and set latest to "global_step337250", downloading the corresponding checkpoint from HF
