Cannot run SFT full fine-tuning

#74
by hegang126 - opened

So far, I have tried SFT full fine-tuning with DeepSpeed ZeRO-3 on A100 80GB GPUs, and it hangs until the NCCL socket times out after 30 minutes. When I tried LoRA with DeepSpeed ZeRO-2, it fails with OOM, while LoRA with DeepSpeed ZeRO-3 hangs as well!

Only LoRA with 4-bit quantization trains successfully.
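
For reference, the working setup is along these lines (a minimal sketch; the model id, LoRA hyperparameters, and target modules are illustrative, not the exact values I used):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Base model loaded in 4-bit (QLoRA-style), which keeps memory low enough to train.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",  # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters on the attention projections; target_modules here are an example choice.
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)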

Same issue, any workaround?

Any update?

Try marking the MoE block as a ZeRO-3 leaf module:
from deepspeed.utils import set_z3_leaf_modules
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock
set_z3_leaf_modules(model, [MixtralSparseMoeBlock])
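
For context, here is a sketch of where that call would sit in a training script; the model id is just an example, and the explanation in the comments reflects my understanding of why it helps:

from transformers import AutoModelForCausalLM
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock
from deepspeed.utils import set_z3_leaf_modules

# Load the model as usual; with a ZeRO-3 config, the Trainer/DeepSpeed partitions it later.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

# Treat each MixtralSparseMoeBlock as a single ZeRO-3 "leaf": its parameters are
# gathered as one unit, so ranks whose tokens skip some experts do not get stuck
# waiting in a collective, which is what shows up as the hang plus NCCL timeout.
set_z3_leaf_modules(model, [MixtralSparseMoeBlock])

# ... then hand `model` to your Trainer / deepspeed.initialize as before.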

@A-Cepheus thanks for the reply. Inference now succeeds and a loss is returned; however, training still hangs in backward. Any clues?

Any update?

I believe DeepSpeed needs all expert weights to be involved during the forward pass so that ZeRO-3 can correctly sync data between GPUs. If all 8 experts are enabled inside config.json, the problem goes away.
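
If that is the idea, one way to force every expert into the forward pass is to raise num_experts_per_tok to the total number of experts. A sketch of my understanding follows; whether this matches the exact config.json change meant above is an assumption, and it does increase MoE compute per token:

from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
# Route every token through all experts so each expert's weights participate in
# forward/backward on every rank (assumed reading of "all 8 experts are enabled").
config.num_experts_per_tok = config.num_local_experts  # 8 for Mixtral-8x7B
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1", config=config
)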

When I train the Mixtral model, it hangs after 270 steps, with GPU utilization at 100% until the NCCL timeout.
