Cannot run SFT full fine-tuning

#74
by hegang126 - opened

So far, I have tried SFT full fine-tuning with DeepSpeed ZeRO-3 on A100 80GB GPUs, and it hangs until the NCCL socket times out after 30 minutes. When I tried LoRA with DeepSpeed ZeRO-2, it fails with OOM, while LoRA with DeepSpeed ZeRO-3 hangs as well!

Only LoRA with 4-bit quantization trains successfully.
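
For reference, the working setup is along these lines (a minimal sketch; the model id, LoRA hyperparameters, and target modules are illustrative, not the exact values I used):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Base model loaded in 4-bit (QLoRA-style), which keeps memory low enough to train.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",  # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters on the attention projections; target_modules here are an example choice.
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)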

Same issue, any workaround?

Any update?

Try marking the MoE block as a ZeRO-3 leaf module:
from deepspeed.utils import set_z3_leaf_modules
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock
set_z3_leaf_modules(model, [MixtralSparseMoeBlock])
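
For context, here is a sketch of where that call would sit in a training script; the model id is just an example, and the explanation in the comments reflects my understanding of why it helps:

from transformers import AutoModelForCausalLM
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock
from deepspeed.utils import set_z3_leaf_modules

# Load the model as usual; with a ZeRO-3 config, the Trainer/DeepSpeed partitions it later.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

# Treat each MixtralSparseMoeBlock as a single ZeRO-3 "leaf": its parameters are
# gathered as one unit, so ranks whose tokens skip some experts do not get stuck
# waiting in a collective, which is what shows up as the hang plus NCCL timeout.
set_z3_leaf_modules(model, [MixtralSparseMoeBlock])

# ... then hand `model` to your Trainer / deepspeed.initialize as before.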

@A-Cepheus thanks for the reply. Inference now succeeds and a loss is returned; however, training still hangs in backward. Any clues?

Any update?

I believe DeepSpeed needs all expert weights to be involved during the forward pass so that ZeRO-3 can correctly sync data between GPUs. If all 8 experts are enabled inside config.json, the problem goes away.
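
If that is the idea, one way to force every expert into the forward pass is to raise num_experts_per_tok to the total number of experts. A sketch of my understanding follows; whether this matches the exact config.json change meant above is an assumption, and it does increase MoE compute per token:

from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
# Route every token through all experts so each expert's weights participate in
# forward/backward on every rank (assumed reading of "all 8 experts are enabled").
config.num_experts_per_tok = config.num_local_experts  # 8 for Mixtral-8x7B
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1", config=config
)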

When I train the Mixtral model, it hangs after 270 steps, with GPU utilization at 100% until the NCCL timeout.
