Model parallelism error, with a proposed fix
#54
by
yuanzhoulvpi
- opened
When running with model parallelism, training fails without this line of code, so I suggest adding it.
yuanzhoulvpi
changed pull request title from
Update modeling_chatglm.py
to Model parallelism error
yuanzhoulvpi
changed pull request title from
Model parallelism error
to Model parallelism error, with a proposed fix
In this part of the code:
# Shift so that tokens < n predict n
shift_logits = lm_logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()  # .to(shift_logits.device)
# Flatten the tokens
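The shift above is the standard causal-LM loss setup: the logits at position i are scored against the label at position i+1, so each token predicts the next one. A minimal standalone sketch (toy tensors, not the actual model):

```python
import torch
import torch.nn.functional as F

# Toy batch: 1 sequence of length 4 over a vocabulary of 5 tokens.
lm_logits = torch.randn(1, 4, 5)
labels = torch.tensor([[1, 2, 3, 4]])

# Shift so that tokens < n predict n: drop the last logit position
# and the first label position, then align them.
shift_logits = lm_logits[..., :-1, :].contiguous()  # shape (1, 3, 5)
shift_labels = labels[..., 1:].contiguous()         # shape (1, 3)

# Flatten the tokens and compute the loss, as in modeling_chatglm.py.
loss = F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                       shift_labels.view(-1))
```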
The training setup follows https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/chatglm_v2_6b_lora
Without adding that line, multi-GPU training fails.
Error location: loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
Exception raised: RuntimeError
Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument target in method wrapper_CUDA_nll_loss_forward)
File "/home/yuanz/.cache/huggingface/modules/transformers_modules/chatglm2-6b_model/modeling_chatglm.py", line 958, in forward
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
File "/media/yuanz/新加卷/训练代码/chatglm6b_v2_0716/main.py", line 371, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/media/yuanz/新加卷/训练代码/chatglm6b_v2_0716/main.py", line 432, in <module>
main()
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument target in method wrapper_CUDA_nll_loss_forward)
After adding that line, training runs:
[INFO|trainer.py:1786] 2023-07-17 21:18:51,711 >> ***** Running training *****
[INFO|trainer.py:1787] 2023-07-17 21:18:51,711 >> Num examples = 114,599
[INFO|trainer.py:1788] 2023-07-17 21:18:51,711 >> Num Epochs = 1
[INFO|trainer.py:1789] 2023-07-17 21:18:51,711 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1790] 2023-07-17 21:18:51,711 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1791] 2023-07-17 21:18:51,711 >> Gradient Accumulation steps = 16
[INFO|trainer.py:1792] 2023-07-17 21:18:51,711 >> Total optimization steps = 3,000
[INFO|trainer.py:1793] 2023-07-17 21:18:51,714 >> Number of trainable parameters = 7,798,784
0%| | 0/3000 [00:00<?, ?it/s]07/17/2023 21:18:51 - WARNING - transformers_modules.chatglm2-6b_model.modeling_chatglm - `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 6.0639, 'learning_rate': 1.9933333333333334e-05, 'epoch': 0.0}
{'loss': 6.0275, 'learning_rate': 1.9866666666666667e-05, 'epoch': 0.0}
{'loss': 5.8822, 'learning_rate': 1.98e-05, 'epoch': 0.0}
{'loss': 5.5463, 'learning_rate': 1.9733333333333336e-05, 'epoch': 0.01}
{'loss': 5.028, 'learning_rate': 1.9666666666666666e-05, 'epoch': 0.01}
{'loss': 4.534, 'learning_rate': 1.9600000000000002e-05, 'epoch': 0.01}
The fix
# Around line 955 of modeling_chatglm.py
# Shift so that tokens < n predict n
shift_logits = lm_logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous().to(shift_logits.device)  # add this .to() call
# Flatten the tokens
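A sketch of why the fix works (CPU stand-in for the multi-GPU case; the tensor shapes here are toy values, not the model's):

```python
import torch
import torch.nn.functional as F

# Under model parallelism, the final layers (and hence lm_logits) can land
# on cuda:1 while the input labels remain on cuda:0. cross_entropy/NLLLoss
# requires its input and target on the same device, hence the RuntimeError.
lm_logits = torch.randn(2, 4, 5)
labels = torch.randint(0, 5, (2, 4))

shift_logits = lm_logits[..., :-1, :].contiguous()
# .to(shift_logits.device) is a no-op when the devices already match
# (single-GPU or CPU), and moves the labels next to the logits when
# they do not -- which is what makes multi-GPU training run.
shift_labels = labels[..., 1:].contiguous().to(shift_logits.device)

loss = F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                       shift_labels.view(-1))
```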