Model parallel training error, with a proposed fix

#54

When running with model parallelism, training errors out if this line of code is missing; I suggest adding it.


In this part of the current code:

            # Shift so that tokens < n predict n
            shift_logits = lm_logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()  # missing: .to(shift_logits.device)
            # Flatten the tokens

The training setup follows https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/chatglm_v2_6b_lora
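For context, the error only appears when the model is actually split across GPUs. Below is a minimal sketch of such a model-parallel load, assuming the usual Hugging Face device_map="auto" path; the exact loading code in the linked repo may differ, and the model path is a placeholder.

    # Minimal sketch: loading ChatGLM2-6B sharded across the visible GPUs.
    # With device_map="auto", the final layers (and therefore lm_logits) can
    # end up on cuda:1 while the input labels stay on cuda:0.
    from transformers import AutoModel, AutoTokenizer

    model_path = "THUDM/chatglm2-6b"  # placeholder; a local copy works the same way
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModel.from_pretrained(
        model_path,
        trust_remote_code=True,
        device_map="auto",  # shards the transformer layers across cuda:0, cuda:1, ...
    )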

Without this addition, multi-GPU training fails with an error.

The error is raised at: loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))



Exception raised: RuntimeError
Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument target in method wrapper_CUDA_nll_loss_forward)
  File "/home/yuanz/.cache/huggingface/modules/transformers_modules/chatglm2-6b_model/modeling_chatglm.py", line 958, in forward
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
  File "/media/yuanz/新加卷/训练代码/chatglm6b_v2_0716/main.py", line 371, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/media/yuanz/新加卷/训练代码/chatglm6b_v2_0716/main.py", line 432, in <module>
    main()
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument target in method wrapper_CUDA_nll_loss_forward)
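The same RuntimeError can be reproduced in isolation: CrossEntropyLoss requires its input and target tensors to live on the same device. A toy repro (assumes at least two GPUs):

    import torch
    import torch.nn as nn

    loss_fct = nn.CrossEntropyLoss()
    logits = torch.randn(4, 10, device="cuda:1")          # like lm_logits on the last pipeline stage
    labels = torch.randint(0, 10, (4,), device="cuda:0")  # labels still on the first device

    try:
        loss_fct(logits, labels)  # RuntimeError: Expected all tensors to be on the same device ...
    except RuntimeError as e:
        print(e)

    loss = loss_fct(logits, labels.to(logits.device))  # moving the target fixes it
    print(loss.item())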

After adding that line, training runs:

[INFO|trainer.py:1786] 2023-07-17 21:18:51,711 >> ***** Running training *****
[INFO|trainer.py:1787] 2023-07-17 21:18:51,711 >>   Num examples = 114,599
[INFO|trainer.py:1788] 2023-07-17 21:18:51,711 >>   Num Epochs = 1
[INFO|trainer.py:1789] 2023-07-17 21:18:51,711 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:1790] 2023-07-17 21:18:51,711 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1791] 2023-07-17 21:18:51,711 >>   Gradient Accumulation steps = 16
[INFO|trainer.py:1792] 2023-07-17 21:18:51,711 >>   Total optimization steps = 3,000
[INFO|trainer.py:1793] 2023-07-17 21:18:51,714 >>   Number of trainable parameters = 7,798,784
  0%|                                                                                                                                                                             | 0/3000 [00:00<?, ?it/s]07/17/2023 21:18:51 - WARNING - transformers_modules.chatglm2-6b_model.modeling_chatglm - `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 6.0639, 'learning_rate': 1.9933333333333334e-05, 'epoch': 0.0}                                                                                                                                    
{'loss': 6.0275, 'learning_rate': 1.9866666666666667e-05, 'epoch': 0.0}                                                                                                                                    
{'loss': 5.8822, 'learning_rate': 1.98e-05, 'epoch': 0.0}                                                                                                                                                  
{'loss': 5.5463, 'learning_rate': 1.9733333333333336e-05, 'epoch': 0.01}                                                                                                                                   
{'loss': 5.028, 'learning_rate': 1.9666666666666666e-05, 'epoch': 0.01}                                                                                                                                    
{'loss': 4.534, 'learning_rate': 1.9600000000000002e-05, 'epoch': 0.01}  

How to fix it

# Around line 955 of modeling_chatglm.py
            # Shift so that tokens < n predict n
            shift_logits = lm_logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous().to(shift_logits.device)  # add the .to(shift_logits.device) call here
            # Flatten the tokens
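Put together, the patched loss computation looks roughly like the sketch below, simplified and pulled out of the forward method; the ignore_index value is an assumption based on common Hugging Face practice, so check the actual file.

    import torch
    from torch.nn import CrossEntropyLoss

    def lm_loss(lm_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Shift so that tokens < n predict n
        shift_logits = lm_logits[..., :-1, :].contiguous()
        # Move the labels to the device that holds the logits; under model
        # parallelism they may sit on different GPUs.
        shift_labels = labels[..., 1:].contiguous().to(shift_logits.device)
        # Flatten the tokens and compute the loss on a single device
        loss_fct = CrossEntropyLoss(ignore_index=-100)
        return loss_fct(shift_logits.view(-1, shift_logits.size(-1)),
                        shift_labels.view(-1))

Moving the small label tensor to the logits' device is cheaper than copying the much larger logits tensor the other way, which is why the fix touches shift_labels rather than shift_logits.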