Weird fine-tuning problem

#2
by joorei - opened

Hello,

I am fine-tuning dolphin-mixtral with axolotl. I took inspiration from your config and chose the qlora modules. What is interesting is that I can fine-tune dolphin-2.5-mixtral-8x7b, but when I just change the "5" to a "6" and otherwise keep the config the same (and remove and recreate the output directory), I get the following error:

 File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1854, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2735, in training_step
    loss = self.compute_loss(model, inputs)
  File "/workspace/axolotl/src/axolotl/core/trainer_builder.py", line 291, in compute_loss
    return super().compute_loss(model, inputs, return_outputs=return_outputs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2758, in compute_loss
    outputs = model(**inputs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/utils/operations.py", line 659, in forward
    return model_forward(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/utils/operations.py", line 647, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/peft/peft_model.py", line 977, in forward
    return self.base_model(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 106, in forward
    return self.model.forward(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 1258, in forward
    loss += self.router_aux_loss_coef * aux_loss
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:6 and cuda:0!
```
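
For context, the failing line just adds Mixtral's router load-balancing loss onto the language-model loss. My rough understanding (a sketch only, with made-up values, assuming the model is sharded across several GPUs as in my run) is that the two losses end up on different devices:

```python
import torch

# Hypothetical values; assumes a multi-GPU box like the one in the traceback.
loss = torch.tensor(2.31, device="cuda:0")      # language-model loss on the first GPU
aux_loss = torch.tensor(0.04, device="cuda:6")  # router load-balancing loss on a later GPU
router_aux_loss_coef = 0.02

# This mirrors the line from the traceback and raises the same
# "Expected all tensors to be on the same device" RuntimeError:
#   loss += router_aux_loss_coef * aux_loss

# If I move the aux loss onto the loss device first, the addition itself goes through:
loss = loss + router_aux_loss_coef * aux_loss.to(loss.device)
print(loss)
```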

Any idea what could be different about 2.5 vs 2.6 that could cause this?

2.7 is the same btw.
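
In case it helps, a quick way to compare the two model configs would be something like this (a sketch; the repo ids are my guess at how they appear on the Hub, so adjust them if they differ):

```python
from transformers import AutoConfig

# Repo ids as I believe they appear on the Hub; adjust if yours differ.
cfg_25 = AutoConfig.from_pretrained("cognitivecomputations/dolphin-2.5-mixtral-8x7b").to_dict()
cfg_26 = AutoConfig.from_pretrained("cognitivecomputations/dolphin-2.6-mixtral-8x7b").to_dict()

# Print every config key whose value differs between 2.5 and 2.6.
for key in sorted(set(cfg_25) | set(cfg_26)):
    if cfg_25.get(key) != cfg_26.get(key):
        print(f"{key}: {cfg_25.get(key)} -> {cfg_26.get(key)}")
```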

I'm still trying things, but changing output_router_logits to false does not help.
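
For reference, outside axolotl this is how I would expect that flag to be set when loading the model directly with transformers (a sketch only, not my actual training setup; the repo id is again just illustrative):

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "cognitivecomputations/dolphin-2.6-mixtral-8x7b"  # illustrative repo id

cfg = AutoConfig.from_pretrained(model_id)
cfg.output_router_logits = False  # the router aux loss should only be added when this is True

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=cfg,
    device_map="auto",
)
```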
