No multi GPU inference support?

#4
by dataautogpt3 - opened

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument tensors in method wrapper_CUDA_cat)
Output generated in 2.42 seconds (0.00 tokens/s, 0 tokens, context 65, seed 459973075)
It seems to me that there is no multi-GPU support for inference at all.

I would appreciate it if this were addressed.
Best wishes, and thank you so much for your hard work!

hi @dataautogpt3
Can you share a reproducible snippet together with the full traceback of the error? thanks

I'm getting the same issue with the following code:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

results in:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/run/determined/pythonuserbase/lib/python3.10/site-packages/transformers/generation/utils.py", line 1718, in generate
    return self.greedy_search(
  File "/run/determined/pythonuserbase/lib/python3.10/site-packages/transformers/generation/utils.py", line 2579, in greedy_search
    outputs = self(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/run/determined/pythonuserbase/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 1244, in forward
    aux_loss = load_balancing_loss_func(
  File "/run/determined/pythonuserbase/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 98, in load_balancing_loss_func
    gate_logits = torch.cat(gate_logits, dim=0)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument tensors in method wrapper_CUDA_cat)

That error happens when computing the auxiliary load-balancing loss. I think the code you are running is not the latest, because I already pushed a fix for this, but I will check.
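For context: with device_map="auto", each decoder layer (and therefore its router output) can end up on a different GPU, so the plain torch.cat over the per-layer gate logits in load_balancing_loss_func fails. Below is a minimal sketch of the kind of device alignment that resolves it; the helper name is just for illustration and the actual patch in transformers may differ in detail:

import torch

def gather_gate_logits(gate_logits):
    # gate_logits is a tuple with one (batch * seq_len, num_experts) tensor per
    # layer; under device_map="auto" these tensors can live on different GPUs.
    # Moving them all to one device before concatenating avoids the
    # "Expected all tensors to be on the same device" error.
    compute_device = gate_logits[0].device
    return torch.cat([layer_gate.to(compute_device) for layer_gate in gate_logits], dim=0)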

@bjoernp can you try:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", revision="refs/pr/5")

text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=20)

Works! Thanks :)

Hi @bjoernp,

Should the code above parallelize the model across multiple GPUs? Is it device_map="auto" that does this?

Thanks.

Hi @bweinstein123
Yes, device_map="auto" should split the model evenly across all available GPUs.
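If you want to verify how the model was split, you can print the device map that accelerate builds at load time (hf_device_map maps module names to the device each one was placed on; the exact placement depends on your GPUs and their free memory):

from transformers import AutoModelForCausalLM

model_id = "mistralai/Mixtral-8x7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Keys are module names, values are the devices they were assigned to
# (GPU indices like 0 and 1, or "cpu"/"disk" if the model does not fit on GPU).
print(model.hf_device_map)

You can also pass max_memory to from_pretrained (e.g. max_memory={0: "40GiB", 1: "40GiB"}) to limit how much each GPU is allowed to hold.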
