Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:0!
Hi,
I'm encountering this issue again: https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B/discussions/1, this time using 2 H100 devices. Your custom code does not seem to fix it.
I can run it normally using 4x 80G A100s. Could you please provide more error information?
I could fix it like this:
FIX: `inv_freq_expanded` being on the CPU causes the matrix multiplication to fail!

```python
import torch


@torch.no_grad()
def rot_embed_forward_fix(self, x, position_ids):
    if "dynamic" in self.rope_type:
        self._dynamic_frequency_update(position_ids, device=x.device)
    # Core RoPE block
    # FIX: move the expanded inverse frequencies onto the input's device
    inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(x.device)
    position_ids_expanded = position_ids[:, None, :].float()
    # Force float32 (see https://github.com/huggingface/transformers/pull/29285)
    device_type = x.device.type
    device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
    with torch.autocast(device_type=device_type, enabled=False):
        freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
        emb = torch.cat((freqs, freqs), dim=-1)
        cos = emb.cos()
        sin = emb.sin()
    # Advanced RoPE types (e.g. yarn) apply a post-processing scaling factor, equivalent to scaling attention
    cos = cos * self.attention_scaling
    sin = sin * self.attention_scaling
    return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
```
```python
from transformers import AutoModel

self.model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    load_in_4bit=False,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map=device_map,
).eval()

# After loading, patch the rotary embedding forward on the larger Llama-based variants
if '40B' in self.model_name or '76B' in self.model_name:
    self.model.language_model.model.rotary_emb.__class__.forward = rot_embed_forward_fix
```
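If you hit the same error and want to see where the pieces actually landed before patching anything, printing the relevant devices is usually enough. A minimal sketch, assuming the same attribute layout as above (i.e. `language_model.model` is a standard Llama model with an `embed_tokens` module) and that the model was loaded with a `device_map`, so `hf_device_map` is populated:

```python
lm = self.model.language_model.model

# Where the rotary-embedding buffer and the token embeddings ended up
print("rotary inv_freq buffer:", lm.rotary_emb.inv_freq.device)
print("embed_tokens weights:  ", lm.embed_tokens.weight.device)

# Which device each sharded submodule was dispatched to
for name, device in getattr(self.model, "hf_device_map", {}).items():
    print(f"{name} -> {device}")
```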
My issue was `inv_freq_expanded` being on the CPU; the only way I could find to fix it was to override `model.language_model.model.rotary_emb.forward` with the function above.
In my case the mismatch was between cpu and cuda:0, but this might still resolve your issue, or at least help someone else.
For me this is necessary for the Llama-family language models (the 40B and 76B models); I am running on P40s.
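As an aside, if you would rather not overwrite `forward` on the class itself (which affects every instance of that rotary-embedding class), binding the fixed function to just that one module instance should behave the same. A small sketch of that alternative, which I have not tested on the 76B checkpoint:

```python
import types

# Bind the fixed forward to this single rotary-embedding instance instead of
# replacing the method on its class for every instance.
rotary = self.model.language_model.model.rotary_emb
rotary.forward = types.MethodType(rot_embed_forward_fix, rotary)
```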
Thanks @HondaVfr800! I can confirm that your fix is working on two H100s.