Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:0!

#6
by Maverick17 - opened

Hi,

I'm encountering this issue again: https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B/discussions/1 using 2 H100 devices. Your custom code does not seem to fix it.

I can run it normally using 4x 80G A100s. Could you please provide more error information?

I could fix it like this:

import torch

# FIX: inv_freq_expanded is on the CPU, which makes the matrix multiplication fail.

@torch.no_grad()
def rot_embed_forward_fix(self, x, position_ids):
    if "dynamic" in self.rope_type:
        self._dynamic_frequency_update(position_ids, device=x.device)

    # Core RoPE block
    inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(x.device)  # FIX
    position_ids_expanded = position_ids[:, None, :].float()
    # Force float32 (see https://github.com/huggingface/transformers/pull/29285)
    device_type = x.device.type
    device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
    with torch.autocast(device_type=device_type, enabled=False):
        freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
        emb = torch.cat((freqs, freqs), dim=-1)
        cos = emb.cos()
        sin = emb.sin()

    # Advanced RoPE types (e.g. yarn) apply a post-processing scaling factor, equivalent to scaling attention
    cos = cos * self.attention_scaling
    sin = sin * self.attention_scaling

    return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)

    self.model = AutoModel.from_pretrained(
        path,
        torch_dtype=torch.bfloat16,
        load_in_8bit=True,
        load_in_4bit=False,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
        device_map=device_map,
    ).eval()

    if '40B' in self.model_name or '76B' in self.model_name:
        self.model.language_model.model.rotary_emb.__class__.forward = rot_embed_forward_fix
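
One thing to be aware of: assigning to `__class__.forward` replaces the method for every instance of that rotary embedding class, not only this model's. If that matters, the patch can probably be limited to a single instance with `types.MethodType`. A minimal sketch of the idea, using a hypothetical stand-in class `RotaryStub` instead of the real module (an illustration only, not tested against the InternVL2 remote code):

```python
import types


class RotaryStub:
    """Hypothetical stand-in for the rotary embedding module."""

    def forward(self, x):
        return "original"

    def __call__(self, x):
        # Like torch.nn.Module, dispatch through self.forward.
        return self.forward(x)


def patched_forward(self, x):
    return "patched"


first, second = RotaryStub(), RotaryStub()

# Bind the replacement to one instance only; the class itself is left untouched.
first.forward = types.MethodType(patched_forward, first)

print(first(None))   # patched
print(second(None))  # original
```

Patching the class, as in the snippet above, is simpler and fine when only one model is loaded in the process.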

My issue was that inv_freq_expanded was on the CPU. The only way I could find to fix this was to override model.language_model.model.rotary_emb.forward with the function above.

In my case the two devices were cpu and cuda:0, but it might still resolve your issue, or at least help someone else.
For me this is necessary for the Llama-family language models (the 40B and 76B models). I am running on P40s.
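
To check where the tensors actually live before patching anything, it can help to list the devices of a module's parameters and buffers. A rough sketch, assuming the rotary embedding keeps `inv_freq` as a buffer (as the Llama implementation in transformers does); `devices_by_tensor` is a hypothetical helper name, and it relies only on the standard `named_parameters()` and `named_buffers()` methods of `torch.nn.Module`:

```python
from collections import defaultdict


def devices_by_tensor(module):
    """Map each device string to the parameter/buffer names stored on it."""
    placement = defaultdict(list)
    for name, tensor in module.named_parameters():
        placement[str(tensor.device)].append(name)
    # Buffers (such as inv_freq) are not parameters, so list them separately.
    for name, tensor in module.named_buffers():
        placement[str(tensor.device)].append(name)
    return dict(placement)


# Usage (assumes `model` was loaded as in the snippet above):
# print(devices_by_tensor(model.language_model.model.rotary_emb))
```

If `inv_freq` shows up under `cpu` while the inputs are on a CUDA device, that matches the mismatch described here.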

Thanks @HondaVfr800! I can confirm that your fix works on two H100s.

zwgao changed discussion status to closed
