Inference on two H100s doesn't work
Hi,
inference doesn't work for me on two H100s, even with the code you've provided:
Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
Hi, it seems that the image tensor is not placed on the correct GPU; you can try moving it there explicitly.
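For example, a minimal sketch (here pixel_values stands for whatever image tensor you pass to model.chat, and cuda:0 is assumed to be the GPU that hosts the vision tower):

# Move the preprocessed image tensor onto the GPU that holds the vision model
pixel_values = pixel_values.to(torch.bfloat16).to("cuda:0")
response = model.chat(tokenizer, pixel_values, question, generation_config)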
I could fix it like this:
Fix: inv_freq_expanded being on the CPU causes the matrix multiplication to fail!
@torch.no_grad()
def rot_embed_forward_fix(self, x, position_ids):
    if "dynamic" in self.rope_type:
        self._dynamic_frequency_update(position_ids, device=x.device)
    # Core RoPE block
    inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(x.device)  # FIX
    position_ids_expanded = position_ids[:, None, :].float()
    # Force float32 (see https://github.com/huggingface/transformers/pull/29285)
    device_type = x.device.type
    device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
    with torch.autocast(device_type=device_type, enabled=False):
        freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
        emb = torch.cat((freqs, freqs), dim=-1)
        cos = emb.cos()
        sin = emb.sin()
    # Advanced RoPE types (e.g. yarn) apply a post-processing scaling factor, equivalent to scaling attention
    cos = cos * self.attention_scaling
    sin = sin * self.attention_scaling
    return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
self.model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    load_in_4bit=False,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map=device_map
).eval()

if '40B' in self.model_name or '76B' in self.model_name:
    self.model.language_model.model.rotary_emb.__class__.forward = rot_embed_forward_fix
My issue was inv_freq_expanded being on the CPU; the only way I could find to fix this was to override model.language_model.model.rotary_emb.forward with the function above.
For me this is necessary for the Llama-family language models (the 40B and 76B models); I am running on P40s.
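As an alternative to patching the class, you can bind the fix to just the one rotary-embedding instance. A sketch using the same rot_embed_forward_fix as above, where model is your loaded AutoModel (self.model in the snippet above):

import types

rotary_emb = model.language_model.model.rotary_emb
# Bind the fixed forward to this instance only, leaving the original class untouched
rotary_emb.forward = types.MethodType(rot_embed_forward_fix, rotary_emb)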
@HondaVfr800 @czczup I am unable to replicate your code as-is. I am defining a custom chat model so I can also use the model with LangChain. I am running on 4 Nvidia A10Gs. Does the following look correct?
from typing import Any, List, Optional

from langchain_core.language_models import BaseChatModel
from langchain_core.messages import AIMessage, BaseMessage
from langchain_core.outputs import ChatGeneration, ChatResult
import torch
import math
from transformers import AutoModel, AutoTokenizer


@torch.no_grad()
def rot_embed_forward_fix(self, x, position_ids):
    if "dynamic" in self.rope_type:
        self._dynamic_frequency_update(position_ids, device=x.device)
    # Core RoPE block
    inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(x.device)  # FIX
    position_ids_expanded = position_ids[:, None, :].float()
    # Force float32 (see https://github.com/huggingface/transformers/pull/29285)
    device_type = x.device.type
    device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
    with torch.autocast(device_type=device_type, enabled=False):
        freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
        emb = torch.cat((freqs, freqs), dim=-1)
        cos = emb.cos()
        sin = emb.sin()
    # Advanced RoPE types (e.g. yarn) apply a post-processing scaling factor, equivalent to scaling attention
    cos = cos * self.attention_scaling
    sin = sin * self.attention_scaling
    return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
class CustomChatModel(BaseChatModel):
    model: Any = None
    tokenizer: Any = None
    generation_config: Optional[dict] = None

    def __init__(self, model_path: str, model_name: str):
        super().__init__()
        self.model = AutoModel.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            load_in_8bit=True,
            low_cpu_mem_usage=True,
            trust_remote_code=True,
            device_map=self.split_model(model_name)
        ).eval()
        self.model.language_model.model.rotary_emb.__class__.forward = rot_embed_forward_fix
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_path,
            trust_remote_code=True,
            use_fast=False
        )
        self.generation_config = dict(max_new_tokens=1024, do_sample=True, temperature=0.001)
    def _generate(
        self,
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[Any] = None,
        **kwargs: Any
    ) -> ChatResult:
        """Override the _generate method to implement the chat model logic.

        Args:
            messages (List[BaseMessage]): The list of messages to generate responses from.
            stop (Optional[List[str]], optional): The list of stop words. Defaults to None.
            run_manager (Optional[Any], optional): The run manager. Defaults to None.

        Returns:
            ChatResult: The chat result as a LangChain object to be used by parsers.
        """
        prompt = messages[-1].content
        response = self.model.chat(
            self.tokenizer,
            None,
            prompt,
            self.generation_config
        )
        message = AIMessage(content=response)
        generation = ChatGeneration(message=message)
        return ChatResult(generations=[generation])
    def chat(self, pixel_values, prompt, generation_config: Optional[dict] = None, num_patches_list=None) -> str:
        """Generate a response to a multimodal input.

        Args:
            pixel_values (torch.Tensor): Pixel values of the input image.
            prompt (str): The prompt to generate a response from.
            generation_config (Optional[dict], optional): The generation config. Defaults to None.

        Returns:
            str: The generated response.
        """
        if generation_config is None:
            generation_config = self.generation_config
        if num_patches_list is None:
            return self.model.chat(self.tokenizer, pixel_values, prompt, generation_config)
        else:
            return self.model.chat(self.tokenizer, pixel_values, prompt, generation_config, num_patches_list=num_patches_list)
    def split_model(self, model_name):
        device_map = {}
        world_size = torch.cuda.device_count()
        num_layers = {
            'InternVL2-1B': 24, 'InternVL2-2B': 24, 'InternVL2-4B': 32, 'InternVL2-8B': 32,
            'InternVL2-26B': 48, 'InternVL2-40B': 60, 'InternVL2-Llama3-76B': 80}[model_name]
        # Since the first GPU will be used for ViT, treat it as half a GPU.
        num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
        num_layers_per_gpu = [num_layers_per_gpu] * world_size
        num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
        layer_cnt = 0
        for i, num_layer in enumerate(num_layers_per_gpu):
            for j in range(num_layer):
                device_map[f'language_model.model.layers.{layer_cnt}'] = i
                layer_cnt += 1
        device_map['vision_model'] = 0
        device_map['mlp1'] = 0
        device_map['language_model.model.tok_embeddings'] = 0
        device_map['language_model.model.embed_tokens'] = 0
        device_map['language_model.output'] = 0
        device_map['language_model.model.norm'] = 0
        device_map['language_model.lm_head'] = 0
        device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
        return device_map

    @property
    def _llm_type(self) -> str:
        """Get the type of language model used by this chat model."""
        return "InternVL2-40B"
# Usage example
if __name__ == "__main__":
    # Example instantiation and usage
    model_path = "./supply/InternVL2-2B"  # Path to the model
    model = CustomChatModel(model_path=model_path, model_name="InternVL2-2B")

    # Example conversation
    prompt = "Hello, who are you?"
    output = model.invoke(prompt)
    print(output)
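    # A possible multimodal call along the same lines (a sketch): load_image here is
    # assumed to be the image-preprocessing helper from the InternVL model card
    # examples, not something defined in this file.
    pixel_values = load_image("./example.jpg", max_num=12).to(torch.bfloat16).cuda()
    answer = model.chat(pixel_values, "<image>\nDescribe this image.", dict(max_new_tokens=256))
    print(answer)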
We have not explored supporting LangChain yet; we also welcome contributions from the community. Are there any problems running this code?
Hello, thank you for your feedback. Could you please let me know which version of Transformers you are using? I have seen related issues where this error occurs when using newer versions of Transformers, such as 4.44.0. If you downgrade to 4.37.2, the issue can be resolved.
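For reference, a quick way to check which version is installed (4.37.2 is simply the version reported to work in this thread; 4.44.0 was reported to fail):

import transformers
print(transformers.__version__)
# To downgrade, run in a shell:
#   pip install transformers==4.37.2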
Hello, I was able to resolve the issues I was facing by using the rotary embedding fix above as well as setting the following environment variable: PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync
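If you set it from Python rather than the shell, a minimal sketch is to export it before importing torch so it is picked up when the allocator initializes:

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "backend:cudaMallocAsync"  # set before the CUDA allocator is initialized
import torch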
Additionally, running torch.cuda.empty_cache() after each inference was extremely useful in making sure we didn't run out of GPU memory.
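For example (a sketch; llm and pixel_values are whatever you already have in scope from the CustomChatModel above):

response = llm.chat(pixel_values, prompt)  # one multimodal inference
torch.cuda.empty_cache()                   # release cached blocks so the next request has headroom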
Finally, this code is fully operational as part of a LangChain agent and is simple to use. For example, the following creates the custom model and then instantiates an output-fixing parser (https://python.langchain.com/v0.1/docs/modules/model_io/output_parsers/types/output_fixing/):

llm = CustomChatModel(model_path, model_name="InternVL2-40B")
fix_parser = OutputFixingParser.from_llm(parser=parser, llm=llm)
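The only extra wiring needed is the import (assuming parser is whatever LangChain output parser you already use, e.g. a PydanticOutputParser):

from langchain.output_parsers import OutputFixingParser  # wraps an existing parser and retries parse failures through the LLM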
If you would like me to make a community contribution, kindly direct me to the best way to do so.
Thank you very much for the feedback.