target_modules for this Phi-3-small model
After loading the model, use

for name, module in model.named_modules():
    print(name)

to get the module names of the layers.
For this model they are [up_proj, down_proj].
I'm sorry, but would it be possible to clarify the question a bit more or provide some additional context? I'm not sure I understand the issue.
import transformers
model_name = "microsoft/Phi-3-small-128k-instruct"  # Replace with your desired Phi-3-small variant
model = transformers.AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
for name, module in model.named_modules():
    print(name)
Running this code prints a list of all the modules in the model, including the per-layer submodules. For Phi-3-small, you can expect to see names such as:

up_proj
down_proj
...  # other modules in the model

This shows that the key projection modules inside each layer of Phi-3-small are named up_proj and down_proj. Consult the Phi-3 documentation for a detailed explanation of their roles within the model's architecture.
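If the full list is too long to scan, a quick way to narrow it down is to keep only the names of the Linear submodules, since those are the usual LoRA targets. A minimal sketch, continuing from the model loaded above (the filtering itself is just an illustration, not part of the original snippet):

import torch

# Keep only the leaf names of Linear layers, e.g. "up_proj" and "down_proj".
linear_leaf_names = {
    name.split(".")[-1]
    for name, module in model.named_modules()
    if isinstance(module, torch.nn.Linear)
}
print(sorted(linear_leaf_names))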
That is accurate. up_proj and down_proj are part of the MLP layer with GEGLU activation (https://arxiv.org/pdf/2002.05202). See this line.
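If the goal is LoRA fine-tuning with peft, those names can be passed as target_modules. A minimal sketch under that assumption (the rank, alpha, and dropout values are illustrative defaults, not a recommendation from this thread, and model is the model loaded earlier):

from peft import LoraConfig, get_peft_model

# Hypothetical LoRA settings; only target_modules comes from this thread.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # shows how many parameters LoRA will train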
I got the runtime error below when running inference on the model with device_map="auto". Does it only work with a single GPU for inference?
This problem only happens with small; medium and mini work just fine. :shrug:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument tensors in method wrapper_CUDA_cat)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
torch.random.manual_seed(0)
model_id = "microsoft/Phi-3-small-8k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    trust_remote_code=True,
    device_map="auto",
)
assert torch.cuda.is_available(), "This model needs a GPU to run ..."
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
]
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
)
generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}
output = pipe(messages, **generation_args)
print(output[0]['generated_text'])
Huh, interesting.
For some reason, it seems like the pipeline allocated the model on one GPU and the input tensors on another (one on "cuda:0", the other on "cuda:1").
I'd say it might be better to explicitly control the device placement, just to avoid any confusion. Copying from the README below:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "microsoft/Phi-3-small-8k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    trust_remote_code=True,
)
assert torch.cuda.is_available(), "This model needs a GPU to run ..."
device = torch.cuda.current_device()  # <----- Explicitly specify the device to send the model to
model = model.to(device)              # <----- Send the model to that particular device
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
]
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=device,  # <----- Also tell the pipeline to use the same device when creating the input tensors
)
Let me know if this fixes the issue?
By multi-GPU inference, do you want to do data-parallel inference, or tensor slicing?
Data parallelism can be done by running the script with any launcher of your choice (torchrun/deepspeed/mpi); just set the current_device correctly based on the local rank, and that should work, IMO (see the sketch below).
Tensor slicing is a separate problem: it's hard to give more info without knowing how you want to do the tensor slicing.
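For the data-parallel case, here is a minimal sketch of the local-rank device selection, assuming a launcher such as torchrun that exports LOCAL_RANK for each process; the rest mirrors the README snippet above:

import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# torchrun (and most launchers) export LOCAL_RANK for each worker process.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
device = torch.cuda.current_device()

model_id = "microsoft/Phi-3-small-8k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    trust_remote_code=True,
).to(device)  # pin this worker's copy of the model to its own GPU
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=device)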
Thanks. Assigning both the pipeline and the model to the same device works.
I'm still not sure why setting device_map="auto" only fails for small but not for medium or mini, though.
I have tried on an A10G with the following code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-small-128k-instruct"
model_kwargs = dict(
    use_cache=False,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",  # load the model with flash-attention support
    torch_dtype=torch.bfloat16,
    device_map=None,
)
model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)
assert torch.cuda.is_available(), "This model needs a GPU to run ..."
device = torch.cuda.current_device()
model = model.to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
The code still throws the following error:
AssertionError: Flash Attention is not available, but is needed for dense attention
@hackint0sh Hi there! The inference code (here) assumes that flash-attn is installed. Run pip install flash-attn to fix the error:

$ pip install flash-attn

Cheers!
Doesn't work for me:
Traceback (most recent call last):
File "/home/ubuntu/Multimodal-Uncertainty-Quantification/playground/construct_graph2.py", line 24, in <module>
model = AutoModelForCausalLM.from_pretrained("numind/NuExtract-large", torch_dtype=torch.bfloat16, trust_remote_code=True)
File "/home/ubuntu/miniconda3/envs/llava/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 559, in from_pretrained
return model_class.from_pretrained(
File "/home/ubuntu/miniconda3/envs/llava/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3788, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/numind/NuExtract-large/fc8e001871f4a6be8e6079093b33de334a2316c9/modeling_phi3_small.py", line 903, in __init__
self.model = Phi3SmallModel(config)
File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/numind/NuExtract-large/fc8e001871f4a6be8e6079093b33de334a2316c9/modeling_phi3_small.py", line 745, in __init__
self.layers = nn.ModuleList([Phi3SmallDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)])
File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/numind/NuExtract-large/fc8e001871f4a6be8e6079093b33de334a2316c9/modeling_phi3_small.py", line 745, in <listcomp>
self.layers = nn.ModuleList([Phi3SmallDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)])
File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/numind/NuExtract-large/fc8e001871f4a6be8e6079093b33de334a2316c9/modeling_phi3_small.py", line 651, in __init__
self.self_attn = Phi3SmallSelfAttention(config, layer_idx)
File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/numind/NuExtract-large/fc8e001871f4a6be8e6079093b33de334a2316c9/modeling_phi3_small.py", line 218, in __init__
assert is_flash_attention_available, "Flash Attention is not available, but is needed for dense attention"
AssertionError: Flash Attention is not available, but is needed for dense attention
Hi @tpadhi1! 🤗 The error message is shown when this code block fails, which implies that the following code snippet raises an ImportError in your environment:
import flash_attn
if int(flash_attn.__version__.split('.')[0]) < 2:
    from flash_attn.flash_attn_interface import (
        flash_attn_func,
        flash_attn_unpadded_kvpacked_func as flash_attn_varlen_kvpacked_func,
    )

    # rename `max_seqlen`
    def flash_attn_varlen_qkvpacked_func(qkv, cu_seqlens, max_seqlen, dropout_p=0.0, **kwargs):
        return flash_attn_func(qkv, cu_seqlens, dropout_p=dropout_p, max_s=max_seqlen, **kwargs)
else:
    from flash_attn.flash_attn_interface import (
        flash_attn_varlen_kvpacked_func,
    )
from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input
is_flash_attention_available = True
Can you run the above code? It should raise an exception, which will help you narrow down the root cause.
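For example, a minimal diagnostic sketch (assuming flash-attn 2.x; the try/except wrapper is only there to surface the underlying ImportError instead of the bare AssertionError):

try:
    import flash_attn
    print("flash_attn version:", flash_attn.__version__)
    # These are the imports the modeling code needs for flash-attn >= 2.
    from flash_attn.flash_attn_interface import flash_attn_varlen_kvpacked_func
    from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input
    print("Flash Attention imports succeeded.")
except ImportError as exc:
    print("Flash Attention import failed:", exc)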