Unable to successfully compile the model meta-llama/Llama-2-7b-chat-hf on Inf2 instance

#1
by WaelDataReply - opened

I tried to compile the same model (meta-llama/Llama-2-7b-chat-hf), using the same configuration proposed here, on an EC2 instance of type Inf2 with the Hugging Face Neuron Deep Learning AMI: sequence_length: 2048; batch_size: 2; neuron: 2.15.0.
Unfortunately, I consistently encountered the error: "AttributeError: type object 'LLamaNeuronConfig' does not have the attribute 'get_mandatory_axes_for_task'." Even after specifying the task as text-generation, the issue persisted.
Is there an additional configuration required for compiling the model?

AWS Inferentia and Trainium org

Can you post the exact command you used? There is a specific procedure for text-generation: https://huggingface.co/docs/optimum-neuron/guides/models#generative-nlp-models

I attempted to compile the model using two methods as outlined in the text-generation procedure provided here (https://huggingface.co/docs/optimum-neuron/guides/models#generative-nlp-models):

1- Executing the following code:

from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

# model id you want to compile
model_id = "meta-llama/Llama-2-7b-chat-hf"

# configs for compiling the model
compiler_args = {"num_cores": 2, "auto_cast_type": "fp16"}
input_shapes = {
    "sequence_length": 2048,  # max length to generate
    "batch_size": 2,          # batch size for the model
}

llm = NeuronModelForCausalLM.from_pretrained(model_id, export=True, **input_shapes, **compiler_args)
#tokenizer = AutoTokenizer.from_pretrained(model_id)

# Save locally or upload to the Hugging Face Hub
save_directory = "llama_neuron"
llm.save_pretrained(save_directory)
#tokenizer.save_pretrained(save_directory)

2- Running the following commands:

optimum-cli export neuron --model meta-llama/Llama-2-7b-chat-hf --batch_size 1 --sequence_length 4096 llama2-7B-compiled/
optimum-cli export neuron --model meta-llama/Llama-2-7b-chat-hf --task text-generation --batch_size 1 --sequence_length 4096 llama2-7B-compiled/

AWS Inferentia and Trainium org

The second method (the optimum-cli export) is not supported for text-generation, which is why you get the error you mentioned.
The first one should work: what kind of error do you get?

AWS Inferentia and Trainium org

@WaelDataReply
You say that you are using the Hugging Face DLAMI, but you also list Neuron 2.15.0. The latest HF DLAMI (as of the end of December) includes Neuron 2.16.0. It also updates some of the other libraries, such as transformers-neuronx. Even though the earlier version may say that it supports Llama 7B, there is a distinction between training support and inference support.

Redeploy your system and use the latest AMI version. I recommend an inf2.8xlarge, as the inf2.xlarge can sometimes run out of RAM during compilation. Also, increase the default drive space to at least 200 GB for compilation (rough launch sketch below).
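
If it helps, here is a rough sketch of launching such an instance with boto3; the AMI ID, key pair name, and root device name below are placeholders to fill in for your region and AMI, not values from this thread:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",  # placeholder: latest HF Neuron DLAMI ID for your region
    InstanceType="inf2.8xlarge",
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",            # placeholder key pair name
    BlockDeviceMappings=[
        {
            "DeviceName": "/dev/sda1",  # root device name; check your AMI, it may be /dev/xvda
            "Ebs": {"VolumeSize": 200, "VolumeType": "gp3"},  # at least 200 GB for compilation
        }
    ],
)
print(response["Instances"][0]["InstanceId"])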

I'm running through your code now on the latest HF DLAMI, and it is currently compiling the NEFFs. No errors yet.

Jim

AWS Inferentia and Trainium org

@WaelDataReply
FYI, compilation completed successfully on the latest version of the HF DLAMI. I followed the rest of your code to save it out (along with the tokenizer).

I then quit() python and started it again (to release the Neuron cores). I then successfully ran the model with:

>>> from optimum.neuron import pipeline
>>> pipe = pipeline("text-generation", "llama_neuron")
>>> messages = [
...     {"role": "user", "content": "What is 2+2?"},
... ]
>>> prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
>>> # Run generation
>>> outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
Both `max_new_tokens` (=256) and `max_length`(=4096) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Inputs will be padded to match the model static batch size. This will increase latency.
2024-Jan-08 15:56:10.0912 6318:6500 [1] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2024-Jan-08 15:56:10.0912 6318:6500 [1] init.cc:138 CCOM WARN OFI plugin initNet() failed is EFA enabled?
>>> print(outputs[0]["generated_text"])
<s>[INST] What is 2+2? [/INST]  The answer to 2+2 is 4.
>>>
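
For completeness, the saved model can also be loaded and run directly, without the pipeline wrapper; a minimal sketch, assuming the tokenizer was saved into the same llama_neuron directory as described above:

from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

# reload the already-compiled model (no export=True needed this time)
llm = NeuronModelForCausalLM.from_pretrained("llama_neuron")
tokenizer = AutoTokenizer.from_pretrained("llama_neuron")

inputs = tokenizer("What is 2+2?", return_tensors="pt")
outputs = llm.generate(**inputs, max_new_tokens=32)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])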

I found that using Neuron version 2.16.0 on an inf2.48xlarge instance, instead of version 2.15.0, was essential for successfully compiling the model. The updated configuration resolved the issue, and the model has been compiled and pushed to HF.
Thanks for the help.
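
In case it is useful to others, a minimal sketch of pushing the compiled artifacts to the Hub with huggingface_hub; the repository id below is a placeholder, not the actual repo mentioned above:

from huggingface_hub import HfApi

api = HfApi()
# placeholder repo id; requires being logged in (e.g. via `huggingface-cli login`)
repo_id = "your-username/llama-2-7b-chat-hf-neuron"
api.create_repo(repo_id, exist_ok=True)
api.upload_folder(folder_path="llama_neuron", repo_id=repo_id)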
