"Could not find a matching NEFF for your HLO in this directory" when trying to load precompiled Neuron artifacts

#2 by luuksuurmeijer

I want to run Llama2 for inference on an inf2.xlarge instance (I do not have access to larger instances). Since my instance runs out of RAM when compiling the model, I want to use the precompiled artifacts that come with this repo. My Neuron version is 2.16.1. I have downloaded the model to the instance and I am running the following code:

import torch
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForCausalLM


print("Loading model...")
NEURON_MODEL = LlamaForCausalLM.from_pretrained(
    "Llama-2-7b-chat-hf-seqlen-2048-bs-1/checkpoint", batch_size=1, tp_degree=2, amp="f16"
)
print("Loading Neuron Artifacts")
NEURON_MODEL.load(
    "Llama-2-7b-chat-hf-seqlen-2048-bs-1/compiled"
)  # Load the compiled Neuron artifacts
NEURON_MODEL.to_neuron()

# construct a tokenizer and encode prompt text
print("Constructing tokenizer...")
TOKENIZER = AutoTokenizer.from_pretrained("Llama-2-7b-chat-hf-seqlen-2048-bs-1")
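
For completeness, the rest of the script just encodes a prompt and samples from the model, along these lines (a sketch assuming the usual transformers_neuronx sample() entry point; the exact method and arguments may vary by version and model class):

# Encode a prompt and generate on the Neuron device
# (sketch: sample() is assumed to be available, as on the *ForSampling classes)
prompt = "What is the capital of France?"
input_ids = TOKENIZER(prompt, return_tensors="pt").input_ids
with torch.inference_mode():
    generated = NEURON_MODEL.sample(input_ids, sequence_length=2048, top_k=50)
print(TOKENIZER.decode(generated[0], skip_special_tokens=True))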

I can see that the model is being loaded on the device with neuron-top, but after a while it crashes with the following error:

    NEURON_MODEL.to_neuron()
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/base.py", line 62, in to_neuron
    self._load_compiled_artifacts(self._compiled_artifacts_directory)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/base.py", line 108, in _load_compiled_artifacts
    nbs_obj.set_neff_bytes(directory)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/base.py", line 387, in set_neff_bytes
    raise FileNotFoundError(('Could not find a matching NEFF for your HLO in this directory. '
FileNotFoundError: Could not find a matching NEFF for your HLO in this directory. Ensure that the model you are trying to load is the same type and has the same parameters as the one you saved or call "save" on this model to reserialize it.

I am not sure what is going on here, what am I doing wrong?

AWS Inferentia and Trainium org

This is a model serialized for optimum-neuron. It is not intended to be used with transformers_neuronx directly, although the contents of the checkpoint and compiled directories are indeed compatible with that package.
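
With optimum-neuron, loading would look roughly like this (a minimal sketch; I am assuming this repo's Hub id is aws-neuron/Llama-2-7b-chat-hf-seqlen-2048-bs-1):

from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

# Load the pre-exported model; optimum-neuron picks up the serialized
# checkpoint and compiled artifacts from the repo.
model = NeuronModelForCausalLM.from_pretrained("aws-neuron/Llama-2-7b-chat-hf-seqlen-2048-bs-1")
tokenizer = AutoTokenizer.from_pretrained("aws-neuron/Llama-2-7b-chat-hf-seqlen-2048-bs-1")

inputs = tokenizer("What is deep learning?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])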

AWS Inferentia and Trainium org

Your sequence of calls seems correct, but unfortunately the compiled artifacts are for AWS Neuron SDK 2.15 only:
https://github.com/aws-neuron/transformers-neuronx/issues/78

AWS Inferentia and Trainium org

You can use https://huggingface.co/aws-neuron/Llama-2-7b-hf-neuron-latency instead, which has compiled artifacts compatible with 2.16.0.
Alternatively, using optimum-neuron, you can still export the model without needing to recompile it, thanks to https://huggingface.co/aws-neuron/optimum-neuron-cache.
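
For reference, the export path looks roughly like this (a sketch; I am assuming the meta-llama/Llama-2-7b-chat-hf checkpoint, shapes matching this repo, and num_cores=2 for the two NeuronCores of an inf2.xlarge):

from optimum.neuron import NeuronModelForCausalLM

# Export with shapes that match a cached configuration: the compiled
# artifacts are then fetched from the optimum-neuron cache repo instead
# of being recompiled locally.
model = NeuronModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    export=True,
    batch_size=1,
    sequence_length=2048,
    num_cores=2,
    auto_cast_type="fp16",
)
model.save_pretrained("llama-2-7b-chat-neuron")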

Thanks for your quick answer! I have tried running the same code with the model you mention (https://huggingface.co/aws-neuron/Llama-2-7b-hf-neuron-latency), together with a checkpoint from meta-llama/Llama-2-7b-chat-hf (since that repo does not seem to include checkpoint weights), but I am still running into issues.

    NEURON_MODEL.to_neuron()
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/base.py", line 60, in to_neuron
    self.load_weights()
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/llama/model.py", line 84, in load_weights
    layer.materialize()
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/module.py", line 71, in materialize
    param.copy_(input_param)
NotImplementedError: Cannot copy out of meta tensor; no data!

Is the model you mention compiled for meta-llama/Llama-2-7b-chat-hf or meta-llama/Llama-2-7b-hf? The model card mentions both, and I suspect my issue might be related to that.
