"Could not find a matching NEFF for your HLO in this directory" when trying to load precompiled Neuron artifacts
I want to run Llama 2 for inference on an inf2.xlarge instance (I do not have access to larger instances). Since my instance runs out of RAM when compiling the model, I want to use the precompiled artifacts that come with this repo. My Neuron version is 2.16.1. I have downloaded the models to the instance and I am running the following code:
import torch
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForCausalLM
print("Loading model...")
NEURON_MODEL = LlamaForCausalLM.from_pretrained(
"Llama-2-7b-chat-hf-seqlen-2048-bs-1/checkpoint", batch_size=1, tp_degree=2, amp="f16"
)
print("Loading Neuron Artifacts")
NEURON_MODEL.load(
"Llama-2-7b-chat-hf-seqlen-2048-bs-1/compiled"
) # Load the compiled Neuron artifacts
NEURON_MODEL.to_neuron()
# Construct the tokenizer
print("Constructing tokenizer...")
TOKENIZER = AutoTokenizer.from_pretrained("Llama-2-7b-chat-hf-seqlen-2048-bs-1")
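Once loading succeeds, I plan to run generation roughly like this (just a sketch, assuming the standard transformers_neuronx sample API; the prompt and sequence length are placeholders):

# Rough sketch of the generation step I plan to run once loading works
# (assumes the transformers_neuronx `sample` generation entry point)
prompt = "Hello, how are you?"
input_ids = TOKENIZER.encode(prompt, return_tensors="pt")
with torch.inference_mode():
    generated = NEURON_MODEL.sample(input_ids, sequence_length=2048)
print(TOKENIZER.decode(generated[0], skip_special_tokens=True))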
I can see with neuron-top that the model is being loaded onto the device, but after a while it crashes with the following error:
    NEURON_MODEL.to_neuron()
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/base.py", line 62, in to_neuron
    self._load_compiled_artifacts(self._compiled_artifacts_directory)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/base.py", line 108, in _load_compiled_artifacts
    nbs_obj.set_neff_bytes(directory)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/base.py", line 387, in set_neff_bytes
    raise FileNotFoundError(('Could not find a matching NEFF for your HLO in this directory. '
FileNotFoundError: Could not find a matching NEFF for your HLO in this directory. Ensure that the model you are trying to load is the same type and has the same parameters as the one you saved or call "save" on this model to reserialize it.
I am not sure what is going on here. What am I doing wrong?
This is a model serialized for optimum-neuron. It is not intended to be used with transformers_neuronx directly, although the contents of the checkpoint and compiled directories are indeed compatible with that package.
Your sequence of calls seems correct, but unfortunately the compiled artifacts are for AWS Neuron SDK 2.15 only:
https://github.com/aws-neuron/transformers-neuronx/issues/78
You can use https://huggingface.co/aws-neuron/Llama-2-7b-hf-neuron-latency instead, which has compiled artifacts compatible with 2.16.0.
Alternatively, using optimum-neuron, you can still export the model without needing to recompile it, thanks to https://huggingface.co/aws-neuron/optimum-neuron-cache.
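For reference, the optimum-neuron path would look roughly like this (a minimal sketch; the batch_size, sequence_length, num_cores, and auto_cast_type values are assumptions you may need to adjust for an inf2.xlarge):

from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

# Export through optimum-neuron; if matching compiled artifacts exist in the
# public optimum-neuron-cache, they are fetched from the Hub instead of being
# recompiled locally. The compiler arguments below are assumptions, not
# required settings.
model = NeuronModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    export=True,
    batch_size=1,
    sequence_length=2048,
    num_cores=2,
    auto_cast_type="f16",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")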
Thanks for your quick answer! I have tried running the same code with the model you mention (https://huggingface.co/aws-neuron/Llama-2-7b-hf-neuron-latency) and a checkpoint from meta-llama/Llama-2-7b-chat-hf (since this model does not seem to include checkpoints), but I am still running into issues.
    NEURON_MODEL.to_neuron()
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/base.py", line 60, in to_neuron
    self.load_weights()
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/llama/model.py", line 84, in load_weights
    layer.materialize()
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/module.py", line 71, in materialize
    param.copy_(input_param)
NotImplementedError: Cannot copy out of meta tensor; no data!
Is the model you mention compiled for meta-llama/Llama-2-7b-chat-hf or meta-llama/Llama-2-7b-hf? The model card mentions both, and I suspect my error might be related to that.