Issues with deployment from SageMaker script

#1
by alexspasov - opened

Hi,
I am trying to deploy this model on the recommended instance using the modified script below, because the provided script did not work at all and failed with errors such as: RuntimeError: weight transformer.h.0.self_attention.query_key_value.weight does not exist

Here is the modified script, which got me "somewhere" after I tried to resolve the errors via Google:

# Quote the version specifiers so the shell does not treat ">" as an output redirect
!pip3 install "transformers>=4.33.0" "optimum>=1.12.0"
!pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/  # Use cu117 if on CUDA 11.7
!pip3 install -U sagemaker

import json
import sagemaker
import boto3
import torch
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'TheBloke/Falcon-180B-GPTQ',
    'SM_NUM_GPUS': json.dumps(1),
    'HF_HOME': '/tmp',
    'HF_MODEL_QUANTIZE': 'gptq',
    'CUDA_LAUNCH_BLOCKING': '1'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="1.0.3"),
    env=hub,
    role=role
)

# Clear GPU memory after prediction
torch.cuda.empty_cache()

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=1000,
)

# send request
predictor.predict({
    "inputs": "My name is Julien and I like to",
})

I am now receiving this error:

RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
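
One sanity check that may be relevant here is a rough estimate of whether the quantized weights could fit on the instance's GPU at all. The sketch below is only back-of-the-envelope arithmetic and rests on assumptions that are not stated in the script above: roughly 180 billion parameters, 4-bit GPTQ weights, and 24 GiB of memory on the single GPU of an ml.g5.2xlarge. It ignores activations, the KV cache, and quantization metadata, so the real footprint would be larger. The helper name estimate_weight_gib is hypothetical.

```python
def estimate_weight_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights alone, in GiB."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1024**3


# All three values are assumptions, not taken from the deployment script.
NUM_PARAMS = 180e9       # assumed parameter count for Falcon-180B
BITS_PER_WEIGHT = 4      # assumed GPTQ quantization width
GPU_MEMORY_GIB = 24      # assumed memory of the single GPU on ml.g5.2xlarge

weights_gib = estimate_weight_gib(NUM_PARAMS, BITS_PER_WEIGHT)
fits = weights_gib < GPU_MEMORY_GIB

print(f"Estimated weights: ~{weights_gib:.0f} GiB")
print(f"GPU memory:        {GPU_MEMORY_GIB} GiB")
print(f"Weights fit on one GPU: {fits}")
```

If the estimate comes out well above the available GPU memory, the instance type and the SM_NUM_GPUS setting are probably worth revisiting before digging further into the CUDA error itself.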

This is the point where I keep going in circles and cannot get out. Any advice or help is welcome! Thank you all in advance!
