Deploying on Amazon SageMaker

#14
by ivoschaper - opened

I'm trying to run this model with Amazon SageMaker and am able to deploy it successfully on an ml.m5.xlarge instance.

Unfortunately, when invoking the endpoint I get a PredictionException which says: Could not load model /.sagemaker/mms/models/google__flan-t5-xxl with any of the following classes: (transformers.models.auto.modeling_auto.AutoModelForSeq2SeqLM, transformers.models.t5.modeling_t5.T5ForConditionalGeneration).

Has anyone had this issue or, even better, been able to deploy and use this model via SageMaker?

Interestingly, when I deploy and invoke the endpoint for the flan-t5-large model (on the same ml.m5.xlarge instance) I don't face any issues. I strongly suspect this is related to the model size, so I tried the largest instance SageMaker offers (and that is available to me), an ml.p3.8xlarge, yet I still face the same issue.
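
For context, flan-t5-xxl has roughly 11 billion parameters, so the weights alone take on the order of 40 GiB in fp32 (about 20 GiB in fp16), while an ml.m5.xlarge only has 16 GiB of RAM. A rough back-of-envelope check (the parameter count is approximate):

# Rough memory estimate for flan-t5-xxl (assuming ~11B parameters)
num_params = 11e9
print(f"fp32 weights: ~{num_params * 4 / 1024**3:.0f} GiB")  # ~41 GiB
print(f"fp16 weights: ~{num_params * 2 / 1024**3:.0f} GiB")  # ~20 GiB
# ml.m5.xlarge: 16 GiB RAM; ml.p3.8xlarge: 244 GiB RAM, 4x V100 with 16 GiB GPU memory each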

My code for deployment looks like this (very similar to what HuggingFace provides, but not exactly the same):

from sagemaker.huggingface import HuggingFaceModel
import boto3
import os

# AWS credentials are read from the environment; boto3 also picks them up automatically
AWS_ACCESS_KEY_ID = os.environ["AWS_ACCESS_KEY_ID"]
AWS_SECRET_ACCESS_KEY = os.environ["AWS_SECRET_ACCESS_KEY"]


# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID':'google/flan-t5-xxl',
    'HF_TASK':'text2text-generation'
}

iam_client = boto3.client('iam')

# IAM role
role = iam_client.get_role(RoleName='my-role-with-sagemaker-access')['Role']['Arn']

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.17.0', 
    pytorch_version='1.10.2',
    py_version='py38',
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1, # number of instances
    instance_type='ml.m5.xlarge' # ec2 instance type
)
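
For completeness, the Hugging Face docs also show a variant that deploys from a model archive in S3 via model_data instead of downloading through HF_MODEL_ID, which is sometimes suggested for very large checkpoints. I haven't tried it, and the S3 path below is just a placeholder:

# Sketch only: assumes the repo has already been packaged as model.tar.gz
# and uploaded to this (hypothetical) S3 location.
huggingface_model_s3 = HuggingFaceModel(
    model_data='s3://my-bucket/flan-t5-xxl/model.tar.gz',  # placeholder path
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    env={'HF_TASK': 'text2text-generation'},
    role=role,
)

predictor_s3 = huggingface_model_s3.deploy(
    initial_instance_count=1,
    instance_type='ml.p3.8xlarge'  # pick an instance based on the memory estimate above
)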

And my invocation looks like this:

import boto3
from sagemaker.serializers import JSONSerializer
import json

client = boto3.client('sagemaker-runtime')

endpoint_name = "huggingface-pytorch-inference-XXXXXXXXX"

# The MIME type of the input data in the request body.
content_type = "application/json"

# The desired MIME type of the inference in the response.
accept = "application/json"

# Payload for inference.
payload = {
    "inputs": "The capital of Germany is",
    "parameters": {
        "temperature": 0.7,
    },
    "options": {
        "use_cache": False,
    },
} 

response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType=content_type,
    Accept=accept,
    Body=JSONSerializer().serialize(payload)
)

pred = json.loads(response['Body'].read())
print(pred)

prediction = pred[0]['generated_text']
print(prediction)
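
As a sanity check outside SageMaker, it may also help to load the checkpoint directly with one of the classes named in the error, to rule out a problem with the checkpoint itself. A minimal sketch, assuming a machine with enough RAM (roughly 45 GB for the fp32 weights) and a recent transformers release:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load tokenizer and model with the same class the container tries to use
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xxl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl")

inputs = tokenizer("The capital of Germany is", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))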

I'm encountering the same issue. Did you find a workaround?

Facing the same issue. I've raised a GitHub issue here:
https://github.com/huggingface/transformers/issues/21402
