Deploying in SageMaker

#44
by alvaropp - opened

Hi there,

I'm experimenting with Dolly and trying to deploy it in SageMaker. The deployment itself works, but I'm struggling to run inference. There seems to be something wrong with the data format I'm passing, but I can't figure out what!

import json

import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel


# %% Deploy new model
role = sagemaker.get_execution_role()
hub = {"HF_MODEL_ID": "databricks/dolly-v2-12b", "HF_TASK": "text-generation"}

# Create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version="4.17.0",
    pytorch_version="1.10.2",
    py_version="py38",
    env=hub,
    role=role,
)

# Deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,  # number of instances
    instance_type="ml.m5.xlarge",  # ec2 instance type
)

predictor.predict({"inputs": "Once upon a time there "})

results in:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
  "code": 400,
  "type": "InternalServerException",
  "message": "\u0027gpt_neox\u0027"
}

I've tried using json strings but no luck either.

Any help appreciated!
Cheers.
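
For reference, the request shape used above matches what the Hugging Face inference container normally accepts: a JSON body with an `inputs` key and an optional `parameters` dict of generation settings. Below is a minimal sketch of building that body and sending it with boto3 instead of the `predictor` object. The helper names `build_payload` and `invoke` are hypothetical, and `max_new_tokens` is only an example parameter.

```python
import json


def build_payload(prompt, **generation_params):
    """Build the JSON request body for a text-generation endpoint."""
    payload = {"inputs": prompt}
    if generation_params:
        payload["parameters"] = generation_params
    return json.dumps(payload)


def invoke(endpoint_name, body):
    """Send the body to a deployed endpoint (requires AWS credentials)."""
    import boto3  # imported lazily so build_payload works without boto3 installed

    client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=body,
    )
    return json.loads(response["Body"].read())


body = build_payload("Once upon a time there ", max_new_tokens=64)
print(body)
```

If the body is well formed and the endpoint still returns a 400, the cause is more likely on the container side than in the data format.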

Databricks org

That's really a question for HF / SageMaker; it doesn't look related to this model per se.

Hi there, you've got a few items to unpack here. First, you want to point to a more recent version of the transformers SDK, ideally one that has support for all of the model objects needed for dolly.

Second, this is a 12B parameter model. That means you are likely going to need more than one accelerator to host it. I'm testing this out on my end now, and will report back soon what seems to be the smallest number of accelerators. If you're compiling it, you need fewer.

Third, I would point to a hosting instance that uses accelerators, either inferentia (inf1) or NVIDIA (g's or p's).
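
As a rough way to see why the instance size matters, here is a back-of-the-envelope sketch of the memory needed just to hold the weights. It assumes 12 billion parameters as a round figure and ignores activations, the KV cache, and framework overhead, so real usage will be higher. The function name `weight_memory_gib` is hypothetical.

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype):
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in ("float32", "bfloat16", "int8"):
    print(f"{dtype}: {weight_memory_gib(12e9, dtype):.1f} GiB")
```

Comparing these numbers against the accelerator memory of a candidate instance gives a quick first check before deploying.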

I'll respond in a bit with more concrete guidance. In the meantime, Phillip has some great examples of doing this end-to-end here!

Error [ModelError]: Received client error (400) from primary with message "{ "code": 400, "type": "InternalServerException", "message": "\u0027gpt_neox\u0027" }

I got the same error when trying to run inference after deploying it as a SageMaker endpoint. I was trying to find out what went wrong and stumbled across this. I'm using dolly-v2-3b rather than 12b.

Databricks org

From a quick Google search, it looks like this might be the issue (you need to tell it to use a newer transformers version or something): https://towardsdatascience.com/unlock-the-latest-transformer-models-with-amazon-sagemaker-7fe65130d993
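
A small sketch of how to read the 400 response: the `message` field is JSON-escaped, and once decoded it is just `'gpt_neox'`, which is exactly how Python prints a `KeyError` for that key. That would be consistent with the container's transformers version not recognising the `gpt_neox` model type that Dolly v2 uses, though this is an inference from the message and not something confirmed from the container logs.

```python
import json

# The error body as returned by the endpoint (raw string keeps the \u0027 escapes).
raw = r'{"code": 400, "type": "InternalServerException", "message": "\u0027gpt_neox\u0027"}'

error = json.loads(raw)
print(error["message"])

# A KeyError for the key "gpt_neox" renders the same way.
print(str(KeyError("gpt_neox")) == error["message"])
```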

If I change my deployment configuration to use newer transformers/PyTorch/Python versions:

huggingface_model = HuggingFaceModel(
    transformers_version='4.26.0',
    pytorch_version='1.13.1',
    py_version='py39',
    env=hub,
    role=role,
)

I get a new error: Load model failed: databricks__dolly-v2-3b, error: Worker died.

Databricks org

As above. Likely you are provisioning something too small for the model.

Thanks for the responses.

I've been playing with EC2 directly—no SageMaker—and dolly-v2-12b runs fine on a p3.2xlarge instance (quick enough for my experiments, anyway!) running the following script:

import torch
from langchain import PromptTemplate, LLMChain
from langchain.llms import HuggingFacePipeline
from transformers import pipeline


print("Loading Dolly...")

generate_text = pipeline(
    model="databricks/dolly-v2-12b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    return_full_text=True,
)


print("Prompting Dolly...")

# template for an instruction with input
prompt_with_context = PromptTemplate(
    input_variables=["instruction", "context"],
    template="{instruction}\n\nInput:\n{context}",
)

hf_pipeline = HuggingFacePipeline(pipeline=generate_text)

llm_context_chain = LLMChain(llm=hf_pipeline, prompt=prompt_with_context)

context = """George Washington (February 22, 1732 - December 14, 1799) was an American military officer, statesman,
and Founding Father who served as the first president of the United States from 1789 to 1797."""

print(
    llm_context_chain.predict(
        instruction="When was George Washington president?", context=context
    ).lstrip()
)

Now, back to SageMaker: I've then updated dependency versions as per the comments above, and I'm now getting a new error regarding running out of disk space. I'm using the following code:

import json

import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()
hub = {
    'HF_MODEL_ID':'databricks/dolly-v2-12b',
    'HF_TASK':'text-generation'
}

# Create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.26.0',
    pytorch_version='1.13.1',
    py_version='py39',
    env=hub,
    role=role,
)

# Deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.p3.2xlarge",
    volume_size=512,
)

This instance should have 512GB of storage, more than enough for dolly-v2-12b so not sure what's going on.

Cheers!

I am trying to deploy dolly-v2-12b to SageMaker. When I try to run inference, I run into the errors below.

from sagemaker.huggingface import HuggingFaceModel
import sagemaker

role = sagemaker.get_execution_role()
hub = {
    'HF_MODEL_ID': 'databricks/dolly-v2-12b',
    'HF_TASK': 'text-generation',
}

huggingface_model = HuggingFaceModel(
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    env=hub,
    role=role,
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge',
)

sample_input = {
    'inputs': 'Can you please let us know more details about your'
}

output = predictor.predict(sample_input)
print(output)

This leads to:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
  "code": 400,
  "type": "InternalServerException",
  "message": "\u0027gpt_neox\u0027"
}

With newer transformers, PyTorch, and Python versions:

from sagemaker.huggingface import HuggingFaceModel
import sagemaker

role = sagemaker.get_execution_role()
hub = {
    'HF_MODEL_ID': 'databricks/dolly-v2-12b',
    'HF_TASK': 'text-generation',
}

huggingface_model = HuggingFaceModel(
    transformers_version='4.26.0',
    pytorch_version='1.13.1',
    py_version='py39',
    env=hub,
    role=role,
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge',
)

sample_input = {
    'inputs': 'Can you please let us know more details about your'
}

output = predictor.predict(sample_input)
print(output)

This leads to:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
  "code": 400,
  "type": "InternalServerException",
  "message": "Loading this pipeline requires you to execute the code in the pipeline file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option trust_remote_code\u003dTrue to remove this error."
}

I am not sure what is missing.

Any help appreciated!

Databricks org

Again, your hardware is far too small for this model. An m5.xlarge doesn't even have a GPU. See above.
That isn't the cause of this particular error, though. I'm not sure anyone here has figured out how to set trust_remote_code=True, which is needed to load the model's pipeline, in the SageMaker integration.

srowen changed discussion status to closed

I was able to set trust_remote_code=True by overriding the default method for loading a model, following the documentation here: https://huggingface.co/docs/sagemaker/inference#user-defined-code-and-modules

I created an inference.py with the following code:

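import torch
from transformers import pipeline


def model_fn(model_dir):
    """
    Overrides the default model load function in the HuggingFace Deep Learning Container
    """
    # model_dir is unused here: the weights are downloaded from the Hub by model ID
    # rather than read from the unpacked model.tar.gz.
    instruct_pipeline = pipeline(
        model="databricks/dolly-v2-3b",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,  # needed to load Dolly's custom pipeline code
        device_map="auto",  # relies on the accelerate package
    )
    return instruct_pipeline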

and requirements.txt with:

accelerate==0.18.0

Then I followed the instructions here for creating a model artifact and uploaded it to S3. Then you can deploy an endpoint with:

import sagemaker
from sagemaker.huggingface.model import HuggingFaceModel

role = sagemaker.get_execution_role()  # IAM role with permissions to create an endpoint

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data="s3://your_bucket/your_dolly_path/model.tar.gz",  # path to the model artifact in S3
    role=role,
    transformers_version="4.26.0",  # Transformers version used
    pytorch_version="1.13.1",  # PyTorch version used
    py_version="py39",  # Python version used
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.4xlarge",
)

Note: I tested this with the databricks/dolly-v2-3b model, so the ml.g5.4xlarge may not be enough for the larger models
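
For the model artifact step mentioned above, here is a minimal sketch of packaging the two files with Python's `tarfile`. It assumes the layout described in the Hugging Face SageMaker docs, where custom code lives under a `code/` directory inside `model.tar.gz`. Because the `inference.py` above downloads the weights from the Hub, the archive in this sketch contains no model files. The function name `build_model_archive` is hypothetical, and the upload to S3 is not shown.

```python
import tarfile
from pathlib import Path


def build_model_archive(source_dir, archive_path="model.tar.gz"):
    """Pack inference.py and requirements.txt under code/ in a gzipped tarball."""
    source_dir = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in ("inference.py", "requirements.txt"):
            # arcname controls the path inside the archive.
            tar.add(source_dir / name, arcname=f"code/{name}")
    return archive_path
```

Calling `build_model_archive(".")` from the directory holding both files would produce a `model.tar.gz` ready to upload.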

Here's a gist showing a working method for deploying the dolly-v2-12b model on a g5.4xlarge instance.

https://gist.github.com/timesler/4b244a6b73d6e02d17fd220fd92dfaec

@alvaropp I believe the issue with running out of disk space was because the 512GB disk mount on SageMaker is at /home/ec2-user/SageMaker, but HuggingFace libraries default to storing files in a cache at /home/ec2-user/.cache/.... The solution is to set the HF_HOME env var to a location under /home/ec2-user/SageMaker. Importantly, if you set the env var in Python, do it before importing any HuggingFace libraries so that it actually gets used. I've included that in the linked gist.
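
A minimal sketch of the import-order point, separate from the gist: set `HF_HOME` first, then import the Hugging Face libraries. The helper name `set_hf_home` and the `hf_cache` subdirectory are hypothetical; any location under the large mount should do.

```python
import os
import sys
import warnings


def set_hf_home(cache_dir):
    """Point the Hugging Face cache at cache_dir; call before importing HF libraries."""
    loaded = [m for m in ("transformers", "huggingface_hub") if m in sys.modules]
    if loaded:
        # The libraries read HF_HOME at import time, so a late change may be ignored.
        warnings.warn(f"HF_HOME set after importing {loaded}; it may not take effect")
    os.environ["HF_HOME"] = cache_dir
    return cache_dir


set_hf_home("/home/ec2-user/SageMaker/hf_cache")  # hypothetical subdirectory

# Only now import the Hugging Face libraries, for example:
# from transformers import pipeline
```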

To get the 12b model running on a g5.4xlarge instance, I think you'll also need to set load_in_8bit to True.
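
One possible way to pass `load_in_8bit` through the transformers `pipeline` helper is sketched below. This is an assumption about how it could be wired up, not the code from the gist: it forwards the flag via `model_kwargs`, and 8-bit loading additionally needs the `bitsandbytes` and `accelerate` packages and a GPU. The function names `dolly_pipeline_kwargs` and `load_dolly` are hypothetical.

```python
def dolly_pipeline_kwargs(model_id="databricks/dolly-v2-12b", eight_bit=True):
    """Collect keyword arguments for transformers.pipeline()."""
    kwargs = {"model": model_id, "trust_remote_code": True, "device_map": "auto"}
    if eight_bit:
        # Forwarded to from_pretrained(); quantises the weights to int8 on load.
        kwargs["model_kwargs"] = {"load_in_8bit": True}
    else:
        import torch  # imported lazily; only needed for the bfloat16 path

        kwargs["torch_dtype"] = torch.bfloat16
    return kwargs


def load_dolly(**overrides):
    """Build the pipeline (downloads the model; needs a suitable GPU)."""
    from transformers import pipeline

    return pipeline(**dolly_pipeline_kwargs(**overrides))
```

Inside a SageMaker `model_fn`, `load_dolly()` would take the place of the direct `pipeline(...)` call.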

@timesler , @janeth8 many thanks for the response, that makes sense!

Right, so I've followed @timesler 's instructions and I'm running into the following error, which seems to be some sort of overflow:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
  "code": 400,
  "type": "InternalServerException",
  "message": "probability tensor contains either `inf`, `nan` or element \u003c 0"
}

I'm using a ml.p3.8xlarge instance, which is perfectly capable of running dolly-v2-12b in my experiments using EC2 directly, without SageMaker.
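
To illustrate the kind of overflow that can produce this message (not a diagnosis of this particular endpoint), here is a small NumPy sketch. A logit that fits in float32 but exceeds the float16 maximum of 65504 becomes `inf` after the cast, and a softmax over it yields `nan` probabilities, which a sampler would then reject. The logit values are made up for the demonstration.

```python
import numpy as np


def softmax(logits):
    """Standard softmax: subtract the max, exponentiate, normalise."""
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)


with np.errstate(all="ignore"):  # silence the overflow/invalid warnings
    logits32 = np.array([1.0, 2.0, 70000.0], dtype=np.float32)
    logits16 = logits32.astype(np.float16)  # 70000 does not fit in float16
    probs32 = softmax(logits32)
    probs16 = softmax(logits16)

print(np.isinf(logits16).any())  # the large logit overflowed in float16
print(np.isnan(probs16).any())   # and the resulting probabilities are invalid
```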

That's great, thanks!

After a bit of trial and error, I noticed that @timesler 's code (https://gist.github.com/timesler/4b244a6b73d6e02d17fd220fd92dfaec) works perfectly fine as well.

I'm not 100% sure of why it works on g5.4xlarge and not on ml.p3.8xlarge—they seem to have similar specs!
