Deploy models to Amazon SageMaker

Here's how easy it is to use the new SageMaker Hugging Face Inference Toolkit to deploy πŸ€— Transformers models in SageMaker:

from sagemaker.huggingface import HuggingFaceModel

# create Hugging Face Model Class and deploy it as SageMaker Endpoint
huggingface_model = HuggingFaceModel(...).deploy()

Overview

SageMaker Hugging Face Inference Toolkit

In addition to the Hugging Face Inference Deep Learning Containers, we created a new Inference Toolkit for SageMaker. This new Inference Toolkit leverages the pipelines from the transformers library to allow zero-code deployments of models, without requiring any code for pre- or post-processing. In the "Getting Started" section below, you will find two examples of how to deploy your models to Amazon SageMaker.

In addition to zero-code deployment, the Inference Toolkit supports "bring your own code" methods, where you can override the default methods. You can learn more about "bring your own code" in the documentation here, or you can check out the sample notebook "deploy custom inference code to Amazon SageMaker".

Inference Toolkit - API Description

Using the transformers pipelines, we designed an API that makes it easy for you to benefit from all pipeline features. The API has an interface similar to the πŸ€— Accelerated Inference API hosted service: your inputs need to be defined in the inputs key, and additional supported pipeline parameters can be added in the parameters key. You can provide any supported kwargs of your transformers pipeline as parameters. Below, you can find request examples.

text-classification, sentiment-analysis, token-classification, feature-extraction, fill-mask, summarization, translation_xx_to_yy, text2text-generation, text-generation

{
  "inputs": "Camera - You are awarded a SiPix Digital Camera! call 09061221066 fromm landline. Delivery within 28 days."
}

question-answering

{
  "inputs": {
    "question": "What is used for inference?",
    "context": "My Name is Philipp and I live in Nuremberg. This model is used with sagemaker for inference."
  }
}

zero-shot-classification

{
  "inputs": "Hi, I recently bought a device from your company but it is not working as advertised and I would like to get reimbursed!",
  "parameters": {
    "candidate_labels": ["refund", "legal", "faq"]
  }
}

table-question-answering

{
  "inputs": {
    "query": "How many stars does the transformers repository have?",
    "table": {
      "Repository": ["Transformers", "Datasets", "Tokenizers"],
      "Stars": ["36542", "4512", "3934"],
      "Contributors": ["651", "77", "34"],
      "Programming language": ["Python", "Python", "Rust, Python and NodeJS"]
    }
  }
}

parameterized-request

{
    "inputs": "Hugging Face, the winner of VentureBeat’s Innovation in Natural Language Process/Understanding Award for 2021, is looking to level the playing field. The team, launched by ClΓ©ment Delangue and Julien Chaumond in 2016, was recognized for its work in democratizing NLP, the global market value for which is expected to hit $35.1 billion by 2026. This week, Google’s former head of Ethical AI Margaret Mitchell joined the team.",
    "paramters": {
        "repetition_penalty": 4.0,
        "length_penalty": 1.5
    }
}

Setup & Installation

Before you can deploy a πŸ€— Transformers model to Amazon SageMaker, you need to sign up for an AWS account. If you do not have an AWS account yet, learn more here.

After you complete these tasks, you can get started using either SageMaker Studio, SageMaker Notebook Instances, or a local environment. To deploy from your local machine, you need to configure the right IAM permissions.

Upgrade to the latest sagemaker version.

pip install sagemaker --upgrade

SageMaker environment

Note: The execution role is intended to be available only when running a notebook within SageMaker. If you run get_execution_role in a notebook not on SageMaker, expect a "region" error.

import sagemaker
sess = sagemaker.Session()
role = sagemaker.get_execution_role()

Local environment

import sagemaker
import boto3

iam_client = boto3.client('iam')
role = iam_client.get_role(RoleName='role-name-of-your-iam-role-with-right-permissions')['Role']['Arn']
sess = sagemaker.Session()

Deploy a πŸ€— Transformers model trained in SageMaker for inference

There are two ways to deploy your Hugging Face model trained in SageMaker. You can either deploy it right after your training is finished, or you can deploy it later, using model_data to point to your saved model on S3.

Deploy the model directly after training

If you deploy your model directly after training, you need to ensure that all required files are saved in your training script, including the Tokenizer and the Model.

If you use the Trainer API, you can pass your tokenizer as an argument to the Trainer; it will then be saved automatically when you call Trainer.save_model().
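
For illustration, here is a minimal sketch of how the tokenizer can be passed to the Trainer inside your training script; the model, tokenizer, and train_dataset variables are assumptions and stand in for objects defined elsewhere in your script.

from transformers import Trainer, TrainingArguments

# passing the tokenizer ensures its files are saved together with the model
trainer = Trainer(
    model=model,                                          # your πŸ€— Transformers model (assumed defined)
    args=TrainingArguments(output_dir="/opt/ml/model"),   # SageMaker picks up artifacts from this directory
    train_dataset=train_dataset,                          # your training dataset (assumed defined)
    tokenizer=tokenizer,                                  # your tokenizer (assumed defined)
)
trainer.train()
trainer.save_model()  # writes model and tokenizer files to output_dir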

from sagemaker.huggingface import HuggingFace

############ pseudo code start ############

# create HuggingFace estimator for running training
huggingface_estimator = HuggingFace(....)

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(...)

############ pseudo code end ############

# deploy model to SageMaker Inference
predictor = huggingface_estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")

# example request, you always need to define "inputs"
data = {
   "inputs": "Camera - You are awarded a SiPix Digital Camera! call 09061221066 fromm landline. Delivery within 28 days."
}
# request
predictor.predict(data)

After we run our request, we can delete the endpoint again with:

# delete endpoint
predictor.delete_endpoint()

Deploy the model using model_data

If you've already trained your model and want to deploy it at some later time, you can use the model_data argument to specify the location of your tokenizer and model weights.

from sagemaker.huggingface.model import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data="s3://models/my-bert-model/model.tar.gz",  # path to your trained sagemaker model
   role=role, # iam role with permissions to create an Endpoint
   transformers_version="4.6", # transformers version used
   pytorch_version="1.7", # pytorch version used
   py_version='py36', # python version used
)
# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
   initial_instance_count=1,
   instance_type="ml.m5.xlarge"
)

# example request, you always need to define "inputs"
data = {
   "inputs": "Camera - You are awarded a SiPix Digital Camera! call 09061221066 fromm landline. Delivery within 28 days."
}

# request
predictor.predict(data)

After we run our request, we can delete the endpoint again with:

# delete endpoint
predictor.delete_endpoint()

Deploy one of the 10,000+ πŸ€— Transformers models available in the πŸ€— Hub for inference

To deploy a model directly from the πŸ€— Hub to SageMaker, we need to define two environment variables when creating the HuggingFaceModel:

  • HF_MODEL_ID: defines the model id, which will be automatically loaded from huggingface.co/models when creating your SageMaker Endpoint. The πŸ€— Hub provides 10,000+ models, all available through this environment variable.
  • HF_TASK: defines the task for the πŸ€— Transformers pipeline used. A full list of tasks can be found here.

from sagemaker.huggingface.model import HuggingFaceModel

# Hub Model configuration. <https://huggingface.co/models>
hub = {
  'HF_MODEL_ID':'distilbert-base-uncased-distilled-squad', # model_id from hf.co/models
  'HF_TASK':'question-answering' # NLP task you want to use for predictions
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   env=hub, # configuration for loading model from Hub
   role=role, # iam role with permissions to create an Endpoint
   transformers_version="4.6", # transformers version used
   pytorch_version="1.7", # pytorch version used
   py_version='py36', # python version used
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
   initial_instance_count=1,
   instance_type="ml.m5.xlarge"
)

# example request, you always need to define "inputs"
data = {
"inputs": {
    "question": "What is used for inference?",
    "context": "My Name is Philipp and I live in Nuremberg. This model is used with sagemaker for inference."
    }
}

# request
predictor.predict(data)

After we run our request, we can delete the endpoint again with:

# delete endpoint
predictor.delete_endpoint()

Run a Batch Transform Job using πŸ€— Transformers and Amazon SageMaker

After you train a model, you can use Amazon SageMaker Batch Transform to perform inference with the model. In Batch Transform you provide your inference data as an S3 URI, and SageMaker takes care of downloading the data, running the prediction, and uploading the results back to S3. You can find more documentation for Batch Transform here.

The Hugging Face Inference DLC currently only supports .jsonl for batch transform, due to the complex structure of textual data.

NOTE: While preprocessing, you need to make sure that your inputs fit the max_length of the model.
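
As a rough sketch, you could truncate your inputs to the model's max_length while building the input.jsonl file; the model id and sentences below are placeholders, not requirements.

import json
from transformers import AutoTokenizer

# use the tokenizer that matches the model you will deploy (placeholder id)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

sentences = ["this movie is terrible", "this movie is amazing"]

with open("input.jsonl", "w") as f:
    for sentence in sentences:
        # truncate to the model's maximum input length before writing the record
        encoded = tokenizer(sentence, truncation=True, max_length=tokenizer.model_max_length)
        text = tokenizer.decode(encoded["input_ids"], skip_special_tokens=True)
        f.write(json.dumps({"inputs": text}) + "\n")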

If you trained the model using the HuggingFace estimator, you can invoke the transformer() method to create a transform job for a model based on the training job.

batch_job = huggingface_estimator.transformer(
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    strategy='SingleRecord')


batch_job.transform(
    data='s3://s3-uri-to-batch-data',
    content_type='application/json',    
    split_type='Line')

For more details about what can be specified here, see API docs.

If you want to run your Batch Transform Job later, or with a model from hf.co/models, you can do this by creating a HuggingFaceModel instance and then using the transformer() method.

from sagemaker.huggingface.model import HuggingFaceModel

# Hub Model configuration. <https://huggingface.co/models>
hub = {
    'HF_MODEL_ID':'distilbert-base-uncased-finetuned-sst-2-english',
    'HF_TASK':'text-classification'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   env=hub, # configuration for loading model from Hub
   role=role, # iam role with permissions to create an Endpoint
   transformers_version="4.6", # transformers version used
   pytorch_version="1.7", # pytorch version used
   py_version='py36', # python version used
)

# create Transformer to run our batch job
batch_job = huggingface_model.transformer(
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    strategy='SingleRecord')

# starts batch transform job and uses s3 data as input
batch_job.transform(
    data='s3://sagemaker-s3-demo-test/samples/input.jsonl',
    content_type='application/json',    
    split_type='Line')

The input.jsonl looks like this:

{"inputs":"this movie is terrible"}
{"inputs":"this movie is amazing"}
{"inputs":"SageMaker is pretty cool"}
{"inputs":"SageMaker is pretty cool"}
{"inputs":"this movie is terrible"}
{"inputs":"this movie is amazing"}

Advanced Features

Environment variables

The SageMaker Hugging Face Inference Toolkit implements various additional environment variables to simplify your deployment experience. A full list of environment variables is given below.

HF_TASK

The HF_TASK environment variable defines the task for the πŸ€— Transformers pipeline used. A full list of tasks can be found here.

HF_TASK="question-answering"

HF_MODEL_ID

The HF_MODEL_ID environment variable defines the model id, which will be automatically loaded from huggingface.co/models when creating your SageMaker Endpoint. The πŸ€— Hub provides 10,000+ models, all available through this environment variable.

HF_MODEL_ID="distilbert-base-uncased-finetuned-sst-2-english"

HF_MODEL_REVISION

HF_MODEL_REVISION is an extension to HF_MODEL_ID and allows you to pin a specific revision of the model to make sure you always load the same model on your SageMaker Endpoint.

HF_MODEL_REVISION="03b4d196c19d0a73c7e0322684e97db1ec397613"

HF_API_TOKEN

The HF_API_TOKEN environment variable defines your Hugging Face authorization token. The HF_API_TOKEN is used as HTTP bearer authorization for remote files, like private models. You can find your token on your settings page.

HF_API_TOKEN="api_XXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
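
For example, when deploying a private model from the Hub, you could add the token to the Hub configuration passed to the HuggingFaceModel; the model id below is a hypothetical placeholder.

# Hub Model configuration for a private model (placeholder model id and token)
hub = {
  'HF_MODEL_ID':'my-org/my-private-model',
  'HF_TASK':'text-classification',
  'HF_API_TOKEN':'api_XXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
}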

Creating a Model artifact model.tar.gz for deployment

As shown in "Deploy the model using model_data", you can either deploy your model directly after training or create a model.tar.gz later and use it for deployment. The model.tar.gz contains all required files to run your model, including your model file (either pytorch_model.bin or tf_model.h5) and tokenizer files such as tokenizer.json and tokenizer_config.json. All model artifacts need to be directly in the archive, without a folder hierarchy.

Example for PyTorch:

model.tar.gz/
|- pytorch_model.bin
|- vocab.txt
|- tokenizer_config.json
|- config.json
|- special_tokens_map.json

Steps to create a model.tar.gz from a model on hf.co/models:

  1. Download the model

    git lfs install
    git clone https://huggingface.co/{repository}
    
  2. Create a tar file

    cd {repository}
    tar zcvf model.tar.gz *
    
  3. Upload model.tar.gz to s3

    aws s3 cp model.tar.gz s3://{my-s3-path}
    

After that, you can use the S3 URI as model_data.
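
For example, a minimal sketch that reuses the configuration from the earlier examples; the S3 path is the placeholder from step 3.

from sagemaker.huggingface.model import HuggingFaceModel

# create Hugging Face Model Class from the uploaded model.tar.gz
huggingface_model = HuggingFaceModel(
   model_data="s3://{my-s3-path}/model.tar.gz",  # S3 URI from step 3
   role=role, # iam role with permissions to create an Endpoint
   transformers_version="4.6", # transformers version used
   pytorch_version="1.7", # pytorch version used
   py_version='py36', # python version used
)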

User defined code/modules

The Hugging Face Inference Toolkit allows the user to override the default methods of the HuggingFaceHandlerService. To do so, you need to create a folder named code/ with an inference.py file in it. For example:

model.tar.gz/
|- pytorch_model.bin
|- ....
|- code/
  |- inference.py
  |- requirements.txt 

In this example, pytorch_model.bin is the model file saved from training, inference.py is the custom inference module, and requirements.txt is a requirements file to add additional dependencies. The custom module can override the following methods:

  • model_fn(model_dir): Overrides the default method for loading the model. The return value model will be used in predict() for predictions. It receives the argument model_dir, the path to your unzipped model.tar.gz.
  • transform_fn(model, data, content_type, accept_type): Overrides the default transform function with a custom implementation. If you use this, you have to implement the preprocess, predict and postprocess steps in transform_fn yourself. NOTE: This method can't be combined with input_fn, predict_fn or output_fn mentioned below.
  • input_fn(input_data, content_type): Overrides the default method for preprocessing. The return value data will be used in the predict() method for predictions. The inputs are input_data, the raw body of your request, and content_type, the content type from the request header.
  • predict_fn(processed_data, model): Overrides the default method for predictions. The return value predictions will be used in the postprocess() method. The input is processed_data, the result of the preprocess() method.
  • output_fn(prediction, accept): Overrides the default method for postprocessing. The return value result will be the response to your request (e.g. JSON). The inputs are prediction, the result of the predict() method, and accept, the accept type from the HTTP request, e.g. application/json.

Example of an inference.py with model_fn, input_fn, predict_fn & output_fn:

def model_fn(model_dir):
    return "model"


def input_fn(data, content_type):
    return "data"


def predict_fn(data, model):
    return "output"


def output_fn(prediction, accept):
    return prediction

Example of an inference.py with model_fn & transform_fn:

def model_fn(model_dir):
    return "loading model"


def transform_fn(model, input_data, content_type, accept):
    return f"output"

Additional Resources