Running Any HuggingFace Model on SageMaker Endpoint: Walk-Through with Cross Encoder Model Example

Community Article Published December 14, 2023


In this document, we will go through step-by-step guidance on how to instantiate any HuggingFace model as a SageMaker endpoint. It applies to any model, not only models that support text-generation or text2text-generation tasks. We will use a cross-encoder model, BAAI/bge-reranker-base, as an example.

The sample code was tested on OSX in the us-west-2 AWS region.

Infrastructural Overview

  1. TorchServe allows you to run a HuggingFace model as a web server.
  2. The AWS team created sagemaker-pytorch-inference-toolkit to make it easy to run TorchServe as a SageMaker endpoint.
  3. The AWS team also created container images on ECR using Dockerfiles. One of these images uses sagemaker-pytorch-inference-toolkit.
  4. On AWS SageMaker,
    1. You create a SageMaker model by specifying a TorchServe-based Docker image and the model zip location in an S3 bucket.
    2. You create a SageMaker endpoint configuration by selecting the model and the desired instance type to run it on.
    3. You create a SageMaker endpoint from the endpoint configuration. This is the actual web service instance.

To run any model on a SageMaker endpoint, all you need to know is how to create the model zip file (step #4-1), which is the main topic of this document.

Steps on Running a HuggingFace Model on SageMaker Endpoint

  1. Figure out how to use the model in a barebone Python environment such as a SageMaker Notebook terminal.
  2. From #1, write the inference script and test it locally.
  3. Package #2 as a zip file and upload it to an S3 bucket.
  4. Create the SageMaker model, endpoint configuration, and then endpoint. Test.
  5. Write client helper code for easy consumption of the service. Test.
  6. (Optional) Revise the package zip file to include the model binary.

We will go through them in the following sections.

1. Figure Out How to Use the Model

From the HuggingFace documentation of BAAI/bge-reranker-base model, we figure out how to use the model.

  1. Launch python3.
  2. Copy and paste the following and ensure it works.
    • Take note of the packages you had to install to make it work. This list will be used later for defining requirements.txt.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-base')
model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-base')

pairs = [['I love you', 'i like you'], ['I love you', 'i hate you']]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
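To capture the package list mentioned above, a small helper can print the locally installed versions in requirements.txt format. This is only a sketch: the helper name pinned_requirements and the package list passed to it are assumptions, and you should substitute whatever you actually had to install.

```python
from importlib import metadata
from typing import Iterable, List


def pinned_requirements(packages: Iterable[str]) -> List[str]:
    """Return 'name==version' lines for the packages that are installed locally."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Not installed in this environment, so there is nothing to pin.
            continue
    return lines


if __name__ == "__main__":
    # Hypothetical package list; replace it with whatever you had to install.
    print("\n".join(pinned_requirements(["torch", "transformers"])))
```

Packages that are not installed are silently skipped, so the output only ever reflects the environment in which the model actually ran.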

2. Write the Inference Script

Here is the sample content for the inference script. Most of what you had in step #1 is in the CrossEncoder class.

import json
import logging
import torch
from typing import List
from sagemaker_inference import encoder
from transformers import AutoModelForSequenceClassification, AutoTokenizer

PAIRS = "pairs"
SCORES = "scores"

class CrossEncoder:
    def __init__(self) -> None:
        self.device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
        logging.info(f"Using device: {self.device}")
        model_name = 'BAAI/bge-reranker-base'
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model = self.model.to(self.device)

    def __call__(self, pairs: List[List[str]]) -> List[float]:
        with torch.inference_mode():
            inputs = self.tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
            inputs = inputs.to(self.device)
            scores = self.model(**inputs, return_dict=True).logits.view(-1, ).float()

        return scores.detach().cpu().tolist()

def model_fn(model_dir: str) -> CrossEncoder:
    try:
        return CrossEncoder()
    except Exception:
        logging.exception(f"Failed to load model from: {model_dir}")
        raise  # re-raise so a failed load is not silently returned as None

def transform_fn(cross_encoder: CrossEncoder, input_data: bytes, content_type: str, accept: str) -> bytes:
    payload = json.loads(input_data)
    model_output = cross_encoder(**payload)
    output = {SCORES: model_output}
    return encoder.encode(output, accept)

In the code above, we define model_fn() and transform_fn(). Brief explanations follow.

2.1. model_fn()

This function is responsible for loading the model and returning the reference.

The CrossEncoder class downloads the needed model on the fly. This keeps the model package very small, at the expense of a runtime dependency on the HuggingFace service. SageMaker Jumpstart model zip files, by contrast, include the model binary in the zip file on S3. We will cover how to do this in a later section.

2.2. transform_fn()

Here you define how the request payload will be parsed and what the output will be like.

Note that the CrossEncoder class's __init__() is called only once, by model_fn(), whereas __call__() is called on every transform_fn() call.
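The load-once, call-per-request lifecycle can be illustrated without torch or SageMaker. The sketch below is not the article's script: StubEncoder is a hypothetical stand-in that scores a pair by the length of its first text, and plain json.dumps replaces the toolkit's encoder, so only the request and response shapes match the real code.

```python
import json
from typing import List


class StubEncoder:
    """Stand-in for CrossEncoder; it scores a pair by the length of its first text."""

    def __init__(self) -> None:
        self.call_count = 0  # incremented on every request

    def __call__(self, pairs: List[List[str]]) -> List[float]:
        self.call_count += 1
        return [float(len(query)) for query, _ in pairs]


def model_fn(model_dir: str) -> StubEncoder:
    # Runs once, when the container starts.
    return StubEncoder()


def transform_fn(model: StubEncoder, input_data: str, content_type: str, accept: str) -> str:
    # Runs for every request: parse the payload, score it, serialize the result.
    payload = json.loads(input_data)
    if "pairs" not in payload:
        raise ValueError('request body must contain a "pairs" key')
    return json.dumps({"scores": model(payload["pairs"])})


model = model_fn("")  # load once
for _ in range(3):    # serve three requests with the same model object
    reply = transform_fn(model, '{"pairs": [["abc", "x"]]}', "application/json", "application/json")
```

After the loop, model.call_count is 3 while the model object was built only once, which is why expensive work such as loading weights belongs in __init__().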

2.3. Testing

In a local terminal with Python installed, load the inference script into an interactive session (python3 -i followed by the script's file name) and run:

model = model_fn("")
transform_fn(model, "{\"pairs\": [[\"I love you\", \"i like you\"], [\"I love you\", \"i hate you\"]]}", "application/json", "application/json")

You should get the same scores as in the earlier test.
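When comparing scores across runs, exact equality can be too strict, since floating-point results may differ slightly between machines. A minimal comparison helper might look like the following; the name scores_match and the default tolerance are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def scores_match(expected: Sequence[float], actual: Sequence[float], rel_tol: float = 1e-4) -> bool:
    """True when both lists have the same length and each pair of scores is close."""
    return len(expected) == len(actual) and all(
        math.isclose(e, a, rel_tol=rel_tol) for e, a in zip(expected, actual)
    )
```

The same helper can be reused for the endpoint test and the client wrapper test later in this document.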

3. Package the Model and Upload to S3 Bucket

  1. Create the model package root folder.
  2. Create a code subfolder under the root folder.
  3. In the code folder, put the inference script from step #2 along with an empty file, requirements.txt, and version.
  4. From the model package root folder, compress and upload the model package.

Sample directory structure for #3:

<model package root>
└── code
    ├── <empty file>         # the content is empty
    ├── <inference script>   # the content is from step #2
    ├── requirements.txt
    └── version              # the content can be a one-line string "1.0.0"

Note that requirements.txt does not need to list a package you installed for your local run if the container image already includes it.


Sample code for getting #4 done:

tar zcvf BAAI_bge-reranker-base.tar.gz *
aws s3 cp BAAI_bge-reranker-base.tar.gz s3://<<YOUR_S3_BUCKET_NAME>>/huggingface-models/
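The tar command above can also be sketched in Python with the standard tarfile module, which makes it easy to inspect what ended up in the archive. The function name package_model is an assumption; the archive should be written outside the package root so that it does not include itself.

```python
import tarfile
from pathlib import Path
from typing import List


def package_model(root: Path, archive: Path) -> List[str]:
    """Write every entry directly under root into a gzip-compressed tar archive."""
    with tarfile.open(archive, "w:gz") as tar:
        for entry in sorted(root.iterdir()):
            # arcname keeps paths relative to the package root (e.g. code/version).
            tar.add(entry, arcname=entry.name)
    # Reopen the archive and report the member names for a quick sanity check.
    with tarfile.open(archive, "r:gz") as tar:
        return tar.getnames()
```

The returned names should start with code/ rather than with the root folder's own name, matching the directory structure shown earlier. Uploading the result could then be done with the aws s3 cp command above or with boto3's S3 client.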

4. Create SageMaker Endpoint

  1. Create a model.
    1. Log in to the AWS console and open the SageMaker models page.
    2. Click the Create model button.
    3. In Location of inference code image, enter the URI of the TorchServe-based container image described in the Infrastructural Overview.
    4. In Location of model artifacts - optional, enter the S3 path from step #3, Package the Model and Upload to S3 Bucket.
  2. Create an endpoint configuration.
    1. Open the SageMaker endpoint configurations page.
    2. Click Create endpoint configuration.
    3. Click Create production variant and select the model you created in step #1.
  3. Create an endpoint.
    1. Open the SageMaker endpoints page.
    2. Click Create endpoint.
    3. Select the endpoint configuration you created in step #2.
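The same three console steps can be scripted with boto3. The sketch below is an assumption-laden outline rather than the article's method: the function name deploy, the variant name, and the choice to reuse one name for all three resources are illustrative, and the image URI, role ARN, and instance type must come from your own account. The client is passed in, so the function itself does not need AWS access to be read or tested.

```python
from typing import Any, Dict


def deploy(sm: Any, name: str, image_uri: str, model_data_url: str,
           role_arn: str, instance_type: str) -> Dict[str, str]:
    """Create the model, endpoint configuration, and endpoint, in that order."""
    sm.create_model(
        ModelName=name,
        PrimaryContainer={"Image": image_uri, "ModelDataUrl": model_data_url},
        ExecutionRoleArn=role_arn,
    )
    sm.create_endpoint_config(
        EndpointConfigName=name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",  # illustrative variant name
            "ModelName": name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
        }],
    )
    sm.create_endpoint(EndpointName=name, EndpointConfigName=name)
    return {"model": name, "endpoint_config": name, "endpoint": name}
```

In real use, sm would be boto3.client("sagemaker", region_name="us-west-2"), and model_data_url would be the S3 path of the package uploaded in step #3.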

4.1. Testing

Once the endpoint is in InService status, run the following in your Python terminal and confirm you get the same scores as in the earlier tests.

import boto3, json
session = boto3.Session()
client = session.client("sagemaker-runtime", region_name="us-west-2")
output = client.invoke_endpoint(
    EndpointName="my-bge-reranker-base",
    Body="{\"pairs\": [[\"I love you\", \"i like you\"], [\"I love you\", \"i hate you\"]]}",
    ContentType="application/json",
)
print(output["Body"].read().decode("utf-8"))  # show the returned scores
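To check for InService status from code before running the test above, the endpoint status can be polled. This is a sketch under assumptions: the function name, the polling interval, and the retry limit are illustrative, and sm is expected to be a boto3 "sagemaker" client (not the "sagemaker-runtime" client used for invocation).

```python
import time
from typing import Any, Callable


def wait_until_in_service(sm: Any, endpoint_name: str, poll_seconds: float = 30.0,
                          max_polls: int = 60,
                          sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll describe_endpoint until the status is InService or Failed; return the last status."""
    status = "Unknown"
    for _ in range(max_polls):
        status = sm.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status in ("InService", "Failed"):
            break
        sleep(poll_seconds)
    return status
```

The sleep function is injectable only so the loop can be exercised quickly in tests; in normal use the default time.sleep is fine.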

4.2. Troubleshooting Tip

Quick Tip for Updating Your Model: If you are tweaking the model package, for example the inference script, there is no need to start over. Just update the model zip file in your S3 bucket, then delete and recreate the endpoint with the existing configuration. This approach saves time and effort.
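The delete-and-recreate step can be sketched with boto3 as follows. The function name recreate_endpoint is an assumption, and sm is expected to be a boto3 "sagemaker" client; the existing model and endpoint configuration are left untouched.

```python
from typing import Any


def recreate_endpoint(sm: Any, endpoint_name: str, config_name: str) -> None:
    """Delete the endpoint, wait for deletion to finish, then create it again."""
    sm.delete_endpoint(EndpointName=endpoint_name)
    # The same name cannot be reused until the old endpoint is fully gone.
    sm.get_waiter("endpoint_deleted").wait(EndpointName=endpoint_name)
    sm.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=config_name)
```

Because the new endpoint reads the model package from the same S3 path, it picks up the updated zip file when its container starts.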

4.3. Troubleshooting: The endpoint service does not respond.

The endpoint should respond about as fast as your local run. If it does not respond, the most likely cause is that the model failed to launch.

  1. In the AWS Console, open the SageMaker endpoint page.
  2. Click the Model container logs link.
  3. Check the log to see what went wrong.

At this point, the most likely cause is a missing package, which can be addressed by modifying requirements.txt.
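The container logs can also be read from code through CloudWatch Logs. The sketch below assumes the endpoint writes to the log group /aws/sagemaker/Endpoints/<endpoint name> and that logs is a boto3 "logs" client; the function name is illustrative.

```python
from typing import Any, List


def container_log_messages(logs: Any, endpoint_name: str, limit: int = 50) -> List[str]:
    """Return up to `limit` log messages from the endpoint's CloudWatch log group."""
    response = logs.filter_log_events(
        logGroupName=f"/aws/sagemaker/Endpoints/{endpoint_name}",
        limit=limit,
    )
    return [event["message"] for event in response.get("events", [])]
```

Searching the returned messages for an import error is usually the quickest way to spot a package missing from requirements.txt.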

4.4. Troubleshooting: The endpoint service responds slowly.

Make sure it uses the GPU. The "Using device: {self.device}" log message should report cuda. If it reports cpu, the endpoint is not utilizing the GPU.

This can happen for various reasons. One case I experienced was due to an incorrect torch version. I took torch out of requirements.txt, and the problem was resolved once the endpoint used torch from the container image.
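Once you have the container log lines, picking out the reported device is a simple pattern match. The helper below is a hypothetical convenience, assuming the log contains the "Using device: ..." message from the inference script.

```python
import re
from typing import Iterable, Optional

DEVICE_PATTERN = re.compile(r"Using device: (\w+)")


def device_from_logs(lines: Iterable[str]) -> Optional[str]:
    """Return the device named in the last 'Using device: ...' line, or None if absent."""
    device = None
    for line in lines:
        match = DEVICE_PATTERN.search(line)
        if match:
            device = match.group(1)
    return device
```

A result of "cpu" on a GPU instance type points to the kind of torch version mismatch described above.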

4.5. Troubleshooting: Any other issues

You can put debugging messages in the inference script.

You could also download the container and run the model locally. If you go this far, note that your model package root should be mapped to /opt/ml/model in the Docker container. (In other words, the inference script should be located under /opt/ml/model/code/.)

5. Write a Client Wrapper

I created the following code for easier consumption from the client side.

import json
from typing import Any, Dict, List, Optional

from langchain.pydantic_v1 import BaseModel, Extra, root_validator
from langchain.schema.cross_encoder import CrossEncoder

class CrossEncoderContentHandler:
    """Content handler for CrossEncoder class."""
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, pairs: List[List[str]]) -> bytes:
        input_str = json.dumps({"pairs": pairs})
        return input_str.encode('utf-8')

    def transform_output(self, output: Any) -> List[float]:
        response_json = json.loads(output.read().decode("utf-8"))
        scores = response_json["scores"]
        return scores

class SagemakerEndpointCrossEncoder(BaseModel):
    client: Any  #: :meta private:

    endpoint_name: str = ""
    region_name: str = ""
    credentials_profile_name: Optional[str] = None
    content_handler: CrossEncoderContentHandler = CrossEncoderContentHandler()
    model_kwargs: Optional[Dict] = None
    endpoint_kwargs: Optional[Dict] = None

    class Config:
        extra = Extra.forbid
        arbitrary_types_allowed = True

    @root_validator()
    def validate_environment(cls, values: Dict) -> Dict:
        """Validate that AWS credentials and the boto3 package exist in the environment."""
        import boto3

        if values["credentials_profile_name"] is not None:
            session = boto3.Session(profile_name=values["credentials_profile_name"])
        else:
            # use default credentials
            session = boto3.Session()

        values["client"] = session.client(
            "sagemaker-runtime", region_name=values["region_name"]
        )
        return values

    def score(self, pairs: List[List[str]]) -> List[float]:
        """Call out to SageMaker Inference CrossEncoder endpoint."""
        _model_kwargs = self.model_kwargs or {}
        _endpoint_kwargs = self.endpoint_kwargs or {}

        body = self.content_handler.transform_input(pairs)
        content_type = self.content_handler.content_type
        accepts = self.content_handler.accepts

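        # send request
        try:
            response = self.client.invoke_endpoint(
                EndpointName=self.endpoint_name,
                Body=body,
                ContentType=content_type,
                Accept=accepts,
                **_endpoint_kwargs,
            )
        except Exception as e:
            raise ValueError(f"Error raised by inference endpoint: {e}")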

        return self.content_handler.transform_output(response["Body"])

def _setup_sagemaker_endpoint_for_cross_encoder(reranker_endpoint_name: str,
                                                 region: str) -> SagemakerEndpointCrossEncoder:
    sm_llm = SagemakerEndpointCrossEncoder(
        endpoint_name=reranker_endpoint_name,
        region_name=region,
    )
    return sm_llm

Test: confirm that the result from llm.score() matches the previous tests.

llm = _setup_sagemaker_endpoint_for_cross_encoder("my-bge-reranker-base", "us-west-2")
llm.score([["I love you", "i like you"], ["I love you", "i hate you"]])
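A typical use of a cross encoder is to rerank candidate passages for a query. The sketch below shows one way to build that on top of any scoring function with the same signature as score(); the function name rerank is an assumption and is not part of the wrapper above.

```python
from typing import Callable, List, Tuple


def rerank(query: str, passages: List[str],
           score: Callable[[List[List[str]]], List[float]]) -> List[Tuple[str, float]]:
    """Score each (query, passage) pair and return passages sorted by descending score."""
    scores = score([[query, passage] for passage in passages])
    return sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
```

With the wrapper from this section, a call would look like rerank("I love you", ["i hate you", "i like you"], llm.score). All pairs are sent in a single request, so the endpoint is invoked once per query.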

6. (Optional) Revise package zip file to include model binary

For an example of how to include the model in the zip file, refer to other Jumpstart models. You can get the list from S3 bucket by running the following:

aws s3 ls s3://jumpstart-cache-prod-us-west-2/huggingface-infer/prepack/ --recursive

The inference code in them also includes samples of how to pass kwargs to the model instantiation or add parameter validation logic.

Including the model binary is the standard practice for SageMaker Jumpstart. Personally, I am not sure this is necessarily the better practice, since it takes longer to change the image or load the endpoint.

To include the model binary, you need to modify the inference script under code/ so that it loads the model from the local model directory (model_dir) instead of downloading it from HuggingFace.

class CrossEncoder:
    def __init__(self, model_dir: str, **kwargs: Any) -> None:
        self.device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
        logging.info(f"Using device: {self.device}")

        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_dir)
        self.model = self.model.to(self.device)

def model_fn(model_dir: str) -> CrossEncoder:
    return CrossEncoder(model_dir)
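Since loading now depends on files inside the package, a quick pre-flight check of the directory can save a failed deployment. The sketch below is an assumption: the file names follow common transformers conventions (config.json plus either model.safetensors or pytorch_model.bin) and may need adjusting for your model repository.

```python
from pathlib import Path
from typing import List

# File names assumed here follow common transformers conventions; adjust to your model repo.
REQUIRED = ["config.json"]
WEIGHTS = ["model.safetensors", "pytorch_model.bin"]


def missing_model_files(model_dir: str) -> List[str]:
    """List what appears to be missing for loading a model from model_dir."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED if not (root / name).is_file()]
    if not any((root / name).is_file() for name in WEIGHTS):
        missing.append(" or ".join(WEIGHTS))
    return missing
```

An empty list means the expected files are present; it does not guarantee that the model loads, only that the obvious pieces were packaged.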

Then create a shell script with the following content, place it in the model package root folder, and run it. The new model package will be uploaded to the S3 bucket.

rm -rf build
mkdir build
cd build
git clone
cp -r ../code bge-reranker-base/
cd bge-reranker-base
tar zcvf BAAI_bge-reranker-base.tar.gz *
aws s3 cp BAAI_bge-reranker-base.tar.gz s3://<<YOUR_S3_BUCKET_NAME>>/huggingface-models/

Make sure the endpoint test from 4. Create SageMaker Endpoint still works.