Hugging Face Text Generation Inference available for AWS Inferentia2

Published February 1, 2024
Update on GitHub

We are excited to announce the general availability of Hugging Face Text Generation Inference (TGI) on AWS Inferentia2 and Amazon SageMaker.

Text Generation Inference (TGI), is a purpose-built solution for deploying and serving Large Language Models (LLMs) for production workloads at scale. TGI enables high-performance text generation using Tensor Parallelism and continuous batching for the most popular open LLMs, including Llama, Mistral, and more. Text Generation Inference is used in production by companies such as Grammarly, Uber, Deutsche Telekom, and many more.

The integration of TGI into Amazon SageMaker, in combination with AWS Inferentia2, presents a powerful solution and viable alternative to GPUs for building production LLM applications. The seamless integration ensures easy deployment and maintenance of models, making LLMs more accessible and scalable for a wide range of production use cases.

With the new TGI for AWS Inferentia2 on Amazon SageMaker, AWS customers can benefit from the same technologies that power highly-concurrent, low-latency LLM experiences like HuggingChatOpenAssistant, and Serverless Endpoints for LLMs on the Hugging Face Hub.

Deploy Zephyr 7B on AWS Inferentia2 using Amazon SageMaker

This tutorial shows how easy it is to deploy a state-of-the-art LLM, such as Zephyr 7B, on AWS Inferentia2 using Amazon SageMaker. Zephyr is a 7B fine-tuned version of mistralai/Mistral-7B-v0.1 that was trained on a mix of publicly available and synthetic datasets using Direct Preference Optimization (DPO), as described in detail in the technical report. The model is released under the Apache 2.0 license, ensuring wide accessibility and use.

We are going to show you how to:

  1. Setup development environment
  2. Retrieve the TGI Neuronx Image
  3. Deploy Zephyr 7B to Amazon SageMaker
  4. Run inference and chat with the model

Let’s get started.

1. Setup development environment

We are going to use the sagemaker python SDK to deploy Zephyr to Amazon SageMaker. We need to make sure to have an AWS account configured and the sagemaker python SDK installed.

!pip install transformers "sagemaker>=2.206.0" --upgrade --quiet

If you are going to use Sagemaker in a local environment. You need access to an IAM Role with the required permissions for Sagemaker. You can find out more about it here.

import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it doesn't exist
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")

2. Retrieve TGI Neuronx Image

The new Hugging Face TGI Neuronx DLCs can be used to run inference on AWS Inferentia2. You can use the get_huggingface_llm_image_uri method of the sagemaker SDK to retrieve the appropriate Hugging Face TGI Neuronx DLC URI based on your desired backend, session, region, and version. You can find all the available versions here.

Note: At the time of writing this blog post the latest version of the Hugging Face LLM DLC is not yet available via the get_huggingface_llm_image_uri method. We are going to use the raw container uri instead.

from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(

# print ecr image uri
print(f"llm image uri: {llm_image}")

4. Deploy Zephyr 7B to Amazon SageMaker

Text Generation Inference (TGI) on Inferentia2 supports popular open LLMs, including Llama, Mistral, and more. You can check the full list of supported models (text-generation) here.

Compiling LLMs for Inferentia2

At the time of writing, AWS Inferentia2 does not support dynamic shapes for inference, which means that we need to specify our sequence length and batch size ahead of time. To make it easier for customers to utilize the full power of Inferentia2, we created a neuron model cache, which contains pre-compiled configurations for the most popular LLMs. A cached configuration is defined through a model architecture (Mistral), model size (7B), neuron version (2.16), number of inferentia cores (2), batch size (2), and sequence length (2048).

This means compiling fine-tuned checkpoints for Mistral 7B with the same configuration will take only a few minutes. Examples of this are mistralai/Mistral-7B-v0.1 and HuggingFaceH4/zephyr-7b-beta.

Note: Currently, TGI can only load compiled checkpoints and models. We are working on an on-the-fly compilation based on the cache. This means that you'll be able to pass any model ID from the Hugging face Hub, e.g., HuggingFaceH4/zephyr-7b-beta as long as there is a cached configuration. This post will be updated when this feature is released.

For this post we compiled HuggingFaceH4/zephyr-7b-beta using the following command and parameters on a inf2.8xlarge instance, and pushed it to the Hub at aws-neuron/zephyr-7b-seqlen-2048-bs-4-cores-2

# compile model with optimum for batch size 4 and sequence length 2048
optimum-cli export neuron -m HuggingFaceH4/zephyr-7b-beta --batch_size 4 --sequence_length 2048 --num_cores 2 --auto_cast_type bf16 ./zephyr-7b-beta-neuron
# push model to hub [repo_id] [local_path] [path_in_repo]
huggingface-cli upload  aws-neuron/zephyr-7b-seqlen-2048-bs-4 ./zephyr-7b-beta-neuron ./ --exclude "checkpoint/**"
# Move tokenizer to neuron model repository
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('HuggingFaceH4/zephyr-7b-beta').push_to_hub('aws-neuron/zephyr-7b-seqlen-2048-bs-4')"

If you are trying to compile an LLM with a configuration that is not yet cached, it can take up to 45 minutes.

Deploying TGI Neuronx Endpoint

Before deploying the model to Amazon SageMaker, we must define the TGI Neuronx endpoint configuration. We need to make sure to set the following parameters according to the fixed-shape compilation parameters we used:

  • MAX_CONCURRENT_REQUESTS: Equal to the batch size that was used to compile the model.
  • MAX_INPUT_LENGTH: Less than or equal to the sequence length that was used to compile the model.
  • MAX_TOTAL_TOKENS: Equal to the sequence length that was used to compile the model.
  • MAX_BATCH_PREFILL_TOKENS: half of the maximum number of tokens [batch_size * sequence_length] / 2
  • MAX_BATCH_TOTAL_TOKENS: Equal to the maximum number of tokens [batch_size * sequence_length]

In addition, we need to set the HF_MODEL_ID pointing to the Hugging Face model ID.

import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config & model config
instance_type = "ml.inf2.8xlarge"
health_check_timeout = 900
batch_size = 4
sequence_length = 2048

# Define Model and Endpoint configuration parameter
config = {
  'HF_MODEL_ID': "aws-neuron/zephyr-7b-seqlen-2048-bs-4-cores-2",
  'MAX_CONCURRENT_REQUESTS': json.dumps(batch_size),
  'MAX_INPUT_LENGTH': json.dumps(1512),
  'MAX_TOTAL_TOKENS': json.dumps(sequence_length),
  'MAX_BATCH_PREFILL_TOKENS': json.dumps(int(sequence_length*batch_size / 2)),
  'MAX_BATCH_TOTAL_TOKENS': json.dumps(sequence_length*batch_size),

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(

After we have created the HuggingFaceModel we can deploy it to Amazon SageMaker using the deploy method. We will deploy the model with the ml.inf2.8xlarge instance type.

# Deploy model to an endpoint
llm = llm_model.deploy(

SageMaker will create our endpoint and deploy the model to it. This can take 10-15 minutes.

5. Run inference and chat with the model

After our endpoint is deployed, we can run inference on it, using the predict method from predictor. We can provide different parameters to impact the generation, adding them to the parameters attribute of the payload. You can find the supported parameters here, or in the open API specification of TGI in the swagger documentation

The HuggingFaceH4/zephyr-7b-beta is a conversational chat model, meaning we can chat with it using a prompt structure like the following:

<|system|>\nYou are a friendly.</s>\n<|user|>\nInstruction</s>\n<|assistant|>\n

Manually preparing the prompt is error prone, so we can use the apply_chat_template method from the tokenizer to help with it. It expects a messages dictionary in the well-known OpenAI format, and converts it into the correct format for the model. Let's see if Zephyr knows some facts about AWS.

from transformers import AutoTokenizer

# load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("aws-neuron/zephyr-7b-seqlen-2048-bs-4-cores-2")

# Prompt to generate
messages = [
    {"role": "system", "content": "You are the AWS expert"},
    {"role": "user", "content": "Can you tell me an interesting fact about AWS?"},
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generation arguments
payload = {
    "do_sample": True,
    "top_p": 0.6,
    "temperature": 0.9,
    "top_k": 50,
    "max_new_tokens": 256,
    "repetition_penalty": 1.03,
    "return_full_text": False,
    "stop": ["</s>"]
chat = llm.predict({"inputs":prompt, "parameters":payload})

# Sure, here's an interesting fact about AWS: As of 2021, AWS has more than 200 services in its portfolio, ranging from compute power and storage to databases,

Awesome, we have successfully deployed Zephyr to Amazon SageMaker on Inferentia2 and chatted with it.

6. Clean up

To clean up, we can delete the model and endpoint.



The integration of Hugging Face Text Generation Inference (TGI) with AWS Inferentia2 and Amazon SageMaker provides a cost-effective alternative solution for deploying Large Language Models (LLMs).

We're actively working on supporting more models, streamlining the compilation process, and refining the caching system.

Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.