Deploying TensorFlow Vision Models in Hugging Face with TF Serving

Published July 25, 2022
Open In Colab

Over the past few months, the Hugging Face team and external contributors have added a variety of vision models in TensorFlow to Transformers. The list keeps growing and already includes state-of-the-art pre-trained models like Vision Transformer, Masked Autoencoders, RegNet, ConvNeXt, and many others!

When it comes to deploying TensorFlow models, you have a variety of options. Depending on your use case, you may want to expose your model as an endpoint or package it in an application itself. TensorFlow provides tools that cater to each of these different scenarios.

In this post, you'll see how to deploy a Vision Transformer (ViT) model (for image classification) locally using TensorFlow Serving (TF Serving). This will allow developers to expose the model either as a REST or gRPC endpoint. Moreover, TF Serving supports many deployment-specific features off-the-shelf such as model warmup, server-side batching, etc.

To get the complete working code shown throughout this post, refer to the Colab Notebook shown at the beginning.

Saving the Model

All TensorFlow models in 🤗 Transformers have a method named save_pretrained(). With it, you can serialize the model weights in the h5 format as well as in the standalone SavedModel format. TF Serving needs a model to be present in the SavedModel format. So, let's first load a Vision Transformer model and save it:

from transformers import TFViTForImageClassification

temp_model_dir = "vit"
ckpt = "google/vit-base-patch16-224"

model = TFViTForImageClassification.from_pretrained(ckpt)
model.save_pretrained(temp_model_dir, saved_model=True)

By default, save_pretrained() will first create a version directory inside the path we provide to it. So, the path ultimately becomes: {temp_model_dir}/saved_model/{version}.

We can inspect the serving signature of the SavedModel like so:

saved_model_cli show --dir {temp_model_dir}/saved_model/1 --tag_set serve --signature_def serving_default

This should output:

The given SavedModel SignatureDef contains the following input(s):
  inputs['pixel_values'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, -1, -1, -1)
      name: serving_default_pixel_values:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['logits'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 1000)
      name: StatefulPartitionedCall:0
Method name is: tensorflow/serving/predict

As you can see, the model accepts a single 4-d input (namely, pixel_values) with the following axes: (batch_size, num_channels, height, width). For this model, the accepted height and width are both 224, and the number of channels is 3. You can verify this by inspecting the model's configuration (model.config). The model yields a 1000-d vector of logits.

Model Surgery

Nearly every ML model has certain preprocessing and postprocessing steps, and the ViT model is no exception. The major preprocessing steps include:

  • Scaling the image pixel values to [0, 1] range.

  • Normalizing the scaled pixel values to [-1, 1].

  • Resizing the image so that it has a spatial resolution of (224, 224).

You can confirm these by investigating the image processor associated with the model:

from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained(ckpt)
print(processor)

This should print:

ViTImageProcessor {
  "do_normalize": true,
  "do_resize": true,
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "resample": 2,
  "size": 224
}

Since this is an image classification model pre-trained on the ImageNet-1k dataset, the model outputs need to be mapped to the ImageNet-1k classes as the post-processing step.

To reduce the developers' cognitive load and training-serving skew, it's often a good idea to ship a model that has most of the preprocessing and postprocessing steps built in. Therefore, you should serialize the model as a SavedModel such that the above-mentioned processing ops get embedded into its computation graph.


For preprocessing, image normalization is one of the most essential components:

import tensorflow as tf

def normalize_img(
    img, mean=processor.image_mean, std=processor.image_std
):
    # Scale to the value range of [0, 1] first and then normalize.
    img = img / 255
    mean = tf.constant(mean)
    std = tf.constant(std)
    return (img - mean) / std
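
As a quick sanity check of the arithmetic, the sketch below applies the same two steps to single pixel values in plain Python. It assumes a per-channel mean and standard deviation of 0.5, which is what the [-1, 1] target range listed above implies; normalize_pixel is a hypothetical helper and not part of Transformers.

```python
def normalize_pixel(value, mean=0.5, std=0.5):
    """Scale an 8-bit pixel value to [0, 1], then normalize it."""
    # mean and std of 0.5 are assumptions for this illustration.
    scaled = value / 255
    return (scaled - mean) / std

# The darkest and brightest pixel values map to the two ends of the range.
low = normalize_pixel(0)
high = normalize_pixel(255)
print(low, high)
```

With other mean and std values the output range shifts accordingly, so it is worth reading them from the image processor rather than hard-coding them.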

You also need to resize the image and transpose it so that the channel dimension comes first, which is the standard format in 🤗 Transformers. The code snippet below shows all the preprocessing steps:

CONCRETE_INPUT = "pixel_values" # Which is what we investigated via the SavedModel CLI.
SIZE = processor.size["height"]

def normalize_img(
    img, mean=processor.image_mean, std=processor.image_std
):
    # Scale to the value range of [0, 1] first and then normalize.
    img = img / 255
    mean = tf.constant(mean)
    std = tf.constant(std)
    return (img - mean) / std

def preprocess(string_input):
    # Decode the base64 string into raw bytes, then into an RGB image.
    decoded_input = tf.io.decode_base64(string_input)
    decoded = tf.io.decode_jpeg(decoded_input, channels=3)
    resized = tf.image.resize(decoded, size=(SIZE, SIZE))
    normalized = normalize_img(resized)
    normalized = tf.transpose(
        normalized, (2, 0, 1)
    )  # Since HF models are channel-first.
    return normalized

@tf.function(input_signature=[tf.TensorSpec([None], tf.string)])
def preprocess_fn(string_input):
    # Apply the per-image preprocessing to every string in the batch.
    decoded_images = tf.map_fn(
        preprocess, string_input, dtype=tf.float32, back_prop=False
    )
    return {CONCRETE_INPUT: decoded_images}
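
To make the effect of the (2, 0, 1) permutation concrete, here is a minimal pure-Python sketch that performs the same reordering on a tiny nested list. The helper name hwc_to_chw and the 2x2 example image are made up for illustration; the real pipeline uses tf.transpose as shown above.

```python
def hwc_to_chw(image):
    """Reorder a nested-list image from (height, width, channels)
    to (channels, height, width)."""
    height = len(image)
    width = len(image[0])
    channels = len(image[0][0])
    return [
        [[image[h][w][c] for w in range(width)] for h in range(height)]
        for c in range(channels)
    ]

# A tiny 2x2 image with 3 channels; each innermost list is one pixel.
image = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]
chw = hwc_to_chw(image)
print(len(chw), len(chw[0]), len(chw[0][0]))  # channels, height, width
```

Each entry of chw is now a full height-by-width plane for one channel, which is the layout the model's pixel_values input expects.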

Note on making the model accept string inputs:

When sending images via REST or gRPC requests, the size of the request payload can grow quickly with the resolution of the images. It is therefore good practice to compress the images reliably before preparing the request payload.
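
The rough sketch below gives a feel for the difference in payload size. It uses synthetic, uncompressed bytes as a stand-in for one 224x224 RGB image and a flat list instead of a nested one, so the numbers are only indicative; a real JPEG file would typically be much smaller still.

```python
import base64
import json

HEIGHT, WIDTH, CHANNELS = 224, 224, 3

# Synthetic stand-in for image data: one byte per channel value.
raw_bytes = bytes((i * 7) % 256 for i in range(HEIGHT * WIDTH * CHANNELS))

# Option 1: send every channel value as a JSON number.
as_numbers = json.dumps({"instances": [list(raw_bytes)]})

# Option 2: send the same bytes as a single base64-encoded string.
as_string = json.dumps(
    {"instances": [base64.urlsafe_b64encode(raw_bytes).decode("utf-8")]}
)

print(len(as_numbers), len(as_string))
```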

Postprocessing and Model Export

You're now equipped with the preprocessing operations that you can inject into the model's existing computation graph. In this section, you'll also inject the post-processing operations into the graph and export the model!

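def model_exporter(model: tf.keras.Model):
    # Trace the model's forward pass into a concrete function.
    m_call = tf.function(model.call).get_concrete_function(
        tf.TensorSpec(
            shape=[None, 3, SIZE, SIZE], dtype=tf.float32, name=CONCRETE_INPUT
        )
    )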

    @tf.function(input_signature=[tf.TensorSpec([None], tf.string)])
    def serving_fn(string_input):
        labels = tf.constant(list(model.config.id2label.values()), dtype=tf.string)
        images = preprocess_fn(string_input)
        predictions = m_call(**images)
        indices = tf.argmax(predictions.logits, axis=1)
        pred_source = tf.gather(params=labels, indices=indices)
        probs = tf.nn.softmax(predictions.logits, axis=1)
        pred_confidence = tf.reduce_max(probs, axis=1)
        return {"label": pred_source, "confidence": pred_confidence}

    return serving_fn

You can first derive the concrete function from the model's forward pass method (call()) so the model is nicely compiled into a graph. After that, you can apply the following steps in order:

  1. Pass the inputs through the preprocessing operations.

  2. Pass the preprocessed inputs through the derived concrete function.

  3. Post-process the outputs and return them in a nicely formatted dictionary.
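
The post-processing step can be sketched in plain Python for a single example: take the argmax of the logits, apply softmax, and report the top label with its probability. The function name postprocess, the three labels, and the logit values are hypothetical and only illustrate the logic of the tf.argmax, tf.nn.softmax, and tf.reduce_max calls in serving_fn.

```python
import math

def postprocess(logits, labels):
    """Return the top label and its softmax probability for one example."""
    # Subtract the maximum logit for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    index = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[index], "confidence": probs[index]}

# Hypothetical three-class example.
labels = ["tabby", "Egyptian cat", "lynx"]
result = postprocess([1.0, 3.0, 0.5], labels)
print(result)
```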

Now it's time to export the model!

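import os
import tempfile

MODEL_DIR = tempfile.gettempdir()
VERSION = 1

# Save the model with the custom serving function as its default signature.
tf.saved_model.save(
    model,
    os.path.join(MODEL_DIR, str(VERSION)),
    signatures={"serving_default": model_exporter(model)},
)
os.environ["MODEL_DIR"] = MODEL_DIR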

After exporting, let's inspect the model signatures again:

saved_model_cli show --dir {MODEL_DIR}/1 --tag_set serve --signature_def serving_default

This should output:

The given SavedModel SignatureDef contains the following input(s):
  inputs['string_input'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: serving_default_string_input:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['confidence'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1)
      name: StatefulPartitionedCall:0
  outputs['label'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: StatefulPartitionedCall:1
Method name is: tensorflow/serving/predict

Notice that the model's signature has changed. Specifically, the input type is now a string, and the model returns two things: a confidence score and the string label.

Provided you've already installed TF Serving (covered in the Colab Notebook), you're now ready to deploy this model!

Deployment with TensorFlow Serving

It just takes a single command to do this:

nohup tensorflow_model_server \
  --rest_api_port=8501 \
  --model_name=vit \
  --model_base_path=$MODEL_DIR >server.log 2>&1

From the above command, the important parameters are:

  • rest_api_port denotes the port number that TF Serving will use to deploy the REST endpoint of your model. By default, TF Serving uses port 8500 for the gRPC endpoint.

  • model_name specifies the model name (can be anything) that will be used for calling the APIs.

  • model_base_path denotes the base model path that TF Serving will use to load the latest version of the model.

(The complete list of supported parameters is here.)
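
To illustrate what "latest version" means for model_base_path, the sketch below mimics the selection in plain Python: version directories are named with integers, and the highest number wins. The helper latest_version is hypothetical and only approximates the behavior; it is not TF Serving code.

```python
import os
import tempfile

def latest_version(model_base_path):
    """Return the highest numeric subdirectory name as an int, or None."""
    versions = [
        int(name)
        for name in os.listdir(model_base_path)
        if name.isdigit() and os.path.isdir(os.path.join(model_base_path, name))
    ]
    return max(versions, default=None)

# Demonstrate with a throwaway directory holding versions 1, 2, and 10.
with tempfile.TemporaryDirectory() as base:
    for version in ("1", "2", "10"):
        os.mkdir(os.path.join(base, version))
    os.mkdir(os.path.join(base, "assets"))  # Non-numeric names are ignored.
    newest = latest_version(base)

print(newest)
```

Note that the comparison is numeric, so version 10 is considered newer than version 2.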

And voila! Within minutes, you should be up and running with a deployed model having two endpoints - REST and gRPC.

Querying the REST Endpoint

Recall that you exported the model such that it accepts string inputs encoded with the base64 format. So, to craft the request payload you can do something like this:

import base64
import json

# Get image of a cute cat.
image_path = tf.keras.utils.get_file(
    "image.jpg", ""  # Supply the URL of a JPEG image here.
)

# Read the image from disk as raw bytes and then encode it.
bytes_inputs = tf.io.read_file(image_path)
b64str = base64.urlsafe_b64encode(bytes_inputs.numpy()).decode("utf-8")

# Create the request payload.
data = json.dumps({"signature_name": "serving_default", "instances": [b64str]})

TF Serving's request payload format specification for the REST endpoint is available here. Within instances, you can pass multiple encoded images. Endpoints of this kind are meant to be consumed in online prediction scenarios. For inputs with more than a single data point, you would want to enable batching to get performance optimization benefits.
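
As a small illustration of passing several images at once, the sketch below builds a payload with two entries in instances. The helper build_payload and the placeholder byte strings are assumptions made for this example; in practice the bytes would be the contents of JPEG files.

```python
import base64
import json

def build_payload(images, signature_name="serving_default"):
    """Encode raw image bytes as web-safe base64 strings in one request body."""
    instances = [base64.urlsafe_b64encode(img).decode("utf-8") for img in images]
    return json.dumps({"signature_name": signature_name, "instances": instances})

# Placeholder byte strings stand in for the contents of two JPEG files.
fake_images = [b"first-image-bytes", b"second-image-bytes"]
payload = build_payload(fake_images)

decoded = json.loads(payload)
print(len(decoded["instances"]))
```

The web-safe variant of base64 is used here because the exported model decodes its inputs with tf.io.decode_base64, which expects that encoding.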

Now you can call the API:

import requests

headers = {"content-type": "application/json"}
json_response = requests.post(
    "http://localhost:8501/v1/models/vit:predict", data=data, headers=headers
)
print(json.loads(json_response.text))
# {'predictions': [{'label': 'Egyptian cat', 'confidence': 0.896659195}]}

The REST endpoint is http://localhost:8501/v1/models/vit:predict, following the specification from here. By default, this always picks up the latest version of the model. If you want a specific version, you can use http://localhost:8501/v1/models/vit/versions/1:predict.
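
The two URL forms can be generated with a small helper. This is only a convenience sketch; predict_url is a hypothetical function, and the default host and port match the local deployment used in this post.

```python
def predict_url(model_name, version=None, host="localhost", port=8501):
    """Build the REST predict URL, optionally pinned to a model version."""
    base = f"http://{host}:{port}/v1/models/{model_name}"
    if version is not None:
        base += f"/versions/{version}"
    return f"{base}:predict"

latest_url = predict_url("vit")
pinned_url = predict_url("vit", version=1)
print(latest_url)
print(pinned_url)
```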

Querying the gRPC Endpoint

While REST is quite popular in the API world, many applications often benefit from gRPC. This post does a good job comparing the two ways of deployment. gRPC is usually preferred for low-latency, highly scalable, and distributed systems.

There are a couple of steps. First, you need to open a communication channel:

import grpc
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

Then, create the request payload:

request = predict_pb2.PredictRequest()
request.model_spec.name = "vit"
request.model_spec.signature_name = "serving_default"

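The request also needs the input tensor, keyed by the name of the serving function's input. You can determine that key programmatically and attach the encoded image like so:

loaded = tf.saved_model.load(f"{MODEL_DIR}/{VERSION}")
serving_input = list(
    loaded.signatures["serving_default"].structured_input_signature[1].keys()
)[0]
print("Serving function input:", serving_input)
# Serving function input: string_input

# Attach the base64-encoded image to the request under that key.
request.inputs[serving_input].CopyFrom(tf.make_tensor_proto([b64str]))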

Now, you can get some predictions:

grpc_predictions = stub.Predict(request, 10.0)  # 10 secs timeout
print(grpc_predictions)

outputs {
  key: "confidence"
  value {
    dtype: DT_FLOAT
    tensor_shape {
      dim {
        size: 1
      }
    }
    float_val: 0.8966591954231262
  }
}
outputs {
  key: "label"
  value {
    dtype: DT_STRING
    tensor_shape {
      dim {
        size: 1
      }
    }
    string_val: "Egyptian cat"
  }
}
model_spec {
  name: "vit"
  version {
    value: 1
  }
  signature_name: "serving_default"
}

You can also fetch just the values of interest from the above results like so:

grpc_predictions.outputs["label"].string_val, grpc_predictions.outputs[
    "confidence"
].float_val
# ([b'Egyptian cat'], [0.8966591954231262])
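
Since the labels come back as byte strings, you may want to decode them before use. The sketch below pairs each decoded label with its confidence; pair_predictions is a hypothetical helper that works on plain Python sequences, so the same logic should apply to the string_val and float_val fields.

```python
def pair_predictions(string_vals, float_vals):
    """Decode byte labels and pair each with its confidence score."""
    return [
        (label.decode("utf-8"), confidence)
        for label, confidence in zip(string_vals, float_vals)
    ]

# Values copied from the response shown above.
pairs = pair_predictions([b"Egyptian cat"], [0.8966591954231262])
print(pairs)
```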

Wrapping Up

In this post, we learned how to deploy a TensorFlow vision model from Transformers with TF Serving. While local deployments are great for weekend projects, we would want to be able to scale these deployments to serve many users. In the next series of posts, you'll see how to scale up these deployments with Kubernetes and Vertex AI.
