metadata
license: mit
tags:
  - endpoints-template
  - optimum
library_name: generic

Optimized and Quantized deepset/roberta-base-squad2 with a custom handler.py

This repository implements a custom handler for the question-answering task on 🤗 Inference Endpoints, with inference accelerated by 🤗 Optimum. The code for the custom handler is in handler.py.

Below we also describe how we converted and optimized the model, based on the Accelerate Transformers with Hugging Face Optimum blog post. You can also check out the notebook.

Expected request payload

{
    "inputs": {
        "question": "As what is Philipp working?", 
        "context": "Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
    }
}
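The payload shape above can also be built programmatically. The following is a minimal sketch: `build_payload` is a hypothetical helper name, not part of the handler or of any Hugging Face library, and the non-empty check is an assumption about what you might want to reject before sending a request.

```python
import json


def build_payload(question: str, context: str) -> dict:
    """Wrap a question/context pair in the {"inputs": {...}} shape shown above."""
    # Assumption: empty strings are rejected on the client side.
    if not question.strip() or not context.strip():
        raise ValueError("question and context must be non-empty strings")
    return {"inputs": {"question": question, "context": context}}


payload = build_payload(
    question="As what is Philipp working?",
    context="Hello, my name is Philipp and I live in Nuremberg, Germany.",
)
# Serialize to the JSON body that would be sent to the endpoint.
body = json.dumps(payload)
```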

Below is an example of how to send a request using Python and requests.

Run Request

import requests as r

ENDPOINT_URL = ""
HF_TOKEN = ""


def predict(question:str=None,context:str=None):
    payload = {"inputs": {"question": question, "context": context}}
    response = r.post(
        ENDPOINT_URL, headers={"Authorization": f"Bearer {HF_TOKEN}"}, json=payload
    )
    return response.json()


prediction = predict(
    question="As what is Philipp working?",
    context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science."
)

Expected output

{
    'score': 0.4749588668346405,
    'start': 88,
    'end': 102,
    'answer': 'Technical Lead'
}
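The `start` and `end` fields are character offsets into the `context` string (with `end` exclusive), so slicing the context with them should reproduce `answer`. A small check, using the values from the expected output above:

```python
context = (
    "Hello, my name is Philipp and I live in Nuremberg, Germany. "
    "Currently I am working as a Technical Lead at Hugging Face to democratize "
    "artificial intelligence through open source and open science."
)
prediction = {
    "score": 0.4749588668346405,
    "start": 88,
    "end": 102,
    "answer": "Technical Lead",
}

# Slice the context with the returned offsets to recover the answer text.
span = context[prediction["start"]:prediction["end"]]
print(span)  # Technical Lead
```

This can be handy when you want to highlight the answer inside the original text rather than display it on its own.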

Convert & Optimize model with Optimum

Steps:

  1. Convert model to ONNX
  2. Optimize & quantize model with Optimum
  3. Create Custom Handler for Inference Endpoints
  4. Test Custom Handler Locally
  5. Push to repository and create Inference Endpoint

Helpful links:

Setup & Installation

%%writefile requirements.txt
optimum[onnxruntime]==1.4.0
mkl-include
mkl

!pip install -r requirements.txt

0. Baseline Performance

from transformers import pipeline

qa = pipeline("question-answering",model="deepset/roberta-base-squad2")

Okay, let's test the performance (latency) with a sequence length of 128.

context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value." 
question="As what is Philipp working?" 

payload = {"inputs": {"question": question, "context": context}}

from time import perf_counter
import numpy as np 

def measure_latency(pipe,payload):
    latencies = []
    # warm up
    for _ in range(10):
        _ = pipe(question=payload["inputs"]["question"], context=payload["inputs"]["context"])
    # Timed run
    for _ in range(50):
        start_time = perf_counter()
        _ =  pipe(question=payload["inputs"]["question"], context=payload["inputs"]["context"])
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    return f"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}"

print(f"Vanilla model {measure_latency(qa,payload)}")
#     Vanilla model Average latency (ms) - 64.15 +/- 2.44
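Mean and standard deviation can hide tail latency. As a variation on `measure_latency` above, the sketch below also reports the median and the 95th percentile using only the standard library. `latency_stats` and its parameters are illustrative names; it accepts any zero-argument callable, so for the pipeline above you would pass something like `lambda: qa(question=question, context=context)`.

```python
import statistics
from time import perf_counter
from typing import Callable, Dict


def latency_stats(fn: Callable[[], object], warmup: int = 10, runs: int = 50) -> Dict[str, float]:
    """Time `fn` and return summary statistics in milliseconds."""
    # Warm up so one-time initialization costs are not measured.
    for _ in range(warmup):
        fn()
    latencies_ms = []
    for _ in range(runs):
        start = perf_counter()
        fn()
        latencies_ms.append(1000 * (perf_counter() - start))
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(latencies_ms, n=20)[-1]
    return {
        "mean_ms": statistics.mean(latencies_ms),
        "std_ms": statistics.pstdev(latencies_ms),  # population std, like np.std
        "median_ms": statistics.median(latencies_ms),
        "p95_ms": p95,
    }


# Stand-in workload so the sketch runs without loading a model.
stats = latency_stats(lambda: sum(range(1000)))
print(stats)
```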

1. Convert model to ONNX

from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer
from pathlib import Path


model_id="deepset/roberta-base-squad2"
onnx_path = Path(".")

# load vanilla transformers and convert to onnx
model = ORTModelForQuestionAnswering.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# save onnx checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)

2. Optimize & quantize model with Optimum

from optimum.onnxruntime import ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import OptimizationConfig, AutoQuantizationConfig

# Create the optimizer
optimizer = ORTOptimizer.from_pretrained(model)

# Define the optimization strategy by creating the appropriate configuration
optimization_config = OptimizationConfig(optimization_level=99) # enable all optimizations

# Optimize the model
optimizer.optimize(save_dir=onnx_path, optimization_config=optimization_config)

# create ORTQuantizer and define quantization configuration
dynamic_quantizer = ORTQuantizer.from_pretrained(onnx_path, file_name="model_optimized.onnx")
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# apply the quantization configuration to the model
model_quantized_path = dynamic_quantizer.quantize(
    save_dir=onnx_path,
    quantization_config=dqconfig,
)
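After these steps the output directory should contain several ONNX files. The sketch below lists their sizes so the effect of quantization on disk footprint can be compared. The file names `model_optimized.onnx` and `model_optimized_quantized.onnx` come from the code in this README; `model.onnx` is assumed to be the default export name, and `onnx_file_sizes` is a hypothetical helper.

```python
from pathlib import Path
from typing import Dict

# "model.onnx" is an assumed default; the other two names are used in this README.
EXPECTED_FILES = ("model.onnx", "model_optimized.onnx", "model_optimized_quantized.onnx")


def onnx_file_sizes(directory: Path) -> Dict[str, float]:
    """Return the size in MB of each expected ONNX file that exists in `directory`."""
    sizes = {}
    for name in EXPECTED_FILES:
        file_path = directory / name
        if file_path.is_file():
            sizes[name] = file_path.stat().st_size / 1e6
    return sizes


for name, size_mb in onnx_file_sizes(Path(".")).items():
    print(f"{name}: {size_mb:.1f} MB")
```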

3. Create Custom Handler for Inference Endpoints

%%writefile handler.py
from typing import  Dict, List, Any
from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer, pipeline


class EndpointHandler():
    def __init__(self, path=""):
        # load the optimized model
        self.model = ORTModelForQuestionAnswering.from_pretrained(path, file_name="model_optimized_quantized.onnx")
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        # create pipeline
        self.pipeline = pipeline("question-answering", model=self.model, tokenizer=self.tokenizer)

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        """
        Args:
            data (:obj:`dict`):
                includes the input data (question and context) for the inference.
        Return:
            A :obj:`dict` with the answer, its score, and its start and end
            offsets in the context (for a single question/context pair).
        """
        inputs = data.get("inputs", data)
        # run the model
        prediction = self.pipeline(**inputs)
        # return prediction
        return prediction
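The line `data.get("inputs", data)` means the handler accepts both the wrapped payload `{"inputs": {...}}` and a bare `{"question": ..., "context": ...}` dictionary. The sketch below isolates that behavior with a stand-in function; `extract_inputs` and `fake_pipeline` are illustrative names and no model is loaded.

```python
from typing import Any, Dict


def extract_inputs(data: Dict[str, Any]) -> Dict[str, Any]:
    # Same fallback as in the handler: use data["inputs"] if present, else data itself.
    return data.get("inputs", data)


def fake_pipeline(question: str, context: str) -> Dict[str, Any]:
    # Stand-in for the real pipeline; it only echoes the keyword arguments it received.
    return {"question": question, "context": context}


wrapped = {"inputs": {"question": "Who?", "context": "Philipp."}}
bare = {"question": "Who?", "context": "Philipp."}

# Both payload shapes unpack into the same keyword arguments.
result_wrapped = fake_pipeline(**extract_inputs(wrapped))
result_bare = fake_pipeline(**extract_inputs(bare))
```

Because the inputs are unpacked with `**`, any extra key in the payload is forwarded as a keyword argument as well, so unexpected keys may raise an error in the real pipeline.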

4. Test Custom Handler Locally

from handler import EndpointHandler

# init handler
my_handler = EndpointHandler(path=".")

# prepare sample payload
context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value." 
question="As what is Philipp working?" 

payload = {"inputs": {"question": question, "context": context}}

# test the handler
my_handler(payload)

from time import perf_counter
import numpy as np 

def measure_latency(handler,payload):
    latencies = []
    # warm up
    for _ in range(10):
        _ = handler(payload)
    # Timed run
    for _ in range(50):
        start_time = perf_counter()
        _ =  handler(payload)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    return f"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}"

print(f"Optimized & Quantized model {measure_latency(my_handler,payload)}")

Optimized & Quantized model Average latency (ms) - 29.90 +/- 0.53

Vanilla model Average latency (ms) - 64.15 +/- 2.44
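From the two measurements above, the speedup and the relative latency reduction follow directly. A short calculation using only the reported averages (these numbers are specific to the machine the measurements were taken on):

```python
vanilla_ms = 64.15
optimized_ms = 29.90

# Ratio of the two average latencies.
speedup = vanilla_ms / optimized_ms
# Share of the original latency that was removed.
reduction_pct = 100 * (1 - optimized_ms / vanilla_ms)

print(f"Speedup: {speedup:.2f}x")                  # Speedup: 2.15x
print(f"Latency reduction: {reduction_pct:.1f}%")  # Latency reduction: 53.4%
```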

5. Push to repository and create Inference Endpoint

# add all our new files
!git add * 
# commit our files
!git commit -m "add custom handler"
# push the files to the hub
!git push