---
license: mit
tags:
- endpoints-template
- optimum
library_name: generic
---

# Optimized and Quantized [deepset/roberta-base-squad2](https://huggingface.co/deepset/roberta-base-squad2) with a custom handler.py


This repository implements a `custom` handler for `question-answering` on 🤗 Inference Endpoints, using [🤗 Optimum](https://huggingface.co/docs/optimum/index) for accelerated inference. The code for the customized handler is in the [handler.py](https://huggingface.co/philschmid/roberta-base-squad2-optimized/blob/main/handler.py).

Below, we also describe how we converted & optimized the model, based on the [Accelerate Transformers with Hugging Face Optimum](https://huggingface.co/blog/optimum-inference) blog post. You can also check out the [notebook](https://huggingface.co/philschmid/roberta-base-squad2-optimized/blob/main/optimize_model.ipynb).

### Expected Request Payload

```json
{
    "inputs": {
        "question": "As what is Philipp working?", 
        "context": "Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
    }
}
```

Below is an example of how to run a request using Python and `requests`.

## Run Request 

```python
import requests as r

ENDPOINT_URL = ""  # URL of your Inference Endpoint
HF_TOKEN = ""  # Hugging Face token with access to the endpoint


def predict(question: str = None, context: str = None):
    payload = {"inputs": {"question": question, "context": context}}
    response = r.post(
        ENDPOINT_URL, headers={"Authorization": f"Bearer {HF_TOKEN}"}, json=payload
    )
    return response.json()


prediction = predict(
    question="As what is Philipp working?",
    context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science."
)
```

Expected output (`start` and `end` are the character offsets of the answer within the context):

```python
{
    'score': 0.4749588668346405,
    'start': 88,
    'end': 102,
    'answer': 'Technical Lead'
}
```



# Convert & Optimize model with Optimum 

Steps:
1. [Convert model to ONNX](#1-convert-model-to-onnx)
2. [Optimize & quantize model with Optimum](#2-optimize--quantize-model-with-optimum)
3. [Create Custom Handler for Inference Endpoints](#3-create-custom-handler-for-inference-endpoints)
4. [Test Custom Handler Locally](#4-test-custom-handler-locally)
5. [Push to repository and create Inference Endpoint](#5-push-to-repository-and-create-inference-endpoint)

Helpful links:
* [Accelerate Transformers with Hugging Face Optimum](https://huggingface.co/blog/optimum-inference)
* [Optimizing Transformers for GPUs with Optimum](https://www.philschmid.de/optimizing-transformers-with-optimum-gpu)
* [Optimum Documentation](https://huggingface.co/docs/optimum/onnxruntime/modeling_ort)
* [Create Custom Handler Endpoints](https://link-to-docs)

## Setup & Installation


```python
%%writefile requirements.txt
optimum[onnxruntime]==1.4.0
mkl-include
mkl
```


```python
!pip install -r requirements.txt
```

## 0. Baseline Performance


```python
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
```

Okay, let's test the performance (latency) with a sequence length of 128.


```python
context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value." 
question="As what is Philipp working?" 

payload = {"inputs": {"question": question, "context": context}}
```
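Before timing, it is worth running the vanilla pipeline once on the sample payload as a sanity check; per the request example above, the expected answer is `Technical Lead` (the exact score will vary).

```python
# sanity check: run the vanilla pipeline once on the sample payload
pred = qa(question=payload["inputs"]["question"], context=payload["inputs"]["context"])
print(pred)
# expected something like: {'score': ..., 'start': 88, 'end': 102, 'answer': 'Technical Lead'}
```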


```python
from time import perf_counter
import numpy as np 

def measure_latency(pipe, payload):
    latencies = []
    # warm up
    for _ in range(10):
        _ = pipe(question=payload["inputs"]["question"], context=payload["inputs"]["context"])
    # Timed run
    for _ in range(50):
        start_time = perf_counter()
        _ = pipe(question=payload["inputs"]["question"], context=payload["inputs"]["context"])
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    return f"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}"

print(f"Vanilla model {measure_latency(qa, payload)}")
# Vanilla model Average latency (ms) - 64.15 +/- 2.44
```



## 1. Convert model to ONNX


```python
from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer
from pathlib import Path


model_id="deepset/roberta-base-squad2"
onnx_path = Path(".")

# load vanilla transformers and convert to onnx
model = ORTModelForQuestionAnswering.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# save onnx checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)
```
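Since `ORTModelForQuestionAnswering` is a drop-in replacement for its transformers counterpart, a quick sanity check is to wrap the exported model in the same `pipeline` API and confirm the prediction matches the vanilla model. A minimal sketch, using the `question` and `context` defined above:

```python
from transformers import pipeline

# sanity check: the ONNX model plugs straight into the pipeline API
onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
print(onnx_qa(question=question, context=context))
# the answer should match the vanilla model, e.g. 'Technical Lead'
```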


## 2. Optimize & quantize model with Optimum


```python
from optimum.onnxruntime import ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import OptimizationConfig, AutoQuantizationConfig

# Create the optimizer
optimizer = ORTOptimizer.from_pretrained(model)

# Define the optimization strategy by creating the appropriate configuration
optimization_config = OptimizationConfig(optimization_level=99) # enable all optimizations

# Optimize the model
optimizer.optimize(save_dir=onnx_path, optimization_config=optimization_config)
```
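`optimizer.optimize` writes `model_optimized.onnx` next to the original export; a quick way to confirm is to list the ONNX files in the save directory:

```python
# list the ONNX files produced so far
print(sorted(f.name for f in onnx_path.glob("*.onnx")))
# expected something like: ['model.onnx', 'model_optimized.onnx']
```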


```python
# create ORTQuantizer and define quantization configuration
dynamic_quantizer = ORTQuantizer.from_pretrained(onnx_path, file_name="model_optimized.onnx")
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# apply the quantization configuration to the model
model_quantized_path = dynamic_quantizer.quantize(
    save_dir=onnx_path,
    quantization_config=dqconfig,
)

```
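Dynamic quantization stores the weights in int8, so the quantized model should be noticeably smaller on disk. A minimal sketch to compare file sizes (the exact numbers will vary):

```python
import os

# compare the optimized model with its quantized counterpart
size_fp32 = os.path.getsize(onnx_path / "model_optimized.onnx") / (1024 * 1024)
size_int8 = os.path.getsize(onnx_path / "model_optimized_quantized.onnx") / (1024 * 1024)
print(f"Optimized model:  {size_fp32:.2f} MB")
print(f"Quantized model:  {size_int8:.2f} MB")
```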

## 3. Create Custom Handler for Inference Endpoints



```python
%%writefile handler.py
from typing import Any, Dict
from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer, pipeline


class EndpointHandler:
    def __init__(self, path=""):
        # load the optimized model
        self.model = ORTModelForQuestionAnswering.from_pretrained(path, file_name="model_optimized_quantized.onnx")
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        # create pipeline
        self.pipeline = pipeline("question-answering", model=self.model, tokenizer=self.tokenizer)

    def __call__(self, data: Any) -> Dict[str, Any]:
        """
        Args:
            data (:obj:`dict`):
                includes the input data for the inference.
        Return:
            A :obj:`dict` containing the answer, score, and character span of the prediction.
        """
        inputs = data.get("inputs", data)
        # run the model
        prediction = self.pipeline(**inputs)
        # return prediction
        return prediction
```

## 4. Test Custom Handler Locally



```python
from handler import EndpointHandler

# init handler
my_handler = EndpointHandler(path=".")

# prepare sample payload
context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value." 
question="As what is Philipp working?" 

payload = {"inputs": {"question": question, "context": context}}

# test the handler
my_handler(payload)
```


```python
from time import perf_counter
import numpy as np 

def measure_latency(handler, payload):
    latencies = []
    # warm up
    for _ in range(10):
        _ = handler(payload)
    # Timed run
    for _ in range(50):
        start_time = perf_counter()
        _ = handler(payload)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    return f"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}"

print(f"Optimized & Quantized model {measure_latency(my_handler, payload)}")
# Optimized & Quantized model Average latency (ms) - 29.90 +/- 0.53
```

`Optimized & Quantized model Average latency (ms) - 29.90 +\- 0.53`
`Vanilla model Average latency (ms) - 64.15 +\- 2.44`

## 5. Push to repository and create Inference Endpoint



```python
# add all our new files
!git add * 
# commit our files
!git commit -m "add custom handler"
# push the files to the hub
!git push
```
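
To create the endpoint, select this repository in the Inference Endpoints UI. Alternatively, recent versions of `huggingface_hub` (newer than anything pinned in this notebook) expose a programmatic API; the sketch below is a hedged example, and the vendor, region, and instance values are illustrative assumptions you should adapt to your account.

```python
from huggingface_hub import create_inference_endpoint

# illustrative values only: adjust name, vendor, region, and instance to your setup
endpoint = create_inference_endpoint(
    "roberta-base-squad2-optimized",
    repository="philschmid/roberta-base-squad2-optimized",
    framework="pytorch",
    task="question-answering",
    accelerator="cpu",
    vendor="aws",
    region="us-east-1",
    instance_size="x2",
    instance_type="intel-icl",
)
endpoint.wait()  # block until the endpoint is running
print(endpoint.url)
```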