---
license: mit
tags:
- endpoints-template
- optimum
library_name: generic
---
# Optimized and Quantized [deepset/roberta-base-squad2](https://huggingface.co/deepset/roberta-base-squad2) with a custom handler.py
This repository implements a `custom` handler for `question-answering` for 🤗 Inference Endpoints, accelerating inference with [🤗 Optimum](https://huggingface.co/docs/optimum/index). The code for the customized handler is in [handler.py](https://huggingface.co/philschmid/roberta-base-squad2-optimized/blob/main/handler.py).
Below we also describe how we converted and optimized the model, based on the [Accelerate Transformers with Hugging Face Optimum](https://huggingface.co/blog/optimum-inference) blog post. You can also check out the [notebook](https://huggingface.co/philschmid/roberta-base-squad2-optimized/blob/main/optimize_model.ipynb).
### Expected request payload
```json
{
"inputs": {
"question": "As what is Philipp working?",
"context": "Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
}
}
```
Below is an example of how to run a request using Python and `requests`.
## Run Request
```python
import requests as r

ENDPOINT_URL = ""
HF_TOKEN = ""


def predict(question: str = None, context: str = None):
    payload = {"inputs": {"question": question, "context": context}}
    response = r.post(
        ENDPOINT_URL, headers={"Authorization": f"Bearer {HF_TOKEN}"}, json=payload
    )
    return response.json()


prediction = predict(
    question="As what is Philipp working?",
    context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science."
)
```
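If `requests` is not installed, the same call can be sketched with the standard library. This is only an illustration: `build_request` and `predict_stdlib` are hypothetical helper names, and the 30-second timeout is an assumed default, not something the endpoint requires.

```python
import json
import urllib.request

ENDPOINT_URL = ""  # fill in your endpoint URL
HF_TOKEN = ""      # fill in your access token


def build_request(url: str, token: str, question: str, context: str) -> urllib.request.Request:
    """Build the POST request with the payload shape shown above."""
    body = json.dumps({"inputs": {"question": question, "context": context}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )


def predict_stdlib(question: str, context: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response; urlopen raises on HTTP errors."""
    request = build_request(ENDPOINT_URL, HF_TOKEN, question, context)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Splitting request construction from sending makes it possible to inspect the headers and body without touching the network.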
Expected output:
```python
{
'score': 0.4749588668346405,
'start': 88,
'end': 102,
'answer': 'Technical Lead'
}
```
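The `start` and `end` fields appear to be character offsets into `context`: in the sample above, characters 88 to 102 of the context spell the answer. A minimal client-side sanity check under that assumption (`answer_matches_span` is a hypothetical helper, not part of the handler):

```python
def answer_matches_span(prediction: dict, context: str) -> bool:
    """Return True if context[start:end] equals the returned answer."""
    start, end = prediction["start"], prediction["end"]
    return context[start:end] == prediction["answer"]


context = (
    "Hello, my name is Philipp and I live in Nuremberg, Germany. "
    "Currently I am working as a Technical Lead at Hugging Face to democratize "
    "artificial intelligence through open source and open science."
)
# Sample response copied from the expected output above
prediction = {"score": 0.4749588668346405, "start": 88, "end": 102, "answer": "Technical Lead"}

print(answer_matches_span(prediction, context))
```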
# Convert & Optimize model with Optimum
Steps:
1. [Convert model to ONNX](#1-convert-model-to-onnx)
2. [Optimize & quantize model with Optimum](#2-optimize--quantize-model-with-optimum)
3. [Create Custom Handler for Inference Endpoints](#3-create-custom-handler-for-inference-endpoints)
4. [Test Custom Handler Locally](#4-test-custom-handler-locally)
5. [Push to repository and create Inference Endpoint](#5-push-to-repository-and-create-inference-endpoint)
Helpful links:
* [Accelerate Transformers with Hugging Face Optimum](https://huggingface.co/blog/optimum-inference)
* [Optimizing Transformers for GPUs with Optimum](https://www.philschmid.de/optimizing-transformers-with-optimum-gpu)
* [Optimum Documentation](https://huggingface.co/docs/optimum/onnxruntime/modeling_ort)
* [Create Custom Handler Endpoints](https://link-to-docs)
## Setup & Installation
```python
%%writefile requirements.txt
optimum[onnxruntime]==1.4.0
mkl-include
mkl
```
```python
!pip install -r requirements.txt
```
## 0. Baseline Performance
```python
from transformers import pipeline
qa = pipeline("question-answering",model="deepset/roberta-base-squad2")
```
Okay, let's test the performance (latency) with a sequence length of 128.
```python
context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
question="As what is Philipp working?"
payload = {"inputs": {"question": question, "context": context}}
```
```python
from time import perf_counter
import numpy as np


def measure_latency(pipe, payload):
    latencies = []
    # warm up
    for _ in range(10):
        _ = pipe(question=payload["inputs"]["question"], context=payload["inputs"]["context"])
    # Timed run
    for _ in range(50):
        start_time = perf_counter()
        _ = pipe(question=payload["inputs"]["question"], context=payload["inputs"]["context"])
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    # "\\-" prints as "\-" and avoids an invalid escape sequence warning
    return f"Average latency (ms) - {time_avg_ms:.2f} +\\- {time_std_ms:.2f}"


print(f"Vanilla model {measure_latency(qa,payload)}")
# Vanilla model Average latency (ms) - 64.15 +\- 2.44
```
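Mean and standard deviation can hide tail latency. The sketch below is an optional addition, not part of the original benchmark: it times any zero-argument callable and reports the median, 95th percentile, and maximum. `latency_percentiles` is a hypothetical helper, and the workload is a stand-in so the snippet runs without a model.

```python
import statistics
from time import perf_counter
from typing import Callable, Dict


def latency_percentiles(fn: Callable[[], object], warmup: int = 10, runs: int = 50) -> Dict[str, float]:
    """Time `fn` and return median, p95 and max latency in milliseconds."""
    for _ in range(warmup):
        fn()
    latencies_ms = []
    for _ in range(runs):
        start = perf_counter()
        fn()
        latencies_ms.append(1000 * (perf_counter() - start))
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile
    p95 = statistics.quantiles(latencies_ms, n=20, method="inclusive")[-1]
    return {"p50_ms": statistics.median(latencies_ms), "p95_ms": p95, "max_ms": max(latencies_ms)}


# Stand-in workload; swap in e.g. `lambda: qa(question=question, context=context)`
stats = latency_percentiles(lambda: sum(range(10_000)))
print(stats)
```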
## 1. Convert model to ONNX
```python
from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer
from pathlib import Path
model_id="deepset/roberta-base-squad2"
onnx_path = Path(".")
# load vanilla transformers and convert to onnx
model = ORTModelForQuestionAnswering.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# save onnx checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)
```
## 2. Optimize & quantize model with Optimum
```python
from optimum.onnxruntime import ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import OptimizationConfig, AutoQuantizationConfig
# Create the optimizer
optimizer = ORTOptimizer.from_pretrained(model)
# Define the optimization strategy by creating the appropriate configuration
optimization_config = OptimizationConfig(optimization_level=99) # enable all optimizations
# Optimize the model
optimizer.optimize(save_dir=onnx_path, optimization_config=optimization_config)
```
```python
# create ORTQuantizer and define quantization configuration
dynamic_quantizer = ORTQuantizer.from_pretrained(onnx_path, file_name="model_optimized.onnx")
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
# apply the quantization configuration to the model
model_quantized_path = dynamic_quantizer.quantize(
    save_dir=onnx_path,
    quantization_config=dqconfig,
)
```
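After optimization and quantization it can be useful to compare the size of each ONNX file on disk, since int8 quantization usually shrinks the model. The helper below is a small sketch that lists every `*.onnx` file in a directory; `onnx_sizes_mb` is a hypothetical name, and no particular file names or sizes are assumed.

```python
from pathlib import Path
from typing import Dict


def onnx_sizes_mb(directory: Path) -> Dict[str, float]:
    """Map each *.onnx file name in `directory` to its size in megabytes."""
    return {
        path.name: path.stat().st_size / 1e6
        for path in sorted(Path(directory).glob("*.onnx"))
    }


# Prints nothing if the directory holds no ONNX files yet
for name, size_mb in onnx_sizes_mb(Path(".")).items():
    print(f"{name}: {size_mb:.1f} MB")
```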
## 3. Create Custom Handler for Inference Endpoints
```python
%%writefile handler.py
from typing import Any, Dict

from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer, pipeline


class EndpointHandler():
    def __init__(self, path=""):
        # load the optimized model
        self.model = ORTModelForQuestionAnswering.from_pretrained(path, file_name="model_optimized_quantized.onnx")
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        # create pipeline
        self.pipeline = pipeline("question-answering", model=self.model, tokenizer=self.tokenizer)

    def __call__(self, data: Any) -> Dict[str, Any]:
        """
        Args:
            data (:obj:`dict`):
                includes the input data and the parameters for the inference.
        Return:
            A :obj:`dict` with the `score`, `start`, `end` and `answer` of the prediction
        """
        inputs = data.get("inputs", data)
        # run the model
        prediction = self.pipeline(**inputs)
        # return prediction
        return prediction
```
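The handler above forwards whatever it finds under `inputs` straight to the pipeline, so a malformed payload only fails inside the pipeline call. One possible hardening step, shown here as a sketch and not as part of the published `handler.py`, is to validate the payload first. `extract_qa_inputs` is a hypothetical helper.

```python
from typing import Any, Dict


def extract_qa_inputs(data: Dict[str, Any]) -> Dict[str, Any]:
    """Mirror the handler's `data.get("inputs", data)` and validate the result."""
    inputs = data.get("inputs", data)
    if not isinstance(inputs, dict):
        raise ValueError('"inputs" must be an object with "question" and "context"')
    missing = [key for key in ("question", "context") if not inputs.get(key)]
    if missing:
        raise ValueError(f"missing or empty field(s): {', '.join(missing)}")
    return inputs


print(extract_qa_inputs({"inputs": {"question": "As what is Philipp working?", "context": "..."}}))
```

Inside `__call__`, the result of `extract_qa_inputs(data)` could replace the `data.get("inputs", data)` line.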
## 4. Test Custom Handler Locally
```python
from handler import EndpointHandler
# init handler
my_handler = EndpointHandler(path=".")
# prepare sample payload
context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
question="As what is Philipp working?"
payload = {"inputs": {"question": question, "context": context}}
# test the handler
my_handler(payload)
```
```python
from time import perf_counter
import numpy as np


def measure_latency(handler, payload):
    latencies = []
    # warm up
    for _ in range(10):
        _ = handler(payload)
    # Timed run
    for _ in range(50):
        start_time = perf_counter()
        _ = handler(payload)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    # "\\-" prints as "\-" and avoids an invalid escape sequence warning
    return f"Average latency (ms) - {time_avg_ms:.2f} +\\- {time_std_ms:.2f}"


print(f"Optimized & Quantized model {measure_latency(my_handler,payload)}")
```
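`Optimized & Quantized model Average latency (ms) - 29.90 +\- 0.53`

`Vanilla model Average latency (ms) - 64.15 +\- 2.44`

The optimized and quantized model cuts average latency from 64.15 ms to 29.90 ms, roughly a 2.1x speedup.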
## 5. Push to repository and create Inference Endpoint
```python
# add all our new files
!git add *
# commit our files
!git commit -m "add custom handler"
# push the files to the hub
!git push
```
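Before pushing, it may help to confirm that the files produced in the earlier steps are actually present. The check below is a sketch: the file names come from the steps above, `missing_files` is a hypothetical helper, and the list is not a complete statement of what Inference Endpoints requires (tokenizer and config files are not checked here).

```python
from pathlib import Path
from typing import Iterable, List

# File names taken from the steps above
REQUIRED_FILES = ("handler.py", "requirements.txt", "model_optimized_quantized.onnx")


def missing_files(repo_dir: Path, required: Iterable[str] = REQUIRED_FILES) -> List[str]:
    """Return the required file names that do not exist in `repo_dir`."""
    return [name for name in required if not (Path(repo_dir) / name).is_file()]


missing = missing_files(Path("."))
print("ready to push" if not missing else f"missing: {missing}")
```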