---
license: mit
tags:
- endpoints-template
- optimum
library_name: generic
---

# Optimized and Quantized [deepset/roberta-base-squad2](https://huggingface.co/deepset/roberta-base-squad2) with a custom handler.py


This repository implements a `custom` handler for `question-answering` on 🤗 Inference Endpoints, using [🤗 Optimum](https://huggingface.co/docs/optimum/index) for accelerated inference. The code for the customized handler is in the [handler.py](https://huggingface.co/philschmid/roberta-base-squad2-optimized/blob/main/handler.py).

Below, we also describe how we converted & optimized the model, based on the [Accelerate Transformers with Hugging Face Optimum](https://huggingface.co/blog/optimum-inference) blog post. You can also check out the [notebook](https://huggingface.co/philschmid/roberta-base-squad2-optimized/blob/main/optimize_model.ipynb).

### Expected Request Payload

```json
{
    "inputs": {
        "question": "As what is Philipp working?", 
        "context": "Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
    }
}
```

Below is an example of how to run a request using Python and `requests`.

## Run Request 

```python
import requests as r

ENDPOINT_URL = ""  # URL of your Inference Endpoint
HF_TOKEN = ""  # Hugging Face token with access to the endpoint


def predict(question: str = None, context: str = None):
    payload = {"inputs": {"question": question, "context": context}}
    response = r.post(
        ENDPOINT_URL, headers={"Authorization": f"Bearer {HF_TOKEN}"}, json=payload
    )
    return response.json()


prediction = predict(
    question="As what is Philipp working?",
    context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science."
)
```

Expected output (`start` and `end` are the character offsets of the answer within the context):

```python
{
    'score': 0.4749588668346405,
    'start': 88,
    'end': 102,
    'answer': 'Technical Lead'
}
```



# Convert & Optimize model with Optimum 

Steps:
1. [Convert model to ONNX](#1-convert-model-to-onnx)
2. [Optimize & quantize model with Optimum](#2-optimize--quantize-model-with-optimum)
3. [Create Custom Handler for Inference Endpoints](#3-create-custom-handler-for-inference-endpoints)
4. [Test Custom Handler Locally](#4-test-custom-handler-locally)
5. [Push to repository and create Inference Endpoint](#5-push-to-repository-and-create-inference-endpoint)

Helpful links:
* [Accelerate Transformers with Hugging Face Optimum](https://huggingface.co/blog/optimum-inference)
* [Optimizing Transformers for GPUs with Optimum](https://www.philschmid.de/optimizing-transformers-with-optimum-gpu)
* [Optimum Documentation](https://huggingface.co/docs/optimum/onnxruntime/modeling_ort)
* [Create Custom Handler Endpoints](https://link-to-docs)

## Setup & Installation


```python
%%writefile requirements.txt
optimum[onnxruntime]==1.4.0
mkl-include
mkl
```


```python
!pip install -r requirements.txt
```

## 0. Baseline Performance


```python
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
```

Okay, let's test the performance (latency) with a sequence length of 128.


```python
context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value." 
question="As what is Philipp working?" 

payload = {"inputs": {"question": question, "context": context}}
```
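Before timing, it is worth running the vanilla pipeline once on the sample payload as a sanity check; per the request example above, the expected answer is `Technical Lead` (the exact score will vary).

```python
# sanity check: run the vanilla pipeline once on the sample payload
pred = qa(question=payload["inputs"]["question"], context=payload["inputs"]["context"])
print(pred)
# expected something like: {'score': ..., 'start': 88, 'end': 102, 'answer': 'Technical Lead'}
```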


```python
from time import perf_counter
import numpy as np 

def measure_latency(pipe, payload):
    latencies = []
    # warm up
    for _ in range(10):
        _ = pipe(question=payload["inputs"]["question"], context=payload["inputs"]["context"])
    # Timed run
    for _ in range(50):
        start_time = perf_counter()
        _ = pipe(question=payload["inputs"]["question"], context=payload["inputs"]["context"])
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    return f"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}"

print(f"Vanilla model {measure_latency(qa, payload)}")
# Vanilla model Average latency (ms) - 64.15 +/- 2.44
```



## 1. Convert model to ONNX


```python
from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer
from pathlib import Path


model_id="deepset/roberta-base-squad2"
onnx_path = Path(".")

# load vanilla transformers and convert to onnx
model = ORTModelForQuestionAnswering.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# save onnx checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)
```
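Since `ORTModelForQuestionAnswering` is a drop-in replacement for its transformers counterpart, a quick sanity check is to wrap the exported model in the same `pipeline` API and confirm the prediction matches the vanilla model. A minimal sketch, using the `question` and `context` defined above:

```python
from transformers import pipeline

# sanity check: the ONNX model plugs straight into the pipeline API
onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
print(onnx_qa(question=question, context=context))
# the answer should match the vanilla model, e.g. 'Technical Lead'
```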


## 2. Optimize & quantize model with Optimum


```python
from optimum.onnxruntime import ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import OptimizationConfig, AutoQuantizationConfig

# Create the optimizer
optimizer = ORTOptimizer.from_pretrained(model)

# Define the optimization strategy by creating the appropriate configuration
optimization_config = OptimizationConfig(optimization_level=99) # enable all optimizations

# Optimize the model
optimizer.optimize(save_dir=onnx_path, optimization_config=optimization_config)
```
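`optimizer.optimize` writes `model_optimized.onnx` next to the original export; a quick way to confirm is to list the ONNX files in the save directory:

```python
# list the ONNX files produced so far
print(sorted(f.name for f in onnx_path.glob("*.onnx")))
# expected something like: ['model.onnx', 'model_optimized.onnx']
```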


```python
# create ORTQuantizer and define quantization configuration
dynamic_quantizer = ORTQuantizer.from_pretrained(onnx_path, file_name="model_optimized.onnx")
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# apply the quantization configuration to the model
model_quantized_path = dynamic_quantizer.quantize(
    save_dir=onnx_path,
    quantization_config=dqconfig,
)

```
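Dynamic quantization stores the weights in int8, so the quantized model should be noticeably smaller on disk. A minimal sketch to compare file sizes (the exact numbers will vary):

```python
import os

# compare the optimized model with its quantized counterpart
size_fp32 = os.path.getsize(onnx_path / "model_optimized.onnx") / (1024 * 1024)
size_int8 = os.path.getsize(onnx_path / "model_optimized_quantized.onnx") / (1024 * 1024)
print(f"Optimized model:  {size_fp32:.2f} MB")
print(f"Quantized model:  {size_int8:.2f} MB")
```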

## 3. Create Custom Handler for Inference Endpoints



```python
%%writefile handler.py
from typing import Any, Dict
from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer, pipeline


class EndpointHandler:
    def __init__(self, path=""):
        # load the optimized model
        self.model = ORTModelForQuestionAnswering.from_pretrained(path, file_name="model_optimized_quantized.onnx")
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        # create pipeline
        self.pipeline = pipeline("question-answering", model=self.model, tokenizer=self.tokenizer)

    def __call__(self, data: Any) -> Dict[str, Any]:
        """
        Args:
            data (:obj:`dict`):
                includes the input data for the inference.
        Return:
            A :obj:`dict` containing the answer, score, and character span of the prediction.
        """
        inputs = data.get("inputs", data)
        # run the model
        prediction = self.pipeline(**inputs)
        # return prediction
        return prediction
```

## 4. Test Custom Handler Locally



```python
from handler import EndpointHandler

# init handler
my_handler = EndpointHandler(path=".")

# prepare sample payload
context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value." 
question="As what is Philipp working?" 

payload = {"inputs": {"question": question, "context": context}}

# test the handler
my_handler(payload)
```


```python
from time import perf_counter
import numpy as np 

def measure_latency(handler, payload):
    latencies = []
    # warm up
    for _ in range(10):
        _ = handler(payload)
    # Timed run
    for _ in range(50):
        start_time = perf_counter()
        _ = handler(payload)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    return f"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}"

print(f"Optimized & Quantized model {measure_latency(my_handler, payload)}")
# Optimized & Quantized model Average latency (ms) - 29.90 +/- 0.53
```

`Optimized & Quantized model Average latency (ms) - 29.90 +\- 0.53`
`Vanilla model Average latency (ms) - 64.15 +\- 2.44`

## 5. Push to repository and create Inference Endpoint



```python
# add all our new files
!git add * 
# commit our files
!git commit -m "add custom handler"
# push the files to the hub
!git push
```
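
To create the endpoint, select this repository in the Inference Endpoints UI. Alternatively, recent versions of `huggingface_hub` (newer than anything pinned in this notebook) expose a programmatic API; the sketch below is a hedged example, and the vendor, region, and instance values are illustrative assumptions you should adapt to your account.

```python
from huggingface_hub import create_inference_endpoint

# illustrative values only: adjust name, vendor, region, and instance to your setup
endpoint = create_inference_endpoint(
    "roberta-base-squad2-optimized",
    repository="philschmid/roberta-base-squad2-optimized",
    framework="pytorch",
    task="question-answering",
    accelerator="cpu",
    vendor="aws",
    region="us-east-1",
    instance_size="x2",
    instance_type="intel-icl",
)
endpoint.wait()  # block until the endpoint is running
print(endpoint.url)
```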