philschmid committed
Commit 141e879
1 Parent(s): a854397

add custom handler

Files changed (2)
  1. README.md +145 -72
  2. optimize_model.ipynb +46 -4
README.md CHANGED
@@ -7,20 +7,21 @@ tags:
7
  library_name: generic
8
  ---
9
 
10
- # Optimized and Quantized [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) with a custom pipeline.py
11
 
12
 
13
- This repository implements a `custom` task for `sentence-embeddings` for 🤗 Inference Endpoints for accelerated inference using [🤗 Optimum](https://huggingface.co/docs/optimum/index). The code for the customized pipeline is in the [pipeline.py](https://huggingface.co/philschmid/all-MiniLM-L6-v2-optimum-embeddings/blob/main/pipeline.py).
14
 
15
- Below we also describe how we converted & optimized the model, based on the [Accelerate Sentence Transformers with Hugging Face Optimum](https://www.philschmid.de/optimize-sentence-transformers) blog post. You can also check out the [notebook](https://huggingface.co/philschmid/all-MiniLM-L6-v2-optimum-embeddings/blob/main/convert.ipynb).
16
-
17
- To deploy this model as an Inference Endpoint, you have to select `Custom` as the task so that the `pipeline.py` file is used. -> _double check if it is selected_
18
 
19
### Expected request payload
20
 
21
  ```json
22
  {
23
- "inputs": "The sky is a blue today and not gray",
 
 
 
24
  }
25
  ```
26
 
@@ -38,9 +39,8 @@ ENDPOINT_URL = ""
38
  HF_TOKEN = ""
39
 
40
 
41
- def predict(document_string:str=None):
42
-
43
- payload = {"inputs": document_string}
44
  response = r.post(
45
  ENDPOINT_URL, headers={"Authorization": f"Bearer {HF_TOKEN}"}, json=payload
46
  )
@@ -48,65 +48,114 @@ def predict(document_string:str=None):
48
 
49
 
50
  prediction = predict(
51
- document_string="The sky is blue today and not gray"
52
  )
53
  ```
54
 
55
  expected output
56
 
57
  ```python
58
- {'embeddings': [[-0.021580450236797333,
59
- 0.021715054288506508,
60
- 0.00979710929095745,
61
- -0.0005379787762649357,
62
- 0.04682469740509987,
63
- -0.013600599952042103,
64
- ...
65
  }
66
  ```
67
 
68
 
69
 
70
- ## How to create your own optimized and quantized model
71
 
72
  Steps:
73
- [1. Convert model to ONNX](#1-convert-model-to-onnx)
74
- [2. Optimize & quantize model with Optimum](#2-optimize--quantize-model-with-optimum)
75
- [3. Create Custom Handler for Inference Endpoints](#3-create-custom-handler-for-inference-endpoints)
 
 
76
 
77
  Helpful links:
78
- * [Accelerate Sentence Transformers with Hugging Face Optimum](https://www.philschmid.de/optimize-sentence-transformers)
 
 
79
  * [Create Custom Handler Endpoints](https://link-to-docs)
80
 
81
  ## Setup & Installation
82
 
 
83
  ```python
84
  %%writefile requirements.txt
85
- optimum[onnxruntime]==1.3.0
86
  mkl-include
87
  mkl
88
  ```
89
 
90
- install requirements
91
 
92
  ```python
93
  !pip install -r requirements.txt
94
  ```
95
96
  ## 1. Convert model to ONNX
97
 
98
 
99
  ```python
100
- from optimum.onnxruntime import ORTModelForFeatureExtraction
101
  from transformers import AutoTokenizer
102
  from pathlib import Path
103
 
104
 
105
- model_id="sentence-transformers/all-MiniLM-L6-v2"
106
  onnx_path = Path(".")
107
 
108
  # load vanilla transformers and convert to onnx
109
- model = ORTModelForFeatureExtraction.from_pretrained(model_id, from_transformers=True)
110
  tokenizer = AutoTokenizer.from_pretrained(model_id)
111
 
112
  # save onnx checkpoint and tokenizer
@@ -122,55 +171,48 @@ tokenizer.save_pretrained(onnx_path)
122
  from optimum.onnxruntime import ORTOptimizer, ORTQuantizer
123
  from optimum.onnxruntime.configuration import OptimizationConfig, AutoQuantizationConfig
124
 
125
- # create ORTOptimizer and define optimization configuration
126
- optimizer = ORTOptimizer.from_pretrained(model_id, feature=model.pipeline_task)
 
 
127
  optimization_config = OptimizationConfig(optimization_level=99) # enable all optimizations
128
 
129
- # apply the optimization configuration to the model
130
- optimizer.export(
131
- onnx_model_path=onnx_path / "model.onnx",
132
- onnx_optimized_model_output_path=onnx_path / "model-optimized.onnx",
133
- optimization_config=optimization_config,
134
- )
135
 
136
 
 
137
  # create ORTQuantizer and define quantization configuration
138
- dynamic_quantizer = ORTQuantizer.from_pretrained(model_id, feature=model.pipeline_task)
139
  dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
140
 
141
  # apply the quantization configuration to the model
142
- model_quantized_path = dynamic_quantizer.export(
143
- onnx_model_path=onnx_path / "model-optimized.onnx",
144
- onnx_quantized_model_output_path=onnx_path / "model-quantized.onnx",
145
  quantization_config=dqconfig,
146
  )
147
 
148
-
149
  ```
150
 
151
  ## 3. Create Custom Handler for Inference Endpoints
152
 
153
 
 
154
  ```python
155
- %%writefile pipeline.py
156
  from typing import Dict, List, Any
157
- from optimum.onnxruntime import ORTModelForFeatureExtraction
158
- from transformers import AutoTokenizer
159
- import torch.nn.functional as F
160
- import torch
161
-
162
- # copied from the model card
163
- def mean_pooling(model_output, attention_mask):
164
- token_embeddings = model_output[0] #First element of model_output contains all token embeddings
165
- input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
166
- return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
167
 
168
 
169
- class PreTrainedPipeline():
170
  def __init__(self, path=""):
171
  # load the optimized model
172
- self.model = ORTModelForFeatureExtraction.from_pretrained(path, file_name="model-quantized.onnx")
173
  self.tokenizer = AutoTokenizer.from_pretrained(path)
 
 
174
 
175
  def __call__(self, data: Any) -> List[List[Dict[str, float]]]:
176
  """
@@ -178,42 +220,73 @@ class PreTrainedPipeline():
178
  data (:obj:):
179
  includes the input data and the parameters for the inference.
180
  Return:
181
- A :obj:`list`:. The list contains the embeddings of the inference inputs
182
  """
183
  inputs = data.get("inputs", data)
184
-
185
- # tokenize the input
186
- encoded_inputs = self.tokenizer(inputs, padding=True, truncation=True, return_tensors='pt')
187
  # run the model
188
- outputs = self.model(**encoded_inputs)
189
- # Perform pooling
190
- sentence_embeddings = mean_pooling(outputs, encoded_inputs['attention_mask'])
191
- # Normalize embeddings
192
- sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
193
- # postprocess the prediction
194
- return {"embeddings": sentence_embeddings.tolist()}
195
  ```
196
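As a quick illustration of what the `mean_pooling` helper above computes, here is an editorial toy example (not part of the repository): it averages the token embeddings over the sequence dimension while masking out padded positions.

```python
import torch

# toy batch: 1 sequence, 4 tokens, hidden size 2; the last token is padding
token_embeddings = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]])
attention_mask = torch.tensor([[1, 1, 1, 0]])

# same computation as mean_pooling(model_output, attention_mask)
mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
sentence_embedding = torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)
print(sentence_embedding)  # tensor([[3., 4.]]) -> mean over the three real tokens
```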
 
197
- test custom pipeline
 
198
 
199
 
200
  ```python
201
- from pipeline import PreTrainedPipeline
202
 
203
  # init handler
204
- my_handler = PreTrainedPipeline(path=".")
205
 
206
  # prepare sample payload
207
- request = {"inputs": "I am quite excited how this will turn out"}
 
208
 
209
- # test the handler
210
- %timeit my_handler(request)
211
 
 
 
212
  ```
213
 
214
- results
215
 
216
  ```
217
- 1.55 ms ± 2.04 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
218
  ```
219
 
7
  library_name: generic
8
  ---
9
 
10
+ # Optimized and Quantized [deepset/roberta-base-squad2](https://huggingface.co/deepset/roberta-base-squad2) with a custom handler.py
11
 
12
 
13
+ This repository implements a `custom` handler for `question-answering` for 🤗 Inference Endpoints for accelerated inference using [🤗 Optimum](https://huggingface.co/docs/optimum/index). The code for the customized handler is in the [handler.py](https://huggingface.co/philschmid/roberta-base-squad2-optimized/blob/main/handler.py).
14
 
15
+ Below we also describe how we converted & optimized the model, based on the [Accelerate Transformers with Hugging Face Optimum](https://huggingface.co/blog/optimum-inference) blog post. You can also check out the [notebook](https://huggingface.co/philschmid/roberta-base-squad2-optimized/blob/main/optimize_model.ipynb).
 
 
16
 
17
### Expected request payload
18
 
19
  ```json
20
  {
21
+ "inputs": {
22
+ "question": "As what is Philipp working?",
23
+ "context": "Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
24
+ }
25
  }
26
  ```
27
 
39
  HF_TOKEN = ""
40
 
41
 
42
+ def predict(question:str=None,context:str=None):
43
+ payload = {"inputs": {"question": question, "context": context}}
 
44
  response = r.post(
45
  ENDPOINT_URL, headers={"Authorization": f"Bearer {HF_TOKEN}"}, json=payload
46
  )
48
 
49
 
50
  prediction = predict(
51
+ question="As what is Philipp working?",
52
+ context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science."
53
  )
54
  ```
55
 
56
  expected output
57
 
58
  ```python
59
+ {
60
+ 'score': 0.4749588668346405,
61
+ 'start': 88,
62
+ 'end': 102,
63
+ 'answer': 'Technical Lead'
 
 
64
  }
65
  ```
66
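Taken together, a complete client call against the deployed endpoint might look like the following sketch; it simply fills in the fragments shown above, with `ENDPOINT_URL` and `HF_TOKEN` left as placeholders you set after deployment.

```python
import requests as r

ENDPOINT_URL = ""  # URL of the deployed Inference Endpoint (placeholder)
HF_TOKEN = ""      # Hugging Face access token (placeholder)

def predict(question: str = None, context: str = None):
    # payload format expected by the custom handler
    payload = {"inputs": {"question": question, "context": context}}
    response = r.post(
        ENDPOINT_URL, headers={"Authorization": f"Bearer {HF_TOKEN}"}, json=payload
    )
    # returns a dict like {'score': ..., 'start': ..., 'end': ..., 'answer': ...}
    return response.json()

prediction = predict(
    question="As what is Philipp working?",
    context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science.",
)
print(prediction["answer"])
```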
 
67
 
68
 
69
+ # Convert & Optimize model with Optimum
70
 
71
  Steps:
72
+ 1. [Convert model to ONNX](#1-convert-model-to-onnx)
73
+ 2. [Optimize & quantize model with Optimum](#2-optimize--quantize-model-with-optimum)
74
+ 3. [Create Custom Handler for Inference Endpoints](#3-create-custom-handler-for-inference-endpoints)
75
+ 4. [Test Custom Handler Locally](#4-test-custom-handler-locally)
76
+ 5. [Push to repository and create Inference Endpoint](#5-push-to-repository-and-create-inference-endpoint)
77
 
78
  Helpful links:
79
+ * [Accelerate Transformers with Hugging Face Optimum](https://huggingface.co/blog/optimum-inference)
80
+ * [Optimizing Transformers for GPUs with Optimum](https://www.philschmid.de/optimizing-transformers-with-optimum-gpu)
81
+ * [Optimum Documentation](https://huggingface.co/docs/optimum/onnxruntime/modeling_ort)
82
  * [Create Custom Handler Endpoints](https://link-to-docs)
83
 
84
  ## Setup & Installation
85
 
86
+
87
  ```python
88
  %%writefile requirements.txt
89
+ optimum[onnxruntime]==1.4.0
90
  mkl-include
91
  mkl
92
  ```
93
 
 
94
 
95
  ```python
96
  !pip install -r requirements.txt
97
  ```
98
 
99
+ ## 0. Baseline Performance
100
+
101
+
102
+ ```python
103
+ from transformers import pipeline
104
+
105
+ qa = pipeline("question-answering",model="deepset/roberta-base-squad2")
106
+ ```
107
+
108
+ Okay, let's test the performance (latency) with a sequence length of 128.
109
+
110
+
111
+ ```python
112
+ context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
113
+ question="As what is Philipp working?"
114
+
115
+ payload = {"inputs": {"question": question, "context": context}}
116
+ ```
117
+
118
+
119
+ ```python
120
+ from time import perf_counter
121
+ import numpy as np
122
+
123
+ def measure_latency(pipe,payload):
124
+ latencies = []
125
+ # warm up
126
+ for _ in range(10):
127
+ _ = pipe(question=payload["inputs"]["question"], context=payload["inputs"]["context"])
128
+ # Timed run
129
+ for _ in range(50):
130
+ start_time = perf_counter()
131
+ _ = pipe(question=payload["inputs"]["question"], context=payload["inputs"]["context"])
132
+ latency = perf_counter() - start_time
133
+ latencies.append(latency)
134
+ # Compute run statistics
135
+ time_avg_ms = 1000 * np.mean(latencies)
136
+ time_std_ms = 1000 * np.std(latencies)
137
+ return f"Average latency (ms) - {time_avg_ms:.2f} +\- {time_std_ms:.2f}"
138
+
139
+ print(f"Vanilla model {measure_latency(qa,payload)}")
140
+ # Vanilla model Average latency (ms) - 64.15 +\- 2.44
141
+ ```
142
+
143
+
144
+
145
  ## 1. Convert model to ONNX
146
 
147
 
148
  ```python
149
+ from optimum.onnxruntime import ORTModelForQuestionAnswering
150
  from transformers import AutoTokenizer
151
  from pathlib import Path
152
 
153
 
154
+ model_id="deepset/roberta-base-squad2"
155
  onnx_path = Path(".")
156
 
157
  # load vanilla transformers and convert to onnx
158
+ model = ORTModelForQuestionAnswering.from_pretrained(model_id, from_transformers=True)
159
  tokenizer = AutoTokenizer.from_pretrained(model_id)
160
 
161
  # save onnx checkpoint and tokenizer
171
  from optimum.onnxruntime import ORTOptimizer, ORTQuantizer
172
  from optimum.onnxruntime.configuration import OptimizationConfig, AutoQuantizationConfig
173
 
174
+ # Create the optimizer
175
+ optimizer = ORTOptimizer.from_pretrained(model)
176
+
177
+ # Define the optimization strategy by creating the appropriate configuration
178
  optimization_config = OptimizationConfig(optimization_level=99) # enable all optimizations
179
 
180
+ # Optimize the model
181
+ optimizer.optimize(save_dir=onnx_path, optimization_config=optimization_config)
182
+ ```
183
 
184
 
185
+ ```python
186
  # create ORTQuantizer and define quantization configuration
187
+ dynamic_quantizer = ORTQuantizer.from_pretrained(onnx_path, file_name="model_optimized.onnx")
188
  dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
189
 
190
  # apply the quantization configuration to the model
191
+ model_quantized_path = dynamic_quantizer.quantize(
192
+ save_dir=onnx_path,
 
193
  quantization_config=dqconfig,
194
  )
195
 
 
196
  ```
197
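As a quick sanity check you can compare the on-disk size of the exported, optimized, and quantized models. The snippet below is an illustrative sketch using only the standard library and the file names produced by the steps above.

```python
import os

# file names created by the export, optimization and quantization steps
for file_name in ["model.onnx", "model_optimized.onnx", "model_optimized_quantized.onnx"]:
    if os.path.exists(file_name):
        size_mb = os.path.getsize(file_name) / (1024 * 1024)
        print(f"{file_name}: {size_mb:.2f} MB")
```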
 
198
  ## 3. Create Custom Handler for Inference Endpoints
199
 
200
 
201
+
202
  ```python
203
+ %%writefile handler.py
204
  from typing import Dict, List, Any
205
+ from optimum.onnxruntime import ORTModelForQuestionAnswering
206
+ from transformers import AutoTokenizer, pipeline
207
 
208
 
209
+ class EndpointHandler():
210
  def __init__(self, path=""):
211
  # load the optimized model
212
+ self.model = ORTModelForQuestionAnswering.from_pretrained(path, file_name="model_optimized_quantized.onnx")
213
  self.tokenizer = AutoTokenizer.from_pretrained(path)
214
+ # create pipeline
215
+ self.pipeline = pipeline("question-answering", model=self.model, tokenizer=self.tokenizer)
216
 
217
  def __call__(self, data: Any) -> List[List[Dict[str, float]]]:
218
  """
220
  data (:obj:):
221
  includes the input data and the parameters for the inference.
222
  Return:
223
+ A :obj:`list`:. The list contains the answer and scores of the inference inputs
224
  """
225
  inputs = data.get("inputs", data)
226
  # run the model
227
+ prediction = self.pipeline(**inputs)
228
+ # return prediction
229
+ return prediction
230
  ```
231
 
232
+ ## 4. Test Custom Handler Locally
233
+
234
 
235
 
236
  ```python
237
+ from handler import EndpointHandler
238
 
239
  # init handler
240
+ my_handler = EndpointHandler(path=".")
241
 
242
  # prepare sample payload
243
+ context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
244
+ question="As what is Philipp working?"
245
 
246
+ payload = {"inputs": {"question": question, "context": context}}
 
247
 
248
+ # test the handler
249
+ my_handler(payload)
250
  ```
251
 
252
+
253
+ ```python
254
+ from time import perf_counter
255
+ import numpy as np
256
+
257
+ def measure_latency(handler,payload):
258
+ latencies = []
259
+ # warm up
260
+ for _ in range(10):
261
+ _ = handler(payload)
262
+ # Timed run
263
+ for _ in range(50):
264
+ start_time = perf_counter()
265
+ _ = handler(payload)
266
+ latency = perf_counter() - start_time
267
+ latencies.append(latency)
268
+ # Compute run statistics
269
+ time_avg_ms = 1000 * np.mean(latencies)
270
+ time_std_ms = 1000 * np.std(latencies)
271
+ return f"Average latency (ms) - {time_avg_ms:.2f} +\- {time_std_ms:.2f}"
272
+
273
+ print(f"Optimized & Quantized model {measure_latency(my_handler,payload)}")
274
+ # Optimized & Quantized model Average latency (ms) - 29.90 +\- 0.53
275
 
276
  ```
277
+
278
+ `Vanilla model Average latency (ms) - 64.15 +\- 2.44`
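Compared to the vanilla baseline, that is roughly a 2.1x latency improvement (64.15 ms / 29.90 ms ≈ 2.15) from graph optimization and dynamic quantization alone.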
279
+
280
+ ## 5. Push to repository and create Inference Endpoint
281
+
282
+
283
+
284
+ ```python
285
+ # add all our new files
286
+ !git add *
287
+ # commit our files
288
+ !git commit -m "add custom handler"
289
+ # push the files to the hub
290
+ !git push
291
  ```
292
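If you prefer not to drive git manually, the same files can also be pushed with the `huggingface_hub` client. The snippet below is an editorial sketch of that alternative (not part of the original notebook); the repo id is assumed to be the repository this README lives in.

```python
from huggingface_hub import HfApi

api = HfApi()
# upload the working directory (model files, handler.py, requirements.txt, ...) to the Hub
api.upload_folder(
    folder_path=".",
    repo_id="philschmid/roberta-base-squad2-optimized",  # assumed target repository
    repo_type="model",
)
```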
 
optimize_model.ipynb CHANGED
@@ -84,9 +84,20 @@
84
  },
85
  {
86
  "cell_type": "code",
87
- "execution_count": 8,
88
  "metadata": {},
89
- "outputs": [],
 
 
 
 
 
 
 
 
 
 
 
90
  "source": [
91
  "context=\"Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value.\" \n",
92
  "question=\"As what is Philipp working?\" \n",
@@ -395,9 +406,33 @@
395
  },
396
  {
397
  "cell_type": "code",
398
- "execution_count": null,
399
  "metadata": {},
400
- "outputs": [],
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
401
  "source": [
402
  "# add all our new files\n",
403
  "!git add * \n",
@@ -406,6 +441,13 @@
406
  "# push the files to the hub\n",
407
  "!git push"
408
]
409
  }
410
  ],
411
  "metadata": {
84
  },
85
  {
86
  "cell_type": "code",
87
+ "execution_count": 3,
88
  "metadata": {},
89
+ "outputs": [
90
+ {
91
+ "data": {
92
+ "text/plain": [
93
+ "'{\"inputs\": {\"question\": \"As what is Philipp working?\", \"context\": \"Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value.\"}}'"
94
+ ]
95
+ },
96
+ "execution_count": 3,
97
+ "metadata": {},
98
+ "output_type": "execute_result"
99
+ }
100
+ ],
101
  "source": [
102
  "context=\"Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value.\" \n",
103
  "question=\"As what is Philipp working?\" \n",
406
  },
407
  {
408
  "cell_type": "code",
409
+ "execution_count": 1,
410
  "metadata": {},
411
+ "outputs": [
412
+ {
413
+ "name": "stdout",
414
+ "output_type": "stream",
415
+ "text": [
416
+ "[main a854397] add custom handler\n",
417
+ " 14 files changed, 151227 insertions(+)\n",
418
+ " create mode 100644 README.md\n",
419
+ " create mode 100644 config.json\n",
420
+ " create mode 100644 handler.py\n",
421
+ " create mode 100644 merges.txt\n",
422
+ " create mode 100644 model.onnx\n",
423
+ " create mode 100644 model_optimized.onnx\n",
424
+ " create mode 100644 model_optimized_quantized.onnx\n",
425
+ " create mode 100644 optimize_model.ipynb\n",
426
+ " create mode 100644 ort_config.json\n",
427
+ " create mode 100644 requirements.txt\n",
428
+ " create mode 100644 special_tokens_map.json\n",
429
+ " create mode 100644 tokenizer.json\n",
430
+ " create mode 100644 tokenizer_config.json\n",
431
+ " create mode 100644 vocab.json\n",
432
+ "Username for 'https://huggingface.co': ^C\n"
433
+ ]
434
+ }
435
+ ],
436
  "source": [
437
  "# add all our new files\n",
438
  "!git add * \n",
441
  "# push the files to the hub\n",
442
  "!git push"
443
  ]
444
+ },
445
+ {
446
+ "cell_type": "code",
447
+ "execution_count": null,
448
+ "metadata": {},
449
+ "outputs": [],
450
+ "source": []
451
  }
452
  ],
453
  "metadata": {