{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Convert & Optimize model with Optimum \n", "\n", "\n", "Steps:\n", "1. Convert model to ONNX\n", "2. Optimize & quantize model with Optimum\n", "3. Create Custom Handler for Inference Endpoints\n", "4. Test Custom Handler Locally\n", "5. Push to repository and create Inference Endpoint\n", "\n", "Helpful links:\n", "* [Accelerate Transformers with Hugging Face Optimum](https://huggingface.co/blog/optimum-inference)\n", "* [Optimizing Transformers for GPUs with Optimum](https://www.philschmid.de/optimizing-transformers-with-optimum-gpu)\n", "* [Optimum Documentation](https://huggingface.co/docs/optimum/onnxruntime/modeling_ort)\n", "* [Create Custom Handler Endpoints](https://link-to-docs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup & Installation" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Writing requirements.txt\n" ] } ], "source": [ "%%writefile requirements.txt\n", "optimum[onnxruntime]==1.4.0\n", "mkl-include\n", "mkl" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -r requirements.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0. Base line Performance\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from transformers import pipeline\n", "\n", "qa = pipeline(\"question-answering\",model=\"deepset/roberta-base-squad2\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, let's test the performance (latency) with sequence length of 128." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'{\"inputs\": {\"question\": \"As what is Philipp working?\", \"context\": \"Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value.\"}}'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "context=\"Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. 
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Vanilla model Average latency (ms) - 64.15 +/- 2.44\n" ] } ], "source": [ "from time import perf_counter\n", "import numpy as np\n", "\n", "def measure_latency(pipe, payload):\n", "    latencies = []\n", "    # warm up\n", "    for _ in range(10):\n", "        _ = pipe(question=payload[\"inputs\"][\"question\"], context=payload[\"inputs\"][\"context\"])\n", "    # timed run\n", "    for _ in range(50):\n", "        start_time = perf_counter()\n", "        _ = pipe(question=payload[\"inputs\"][\"question\"], context=payload[\"inputs\"][\"context\"])\n", "        latency = perf_counter() - start_time\n", "        latencies.append(latency)\n", "    # compute run statistics\n", "    time_avg_ms = 1000 * np.mean(latencies)\n", "    time_std_ms = 1000 * np.std(latencies)\n", "    return f\"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}\"\n", "\n", "print(f\"Vanilla model {measure_latency(qa, payload)}\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Convert model to ONNX" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "from optimum.onnxruntime import ORTModelForQuestionAnswering\n", "from transformers import AutoTokenizer\n", "from pathlib import Path\n", "\n", "model_id = \"deepset/roberta-base-squad2\"\n", "onnx_path = Path(\".\")\n", "\n", "# load vanilla transformers model and convert it to ONNX\n", "model = ORTModelForQuestionAnswering.from_pretrained(model_id, from_transformers=True)\n", "tokenizer = AutoTokenizer.from_pretrained(model_id)\n", "\n", "# save the ONNX checkpoint and tokenizer\n", "model.save_pretrained(onnx_path)\n", "tokenizer.save_pretrained(onnx_path)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Optimize & quantize model with Optimum" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from optimum.onnxruntime import ORTOptimizer\n", "from optimum.onnxruntime.configuration import OptimizationConfig\n", "\n", "# create an ORTOptimizer and define the optimization configuration\n", "optimizer = ORTOptimizer.from_pretrained(model)\n", "optimization_config = OptimizationConfig(optimization_level=99)  # enable all available graph optimizations\n", "\n", "# apply the optimization configuration to the model -> model_optimized.onnx\n", "optimizer.optimize(save_dir=onnx_path, optimization_config=optimization_config)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from optimum.onnxruntime import ORTQuantizer\n", "from optimum.onnxruntime.configuration import AutoQuantizationConfig\n", "\n", "# create an ORTQuantizer from the optimized model and define the quantization configuration\n", "quantizer = ORTQuantizer.from_pretrained(onnx_path, file_name=\"model_optimized.onnx\")\n", "qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)\n", "\n", "# apply dynamic quantization -> model_optimized_quantized.onnx\n", "quantizer.quantize(save_dir=onnx_path, quantization_config=qconfig)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Create Custom Handler for Inference Endpoints" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile handler.py\n", "from typing import Dict, List, Any\n", "from optimum.onnxruntime import ORTModelForQuestionAnswering\n", "from transformers import AutoTokenizer, pipeline\n", "\n", "\n", "class EndpointHandler():\n", "    def __init__(self, path=\"\"):\n", "        # load the optimized & quantized model\n", "        self.model = ORTModelForQuestionAnswering.from_pretrained(path, file_name=\"model_optimized_quantized.onnx\")\n", "        self.tokenizer = AutoTokenizer.from_pretrained(path)\n", "        # create the inference pipeline\n", "        self.pipeline = pipeline(\"question-answering\", model=self.model, tokenizer=self.tokenizer)\n", "\n", "    def __call__(self, data: Dict[str, Any]) -> List[List[Dict[str, float]]]:\n", "        \"\"\"\n", "        Args:\n", "            data (:obj:):\n", "                includes the input data and the parameters for the inference.\n", "        Return:\n", "            A :obj:`list`: The list contains the answer and scores of the inference inputs\n", "        \"\"\"\n", "        inputs = data.get(\"inputs\", data)\n", "        # run the model\n", "        prediction = self.pipeline(**inputs)\n", "        # return prediction\n", "        return prediction" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Test Custom Handler Locally\n" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'score': 0.4749588668346405,\n", " 'start': 88,\n", " 'end': 102,\n", " 'answer': 'Technical Lead'}" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from handler import EndpointHandler\n", "\n", "# init handler\n", "my_handler = EndpointHandler(path=\".\")\n", "\n", "# prepare sample payload\n", "context=\"Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value.\"\n", "question=\"As what is Philipp working?\"\n", "\n", "payload = {\"inputs\": {\"question\": question, \"context\": context}}\n", "\n", "# test the handler\n", "my_handler(payload)" ] },
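{ "cell_type": "markdown", "metadata": {}, "source": [ "Inference Endpoints will deliver the payload as JSON over HTTP, so as a small optional check (not strictly required) we can mimic that round-trip locally by serializing and deserializing the payload before calling the handler." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "# serialize and deserialize the payload, as an HTTP request cycle would\n", "raw_request = json.dumps(payload)\n", "my_handler(json.loads(raw_request))" ] },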
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Optimized & Quantized model Average latency (ms) - 29.90 +/- 0.53\n" ] } ], "source": [ "from time import perf_counter\n", "import numpy as np\n", "\n", "def measure_latency(handler, payload):\n", "    latencies = []\n", "    # warm up\n", "    for _ in range(10):\n", "        _ = handler(payload)\n", "    # timed run\n", "    for _ in range(50):\n", "        start_time = perf_counter()\n", "        _ = handler(payload)\n", "        latency = perf_counter() - start_time\n", "        latencies.append(latency)\n", "    # compute run statistics\n", "    time_avg_ms = 1000 * np.mean(latencies)\n", "    time_std_ms = 1000 * np.std(latencies)\n", "    return f\"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}\"\n", "\n", "print(f\"Optimized & Quantized model {measure_latency(my_handler, payload)}\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "For comparison, the vanilla model measured `Average latency (ms) - 64.15 +/- 2.44`, so the optimized & quantized model is roughly 2.1x faster." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Push to repository and create Inference Endpoint\n" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[main a854397] add custom handler\n", " 14 files changed, 151227 insertions(+)\n", " create mode 100644 README.md\n", " create mode 100644 config.json\n", " create mode 100644 handler.py\n", " create mode 100644 merges.txt\n", " create mode 100644 model.onnx\n", " create mode 100644 model_optimized.onnx\n", " create mode 100644 model_optimized_quantized.onnx\n", " create mode 100644 optimize_model.ipynb\n", " create mode 100644 ort_config.json\n", " create mode 100644 requirements.txt\n", " create mode 100644 special_tokens_map.json\n", " create mode 100644 tokenizer.json\n", " create mode 100644 tokenizer_config.json\n", " create mode 100644 vocab.json\n", "Username for 'https://huggingface.co': ^C\n" ] } ], "source": [ "# add all our new files\n", "!git add *\n", "# commit our files\n", "!git commit -m \"add custom handler\"\n", "# push the files to the hub\n", "!git push" ] }
], "metadata": { "kernelspec": { "display_name": "Python 3.9.12 ('az': conda)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "bddb99ecda5b40a820d97bf37f3ff3a89fb9dbcf726ae84d28624ac628a665b4" } } }, "nbformat": 4, "nbformat_minor": 2 }