{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Text-to-Image Generation with Stable Diffusion and OpenVINO™\n", "\n", "Stable Diffusion is a text-to-image latent diffusion model created by researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/) and [LAION](https://laion.ai/). It is trained on 512x512 images from a subset of the [LAION-5B](https://laion.ai/blog/laion-5b/) database. The model uses a frozen CLIP ViT-L/14 text encoder to condition generation on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight.\n", "See the [model card](https://huggingface.co/CompVis/stable-diffusion) for more information.\n", "\n", "In general, diffusion models are machine learning systems that are trained to denoise random Gaussian noise step by step to arrive at a sample of interest, such as an image.\n", "Diffusion models have been shown to achieve state-of-the-art results for generating image data. One downside of diffusion models is that the reverse denoising process is slow. In addition, these models consume a lot of memory because they operate in pixel space, which becomes unreasonably expensive when generating high-resolution images. Therefore, it is challenging to train these models and to use them for inference. OpenVINO brings capabilities to run model inference on Intel hardware and opens the door to the fantastic world of diffusion models for everyone!\n", "\n", "The model's capabilities are not limited to text-to-image generation; it can also solve additional tasks, for example, text-guided image-to-image generation and inpainting. This tutorial also considers how to run text-guided image-to-image generation using Stable Diffusion.\n", "\n", "\n", "This notebook demonstrates how to convert and run the Stable Diffusion model using OpenVINO.\n", "\n", "The notebook contains the following steps:\n", "1. Convert PyTorch models to ONNX format.\n", "2. 
Convert ONNX models to OpenVINO IR format using the Model Optimizer tool.\n", "3. Run the Stable Diffusion pipeline with OpenVINO." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "**The following is needed only if you want to use the original model. If not, you do not have to do anything. Just run the notebook.**\n", "\n", ">**Note**:\n", ">The original model (for example, `stable-diffusion-v1-4`) requires you to accept the model license before downloading or using its weights. Visit the [stable-diffusion-v1-4 card](https://huggingface.co/CompVis/stable-diffusion-v1-4) to read and accept the license before you proceed.\n", ">To use this diffusion model, you must be a registered user in 🤗 Hugging Face Hub. You will need to use an access token for the code below to run. For more information on access tokens, refer to [this section of the documentation](https://huggingface.co/docs/hub/security-tokens).\n", ">You can log in to Hugging Face Hub in a notebook environment using the following code:\n", "```python\n", "# log in to the Hugging Face Hub to get access to the pretrained model\n", "from huggingface_hub import notebook_login, whoami\n", "\n", "try:\n", "    whoami()\n", "    print('Authorization token already provided')\n", "except OSError:\n", "    notebook_login()\n", "```\n", "\n", "This tutorial uses a Stable Diffusion model fine-tuned on images from Midjourney v4 (another popular solution for text-to-image generation).\n", "You can find more details about this model on the [model card](https://huggingface.co/prompthero/openjourney). 
The same steps for conversion and running the pipeline are applicable to other solutions based on Stable Diffusion.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[31mERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'\u001b[0m\u001b[31m\n", "\u001b[0m" ] } ], "source": [ "!pip install -r requirements.txt" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Create Pytorch Models pipeline\n", "StableDiffusionPipeline is an end-to-end inference pipeline that you can use to generate images from text with just a few lines of code.\n", "\n", "First, load the pre-trained weights of all components of the model." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "8NS0mazM0icN", "outputId": "b98b31ca-65ca-4bb0-fad2-2599c668ccad" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Authorization token already provided\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. 
The value `text_config[\"id2label\"]` will be overriden.\n" ] } ], "source": [ "from diffusers import StableDiffusionPipeline\n", "from huggingface_hub import notebook_login, whoami\n", "\n", "try:\n", "    whoami()\n", "    print('Authorization token already provided')\n", "except OSError:\n", "    notebook_login()\n", "\n", "# load the fine-tuned Stable Diffusion pipeline and keep it on CPU for export\n", "pipe = StableDiffusionPipeline.from_pretrained('prompthero/openjourney')\n", "#pipe = StableDiffusionPipeline.from_pretrained('runwayml/stable-diffusion-v1-5')\n", "pipe = pipe.to('cpu')\n", "text_encoder = pipe.text_encoder\n", "text_encoder.eval()\n", "unet = pipe.unet\n", "unet.eval()\n", "vae = pipe.vae\n", "vae.eval()\n", "\n", "del pipe\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Convert models to OpenVINO Intermediate representation (IR) format\n", "\n", "OpenVINO supports PyTorch through export to the ONNX format. You will use the `torch.onnx.export` function to obtain the ONNX model. You can learn more in the [PyTorch documentation](https://pytorch.org/docs/stable/onnx.html). You need to provide a model object, input data for model tracing and a path for saving the model. Optionally, you can provide the target ONNX opset for conversion and other parameters specified in the documentation (for example, input and output names or dynamic shapes).\n", "\n", "While ONNX models are directly supported by OpenVINO™ runtime, it can be useful to convert them to IR format to take advantage of advanced OpenVINO optimization tools and features. 
You will use the OpenVINO Model Optimizer tool to convert the model to IR format and compress its weights to `FP16` format.\n", "\n", "The model consists of three important parts:\n", "* Text Encoder, which creates the condition for generating an image from a text prompt.\n", "* U-Net, which denoises the latent image representation step by step.\n", "* Autoencoder (VAE), which encodes the input image to latent space (if required) and decodes the latent representation back to an image after generation.\n", "\n", "Let us convert each part." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Text Encoder\n", "\n", "The text encoder is responsible for transforming the input prompt, for example, \"a photo of an astronaut riding a horse\", into an embedding space that can be understood by the U-Net. It is usually a simple transformer-based encoder that maps a sequence of input tokens to a sequence of latent text embeddings.\n", "\n", "The input of the text encoder is the tensor `input_ids`, which contains the indexes of tokens from the text processed by the tokenizer and padded to the maximum length accepted by the model. The model outputs are two tensors: `last_hidden_state`, the hidden state from the last MultiHeadAttention layer in the model, and `pooler_out`, the pooled output for the whole model hidden states. You will use `opset_version=14`, because the model contains the `triu` operation, which is supported in ONNX only starting from this opset." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "U1cQFbVS0Ugf", "outputId": "9b3b825f-2adc-4d90-cae0-74890d368eab" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Text Encoder successfully converted to ONNX\n", "Warning: One or more of the values of the Constant can't fit in the float16 data type. 
Those values were casted to the nearest limit value, the model can produce incorrect results.\n", "Check for a new version of Intel(R) Distribution of OpenVINO(TM) toolkit here https://software.intel.com/content/www/us/en/develop/tools/openvino-toolkit/download.html?cid=other&source=prod&campid=ww_2023_bu_IOTG_OpenVINO-2022-3&content=upg_all&medium=organic or on https://github.com/openvinotoolkit/openvino\n", "[ INFO ] The model was converted to IR v11, the latest model format that corresponds to the source DL framework input/output format. While IR v11 is backwards compatible with OpenVINO Inference Engine API v1.0, please use API v2.0 (as of 2022.1) to take advantage of the latest improvements in IR v11.\n", "Find more information about API v2.0 and IR v11 at https://docs.openvino.ai/latest/openvino_2_0_transition_guide.html\n", "[ SUCCESS ] Generated IR version 11 model.\n", "[ SUCCESS ] XML file: /home/rainpole/stable_diffusion.openvino/sd2.1/text_encoder.xml\n", "[ SUCCESS ] BIN file: /home/rainpole/stable_diffusion.openvino/sd2.1/text_encoder.bin\n", "Text Encoder successfully converted to IR\n" ] }, { "data": { "text/plain": [ "0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import gc\n", "from pathlib import Path\n", "import torch\n", "\n", "TEXT_ENCODER_ONNX_PATH = Path('text_encoder.onnx')\n", "TEXT_ENCODER_OV_PATH = TEXT_ENCODER_ONNX_PATH.with_suffix('.xml')\n", "\n", "\n", "def convert_encoder_onnx(text_encoder: torch.nn.Module, onnx_path: Path):\n", "    \"\"\"\n", "    Convert Text Encoder model to ONNX.\n", "    Function accepts the text encoder, prepares example inputs and exports the model via torch.onnx.export.\n", "    Parameters:\n", "        text_encoder (torch.nn.Module): text encoder taken from the Stable Diffusion pipeline\n", "        onnx_path (Path): File for storing onnx model\n", "    Returns:\n", "        None\n", "    \"\"\"\n", "    if not onnx_path.exists():\n", "        input_ids = torch.ones((1, 77), dtype=torch.long)\n", "        # switch model to inference mode\n", "        text_encoder.eval()\n", "\n", "        # disable gradients calculation for reducing memory consumption\n", "        with torch.no_grad():\n", "            # infer model, just to make sure that it works\n", "            text_encoder(input_ids)\n", "            # export model to ONNX format\n", "            torch.onnx.export(\n", "                text_encoder,  # model instance\n", "                input_ids,  # inputs for model tracing\n", "                onnx_path,  # output file for saving result\n", "                input_names=['tokens'],  # model input name for onnx representation\n", "                output_names=['last_hidden_state', 'pooler_out'],  # model output names for onnx representation\n", "                opset_version=14  # onnx opset version for export\n", "            )\n", "        print('Text Encoder successfully converted to ONNX')\n", "\n", "\n", "if not TEXT_ENCODER_OV_PATH.exists():\n", "    convert_encoder_onnx(text_encoder, TEXT_ENCODER_ONNX_PATH)\n", "    !mo --input_model $TEXT_ENCODER_ONNX_PATH --compress_to_fp16\n", "    print('Text Encoder successfully converted to IR')\n", "else:\n", "    print(f\"Text encoder will be loaded from {TEXT_ENCODER_OV_PATH}\")\n", "\n", "del text_encoder\n", "gc.collect()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### U-net\n", "\n", "The U-Net model has three inputs:\n", "* `sample` - latent image sample from the previous step. The generation process has not started yet, so you will use random noise.\n", "* `timestep` - current scheduler step.\n", "* `encoder_hidden_state` - hidden state of the text encoder.\n", "\n", "The model predicts the `sample` state for the next step." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": { "id": "PscvJUXf_hpm" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Unet successfully converted to ONNX\n", "Check for a new version of Intel(R) Distribution of OpenVINO(TM) toolkit here https://software.intel.com/content/www/us/en/develop/tools/openvino-toolkit/download.html?cid=other&source=prod&campid=ww_2023_bu_IOTG_OpenVINO-2022-3&content=upg_all&medium=organic or on https://github.com/openvinotoolkit/openvino\n", "[ INFO ] The model was converted to IR v11, the latest model format that corresponds to the source DL framework input/output format. While IR v11 is backwards compatible with OpenVINO Inference Engine API v1.0, please use API v2.0 (as of 2022.1) to take advantage of the latest improvements in IR v11.\n", "Find more information about API v2.0 and IR v11 at https://docs.openvino.ai/latest/openvino_2_0_transition_guide.html\n", "[ SUCCESS ] Generated IR version 11 model.\n", "[ SUCCESS ] XML file: /home/rainpole/stable_diffusion.openvino/sd2.1/unet.xml\n", "[ SUCCESS ] BIN file: /home/rainpole/stable_diffusion.openvino/sd2.1/unet.bin\n", "Unet successfully converted to IR\n" ] }, { "data": { "text/plain": [ "0" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "UNET_ONNX_PATH = Path('unet/unet.onnx')\n", "UNET_OV_PATH = UNET_ONNX_PATH.parents[1] / 'unet.xml'\n", "\n", "\n", "def convert_unet_onnx(unet: torch.nn.Module, onnx_path: Path):\n", "    \"\"\"\n", "    Convert Unet model to ONNX.\n", "    Function accepts the Unet model, prepares example inputs and exports the model via torch.onnx.export.\n", "    Parameters:\n", "        unet (torch.nn.Module): Unet model taken from the Stable Diffusion pipeline\n", "        onnx_path (Path): File for storing onnx model\n", "    Returns:\n", "        None\n", "    \"\"\"\n", "    if not onnx_path.exists():\n", "        # prepare inputs\n", "        encoder_hidden_state = torch.ones((2, 77, 768))\n", "        latents_shape = (2, 4, 512 // 8, 512 // 8)\n", "        latents = torch.randn(latents_shape)\n", "        t = torch.from_numpy(np.array(1, dtype=float))\n", "\n", "        # the model is larger than 2 GB, so it is exported as ONNX with external data files;\n", "        # store it in a separate directory to avoid cluttering the current one\n", "        onnx_path.parent.mkdir(exist_ok=True, parents=True)\n", "        unet.eval()\n", "\n", "        with torch.no_grad():\n", "            torch.onnx.export(\n", "                unet,\n", "                (latents, t, encoder_hidden_state), str(onnx_path),\n", "                input_names=['latent_model_input', 't', 'encoder_hidden_states'],\n", "                output_names=['out_sample']\n", "            )\n", "        print('Unet successfully converted to ONNX')\n", "\n", "\n", "if not UNET_OV_PATH.exists():\n", "    convert_unet_onnx(unet, UNET_ONNX_PATH)\n", "    del unet\n", "    gc.collect()\n", "    !mo --input_model $UNET_ONNX_PATH --compress_to_fp16\n", "    print('Unet successfully converted to IR')\n", "else:\n", "    del unet\n", "    print(f\"Unet will be loaded from {UNET_OV_PATH}\")\n", "gc.collect()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### VAE\n", "\n", "The VAE model has two parts, an encoder and a decoder. The encoder is used to convert the image into a low-dimensional latent representation, which will serve as the input to the U-Net model. The decoder, conversely, transforms the latent representation back into an image.\n", "\n", "During latent diffusion training, the encoder is used to get the latent representations (latents) of the images for the forward diffusion process, which applies more and more noise at each step. 
During inference, the denoised latents generated by the reverse diffusion process are converted back into images using the VAE decoder. When you run inference for text-to-image, there is no initial image as a starting point. You can skip this step and directly generate initial random noise.\n", "\n", "As the encoder and the decoder are used independently in different parts of the pipeline, it will be better to convert them to separate models." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "id": "fQvd38qPHLtq" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "VAE encoder successfully converted to ONNX\n", "Check for a new version of Intel(R) Distribution of OpenVINO(TM) toolkit here https://software.intel.com/content/www/us/en/develop/tools/openvino-toolkit/download.html?cid=other&source=prod&campid=ww_2023_bu_IOTG_OpenVINO-2022-3&content=upg_all&medium=organic or on https://github.com/openvinotoolkit/openvino\n", "[ INFO ] The model was converted to IR v11, the latest model format that corresponds to the source DL framework input/output format. 
While IR v11 is backwards compatible with OpenVINO Inference Engine API v1.0, please use API v2.0 (as of 2022.1) to take advantage of the latest improvements in IR v11.\n", "Find more information about API v2.0 and IR v11 at https://docs.openvino.ai/latest/openvino_2_0_transition_guide.html\n", "[ SUCCESS ] Generated IR version 11 model.\n", "[ SUCCESS ] XML file: /home/rainpole/stable_diffusion.openvino/sd2.1/vae_encoder.xml\n", "[ SUCCESS ] BIN file: /home/rainpole/stable_diffusion.openvino/sd2.1/vae_encoder.bin\n", "VAE encoder successfully converted to IR\n", "VAE decoder successfully converted to ONNX\n", "Check for a new version of Intel(R) Distribution of OpenVINO(TM) toolkit here https://software.intel.com/content/www/us/en/develop/tools/openvino-toolkit/download.html?cid=other&source=prod&campid=ww_2023_bu_IOTG_OpenVINO-2022-3&content=upg_all&medium=organic or on https://github.com/openvinotoolkit/openvino\n", "[ INFO ] The model was converted to IR v11, the latest model format that corresponds to the source DL framework input/output format. While IR v11 is backwards compatible with OpenVINO Inference Engine API v1.0, please use API v2.0 (as of 2022.1) to take advantage of the latest improvements in IR v11.\n", "Find more information about API v2.0 and IR v11 at https://docs.openvino.ai/latest/openvino_2_0_transition_guide.html\n", "[ SUCCESS ] Generated IR version 11 model.\n", "[ SUCCESS ] XML file: /home/rainpole/stable_diffusion.openvino/sd2.1/vae_decoder.xml\n", "[ SUCCESS ] BIN file: /home/rainpole/stable_diffusion.openvino/sd2.1/vae_decoder.bin\n", "VAE decoder successfully converted to IR\n" ] } ], "source": [ "VAE_ENCODER_ONNX_PATH = Path('vae_encoder.onnx')\n", "VAE_ENCODER_OV_PATH = VAE_ENCODER_ONNX_PATH.with_suffix('.xml')\n", "\n", "\n", "def convert_vae_encoder_onnx(vae: torch.nn.Module, onnx_path: Path):\n", "    \"\"\"\n", "    Convert VAE encoder to ONNX.\n", "    Function accepts the VAE model, creates a wrapper class that exports only the part necessary for inference,\n", "    prepares example inputs and exports the wrapper via torch.onnx.export.\n", "    Parameters:\n", "        vae (torch.nn.Module): VAE model taken from the Stable Diffusion pipeline\n", "        onnx_path (Path): File for storing onnx model\n", "    Returns:\n", "        None\n", "    \"\"\"\n", "    class VAEEncoderWrapper(torch.nn.Module):\n", "        def __init__(self, vae):\n", "            super().__init__()\n", "            self.vae = vae\n", "\n", "        def forward(self, image):\n", "            h = self.vae.encoder(image)\n", "            moments = self.vae.quant_conv(h)\n", "            return moments\n", "\n", "    if not onnx_path.exists():\n", "        vae_encoder = VAEEncoderWrapper(vae)\n", "        vae_encoder.eval()\n", "        image = torch.zeros((1, 3, 512, 512))\n", "        with torch.no_grad():\n", "            torch.onnx.export(vae_encoder, image, onnx_path,\n", "                              input_names=['init_image'], output_names=['image_latent'])\n", "        print('VAE encoder successfully converted to ONNX')\n", "\n", "\n", "if not VAE_ENCODER_OV_PATH.exists():\n", "    convert_vae_encoder_onnx(vae, VAE_ENCODER_ONNX_PATH)\n", "    !mo --input_model $VAE_ENCODER_ONNX_PATH --compress_to_fp16\n", "    print('VAE encoder successfully converted to IR')\n", "else:\n", "    print(f\"VAE encoder will be loaded from {VAE_ENCODER_OV_PATH}\")\n", "\n", "VAE_DECODER_ONNX_PATH = Path('vae_decoder.onnx')\n", "VAE_DECODER_OV_PATH = VAE_DECODER_ONNX_PATH.with_suffix('.xml')\n", "\n", "\n", "def convert_vae_decoder_onnx(vae: torch.nn.Module, onnx_path: Path):\n", "    \"\"\"\n", "    Convert VAE decoder to ONNX.\n", "    Function accepts the VAE model, creates a wrapper class that exports only the part necessary for inference,\n", "    prepares example inputs and exports the wrapper via torch.onnx.export.\n", "    Parameters:\n", "        vae (torch.nn.Module): VAE model taken from the Stable Diffusion pipeline\n", "        onnx_path (Path): File for storing onnx model\n", "    Returns:\n", "        None\n", "    \"\"\"\n", "    class VAEDecoderWrapper(torch.nn.Module):\n", "        def __init__(self, vae):\n", "            super().__init__()\n", "            self.vae = vae\n", "\n", "        def forward(self, latents):\n", "            # undo the latent scaling factor before decoding\n", "            latents = 1 / 0.18215 * latents\n", "            return self.vae.decode(latents)\n", "\n", "    if not onnx_path.exists():\n", "        vae_decoder = VAEDecoderWrapper(vae)\n", "        latents = torch.zeros((1, 4, 64, 64))\n", "\n", "        vae_decoder.eval()\n", "        with torch.no_grad():\n", "            torch.onnx.export(vae_decoder, latents, onnx_path,\n", "                              input_names=['latents'], output_names=['sample'])\n", "        print('VAE decoder successfully converted to ONNX')\n", "\n", "\n", "if not VAE_DECODER_OV_PATH.exists():\n", "    convert_vae_decoder_onnx(vae, VAE_DECODER_ONNX_PATH)\n", "    !mo --input_model $VAE_DECODER_ONNX_PATH --compress_to_fp16\n", "    print('VAE decoder successfully converted to IR')\n", "else:\n", "    print(f\"VAE decoder will be loaded from {VAE_DECODER_OV_PATH}\")\n", "\n", "del vae" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Prepare Inference Pipeline\n", "\n", "Putting it all together, let us now take a closer look at how the model works in inference by illustrating the logical flow." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "As the diagram shows, the only difference between text-to-image and text-guided image-to-image generation is how the initial latent state is produced. In image-to-image generation, the input image is encoded by the VAE encoder and mixed with noise generated from the latent seed, while text-to-image generation uses only noise as the initial latent state.\n", "The Stable Diffusion model takes two inputs: a latent image representation of size $64 \\times 64$ and a text prompt that is transformed into text embeddings of size $77 \\times 768$ by CLIP's text encoder.\n", "\n", "Next, the U-Net iteratively *denoises* the random latent image representations while being conditioned on the text embeddings. The output of the U-Net, the noise residual, is used to compute a denoised latent image representation via a scheduler algorithm. Many different scheduler algorithms can be used for this computation, each with its own pros and cons. For Stable Diffusion, it is recommended to use one of:\n", "\n", "- [PNDM scheduler](https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_pndm.py)\n", "- [DDIM scheduler](https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_ddim.py)\n", "- [K-LMS scheduler](https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_lms_discrete.py) (the one you will use in your pipeline)\n", "\n", "The theory behind the scheduler algorithms is out of scope for this notebook. 
In short, the predicted denoised image representation is computed from the previous noisy representation and the predicted noise residual.\n", "For more information, refer to the recommended paper [Elucidating the Design Space of Diffusion-Based Generative Models](https://arxiv.org/abs/2206.00364).\n", "\n", "The *denoising* process is repeated a given number of times (50 by default) to retrieve progressively better latent image representations.\n", "When complete, the latent image representation is decoded by the decoder part of the variational autoencoder." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2023-05-19 11:08:12.044722: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.\n", "2023-05-19 11:08:12.079869: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.\n", "2023-05-19 11:08:12.080328: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n", "To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", "2023-05-19 11:08:12.605131: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n" ] }, { "data": { "text/html": [ "โญโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ Traceback (most recent call last) โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ\n", "โ in <module>:59 โ\n", "โ โ\n", "โ 56 โ return image, {\"padding\": pad, \"src_width\": src_width, \"src_height\": src_height} โ\n", "โ 57 โ\n", "โ 58 โ\n", "โ โฑ 59 class OVStableDiffusionPipeline(DiffusionPipeline): โ\n", "โ 60 โ def __init__( โ\n", "โ 61 โ โ self, โ\n", "โ 62 โ โ vae_decoder: Model, โ\n", "โ โ\n", "โ in OVStableDiffusionPipeline:226 โ\n", "โ โ\n", "โ 223 โ โ image = 
self.postprocess_image(image, meta, output_type) โ\n", "โ 224 โ โ return {\"sample\": image, 'iterations': img_buffer} โ\n", "โ 225 โ โ\n", "โ โฑ 226 โ def prepare_latents(self, image:PIL.Image.Image = None, latent_timestep:torch.Tensor โ\n", "โ 227 โ โ \"\"\" โ\n", "โ 228 โ โ Function for getting initial latents for starting generation โ\n", "โ 229 โ\n", "โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ\n", "NameError: name 'torch' is not defined\n", "\n" ], "text/plain": [ "\u001b[31mโญโ\u001b[0m\u001b[31mโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ\u001b[0m\u001b[31m \u001b[0m\u001b[1;31mTraceback \u001b[0m\u001b[1;2;31m(most recent call last)\u001b[0m\u001b[31m \u001b[0m\u001b[31mโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ\u001b[0m\u001b[31mโโฎ\u001b[0m\n", "\u001b[31mโ\u001b[0m in \u001b[92m
โญโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ Traceback (most recent call last) โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ\n", "โ in <module>:1 โ\n", "โ โ\n", "โ โฑ 1 final_image = result['sample'][0] โ\n", "โ 2 if result['iterations']: โ\n", "โ 3 โ all_frames = result['iterations'] โ\n", "โ 4 โ img = next(iter(all_frames)) โ\n", "โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ\n", "NameError: name 'result' is not defined\n", "\n" ], "text/plain": [ "\u001b[31mโญโ\u001b[0m\u001b[31mโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ\u001b[0m\u001b[31m \u001b[0m\u001b[1;31mTraceback \u001b[0m\u001b[1;2;31m(most recent call last)\u001b[0m\u001b[31m \u001b[0m\u001b[31mโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ\u001b[0m\u001b[31mโโฎ\u001b[0m\n", "\u001b[31mโ\u001b[0m in \u001b[92m