{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "901c8ef3", "metadata": {}, "outputs": [], "source": [ "# Copyright (c) Meta Platforms, Inc. and affiliates." ] }, { "cell_type": "markdown", "id": "1662bb7c", "metadata": {}, "source": [ "# Produces masks from prompts using an ONNX model" ] }, { "cell_type": "markdown", "id": "7fcc21a0", "metadata": {}, "source": [ "SAM's prompt encoder and mask decoder are very lightweight, which allows for efficient computation of a mask given user input. This notebook shows an example of how to export and use this lightweight component of the model in ONNX format, allowing it to run on a variety of platforms that support an ONNX runtime." ] }, { "cell_type": "code", "execution_count": 4, "id": "86daff77", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \"Open\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import display, HTML\n", "display(HTML(\n", "\"\"\"\n", "\n", " \"Open\n", "\n", "\"\"\"\n", "))" ] }, { "cell_type": "markdown", "id": "55ae4e00", "metadata": {}, "source": [ "## Environment Set-up" ] }, { "cell_type": "markdown", "id": "109a5cc2", "metadata": {}, "source": [ "If running locally using jupyter, first install `segment_anything` in your environment using the [installation instructions](https://github.com/facebookresearch/segment-anything#installation) in the repository. The latest stable versions of PyTorch and ONNX are recommended for this notebook. If running from Google Colab, set `using_collab=True` below and run the cell. In Colab, be sure to select 'GPU' under 'Edit'->'Notebook Settings'->'Hardware accelerator'." ] }, { "cell_type": "code", "execution_count": 5, "id": "39b99fc4", "metadata": {}, "outputs": [], "source": [ "using_colab = False" ] }, { "cell_type": "code", "execution_count": 6, "id": "296a69be", "metadata": {}, "outputs": [], "source": [ "if using_colab:\n", " import torch\n", " import torchvision\n", " print(\"PyTorch version:\", torch.__version__)\n", " print(\"Torchvision version:\", torchvision.__version__)\n", " print(\"CUDA is available:\", torch.cuda.is_available())\n", " import sys\n", " !{sys.executable} -m pip install opencv-python matplotlib onnx onnxruntime\n", " !{sys.executable} -m pip install 'git+https://github.com/facebookresearch/segment-anything.git'\n", " \n", " !mkdir images\n", " !wget -P images https://raw.githubusercontent.com/facebookresearch/segment-anything/main/notebooks/images/truck.jpg\n", " \n", " !wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth" ] }, { "cell_type": "markdown", "id": "dc4a58be", "metadata": {}, "source": [ "## Set-up" ] }, { "cell_type": "markdown", "id": "42396e8d", "metadata": {}, "source": [ "Note that this notebook requires both the `onnx` and `onnxruntime` optional dependencies, in addition to `opencv-python` and `matplotlib` for visualization." 
] }, { "cell_type": "code", "execution_count": null, "id": "2c712610", "metadata": {}, "outputs": [], "source": [ "import torch\n", "import numpy as np\n", "import cv2\n", "import matplotlib.pyplot as plt\n", "from segment_anything import sam_model_registry, SamPredictor\n", "from segment_anything.utils.onnx import SamOnnxModel\n", "\n", "import onnxruntime\n", "from onnxruntime.quantization import QuantType\n", "from onnxruntime.quantization.quantize import quantize_dynamic" ] }, { "cell_type": "code", "execution_count": null, "id": "f29441b9", "metadata": {}, "outputs": [], "source": [ "def show_mask(mask, ax):\n", " color = np.array([30/255, 144/255, 255/255, 0.6])\n", " h, w = mask.shape[-2:]\n", " mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)\n", " ax.imshow(mask_image)\n", " \n", "def show_points(coords, labels, ax, marker_size=375):\n", " pos_points = coords[labels==1]\n", " neg_points = coords[labels==0]\n", " ax.scatter(pos_points[:, 0], pos_points[:, 1], color='green', marker='*', s=marker_size, edgecolor='white', linewidth=1.25)\n", " ax.scatter(neg_points[:, 0], neg_points[:, 1], color='red', marker='*', s=marker_size, edgecolor='white', linewidth=1.25) \n", " \n", "def show_box(box, ax):\n", " x0, y0 = box[0], box[1]\n", " w, h = box[2] - box[0], box[3] - box[1]\n", " ax.add_patch(plt.Rectangle((x0, y0), w, h, edgecolor='green', facecolor=(0,0,0,0), lw=2)) " ] }, { "cell_type": "markdown", "id": "bd0f6b2b", "metadata": {}, "source": [ "## Export an ONNX model" ] }, { "cell_type": "markdown", "id": "1540f719", "metadata": {}, "source": [ "Set the path below to a SAM model checkpoint, then load the model. This will be needed to both export the model and to calculate embeddings for the model." ] }, { "cell_type": "code", "execution_count": null, "id": "76fc53f4", "metadata": {}, "outputs": [], "source": [ "checkpoint = \"sam_vit_h_4b8939.pth\"\n", "model_type = \"default\"" ] }, { "cell_type": "code", "execution_count": null, "id": "11bfc8aa", "metadata": {}, "outputs": [], "source": [ "sam = sam_model_registry[model_type](checkpoint=checkpoint)" ] }, { "cell_type": "markdown", "id": "450c089c", "metadata": {}, "source": [ "The script `segment-anything/scripts/export_onnx_model.py` can be used to export the necessary portion of SAM. Alternatively, run the following code to export an ONNX model. If you have already exported a model, set the path below and skip to the next section. Assure that the exported ONNX model aligns with the checkpoint and model type set above. This notebook expects the model was exported with the parameter `return_single_mask=True`." ] }, { "cell_type": "code", "execution_count": null, "id": "38a8add8", "metadata": {}, "outputs": [], "source": [ "onnx_model_path = None # Set to use an already exported model, then skip to the next section." 
] }, { "cell_type": "code", "execution_count": null, "id": "7da638ba", "metadata": { "scrolled": false }, "outputs": [], "source": [ "import warnings\n", "\n", "onnx_model_path = \"sam_onnx_example.onnx\"\n", "\n", "onnx_model = SamOnnxModel(sam, return_single_mask=True)\n", "\n", "dynamic_axes = {\n", " \"point_coords\": {1: \"num_points\"},\n", " \"point_labels\": {1: \"num_points\"},\n", "}\n", "\n", "embed_dim = sam.prompt_encoder.embed_dim\n", "embed_size = sam.prompt_encoder.image_embedding_size\n", "mask_input_size = [4 * x for x in embed_size]\n", "dummy_inputs = {\n", " \"image_embeddings\": torch.randn(1, embed_dim, *embed_size, dtype=torch.float),\n", " \"point_coords\": torch.randint(low=0, high=1024, size=(1, 5, 2), dtype=torch.float),\n", " \"point_labels\": torch.randint(low=0, high=4, size=(1, 5), dtype=torch.float),\n", " \"mask_input\": torch.randn(1, 1, *mask_input_size, dtype=torch.float),\n", " \"has_mask_input\": torch.tensor([1], dtype=torch.float),\n", " \"orig_im_size\": torch.tensor([1500, 2250], dtype=torch.float),\n", "}\n", "output_names = [\"masks\", \"iou_predictions\", \"low_res_masks\"]\n", "\n", "with warnings.catch_warnings():\n", " warnings.filterwarnings(\"ignore\", category=torch.jit.TracerWarning)\n", " warnings.filterwarnings(\"ignore\", category=UserWarning)\n", " with open(onnx_model_path, \"wb\") as f:\n", " torch.onnx.export(\n", " onnx_model,\n", " tuple(dummy_inputs.values()),\n", " f,\n", " export_params=True,\n", " verbose=False,\n", " opset_version=17,\n", " do_constant_folding=True,\n", " input_names=list(dummy_inputs.keys()),\n", " output_names=output_names,\n", " dynamic_axes=dynamic_axes,\n", " ) " ] }, { "cell_type": "markdown", "id": "c450cf1a", "metadata": {}, "source": [ "If desired, the model can additionally be quantized and optimized. We find this improves web runtime significantly for negligible change in qualitative performance. Run the next cell to quantize the model, or skip to the next section otherwise." ] }, { "cell_type": "code", "execution_count": null, "id": "235d39fe", "metadata": {}, "outputs": [], "source": [ "onnx_model_quantized_path = \"sam_onnx_quantized_example.onnx\"\n", "quantize_dynamic(\n", " model_input=onnx_model_path,\n", " model_output=onnx_model_quantized_path,\n", " optimize_model=True,\n", " per_channel=False,\n", " reduce_range=False,\n", " weight_type=QuantType.QUInt8,\n", ")\n", "onnx_model_path = onnx_model_quantized_path" ] }, { "cell_type": "markdown", "id": "927a928b", "metadata": {}, "source": [ "## Example Image" ] }, { "cell_type": "code", "execution_count": null, "id": "6be6eb55", "metadata": {}, "outputs": [], "source": [ "image = cv2.imread('images/truck.jpg')\n", "image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)" ] }, { "cell_type": "code", "execution_count": null, "id": "b7e9a27a", "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(10,10))\n", "plt.imshow(image)\n", "plt.axis('on')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "027b177b", "metadata": {}, "source": [ "## Using an ONNX model" ] }, { "cell_type": "markdown", "id": "778d4593", "metadata": {}, "source": [ "Here as an example, we use `onnxruntime` in python on CPU to execute the ONNX model. However, any platform that supports an ONNX runtime could be used in principle. 
, { "cell_type": "code", "execution_count": null, "id": "9689b1bf", "metadata": {}, "outputs": [], "source": [ "ort_session = onnxruntime.InferenceSession(onnx_model_path)" ] }, { "cell_type": "markdown", "id": "7708ead6", "metadata": {}, "source": [ "To use the ONNX model, the image must first be pre-processed using the SAM image encoder. This is a heavyweight process that is best performed on a GPU. `SamPredictor` can be used as normal, and `.get_image_embedding()` will then retrieve the intermediate features." ] }, { "cell_type": "code", "execution_count": null, "id": "26e067b4", "metadata": {}, "outputs": [], "source": [ "sam.to(device='cuda')\n", "predictor = SamPredictor(sam)" ] }, { "cell_type": "code", "execution_count": null, "id": "7ad3f0d6", "metadata": {}, "outputs": [], "source": [ "predictor.set_image(image)" ] }, { "cell_type": "code", "execution_count": null, "id": "8a6f0f07", "metadata": {}, "outputs": [], "source": [ "image_embedding = predictor.get_image_embedding().cpu().numpy()" ] }, { "cell_type": "code", "execution_count": null, "id": "5e112f33", "metadata": {}, "outputs": [], "source": [ "image_embedding.shape" ] }, { "cell_type": "markdown", "id": "6337b654", "metadata": {}, "source": [ "The ONNX model has a different input signature than `SamPredictor.predict`. The following inputs must all be supplied. Note the special cases for both point and mask inputs. All inputs are `np.float32`.\n", "* `image_embeddings`: The image embedding from `predictor.get_image_embedding()`. Has a batch index of length 1.\n", "* `point_coords`: Coordinates of sparse input prompts, corresponding to both point inputs and box inputs. Boxes are encoded using two points, one for the top-left corner and one for the bottom-right corner. *Coordinates must already be transformed to long-side 1024.* Has a batch index of length 1.\n", "* `point_labels`: Labels for the sparse input prompts. 0 is a negative input point, 1 is a positive input point, 2 is a top-left box corner, 3 is a bottom-right box corner, and -1 is a padding point. *If there is no box input, a single padding point with label -1 and coordinates (0.0, 0.0) should be concatenated.*\n", "* `mask_input`: A mask input to the model with shape 1x1x256x256. This must be supplied even if there is no mask input. In this case, it can just be zeros.\n", "* `has_mask_input`: An indicator for the mask input. 1 indicates a mask input, 0 indicates no mask input.\n", "* `orig_im_size`: The size of the input image in (H,W) format, before any transformation.\n", "\n", "Additionally, the ONNX model does not threshold the output mask logits. To obtain a binary mask, threshold at `sam.mask_threshold` (equal to 0.0)." ] }, { "cell_type": "markdown", "id": "bf5a9f55", "metadata": {}, "source": [ "### Example point input" ] }, { "cell_type": "code", "execution_count": null, "id": "1c0deef0", "metadata": {}, "outputs": [], "source": [ "input_point = np.array([[500, 375]])\n", "input_label = np.array([1])" ] }, { "cell_type": "markdown", "id": "7256394c", "metadata": {}, "source": [ "Add a batch index, concatenate a padding point, and transform."
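] }, { "cell_type": "markdown", "id": "a1f0crd01", "metadata": {}, "source": [ "Before assembling the full ONNX inputs, the optional sketch below illustrates the coordinate transform on its own: `apply_coords` rescales the raw point from the original image resolution to the long-side-1024 frame the model expects, so the scale factor is simply 1024 divided by the longer image side. Only names defined above are used." ] }, { "cell_type": "code", "execution_count": null, "id": "a1f0crd02", "metadata": {}, "outputs": [], "source": [ "# Optional sketch: show how apply_coords maps original-resolution coordinates\n", "# to the long-side-1024 frame expected by the ONNX model.\n", "print(\"original (H, W):\", image.shape[:2])\n", "print(\"scale factor:\", 1024 / max(image.shape[:2]))\n", "print(\"transformed point:\", predictor.transform.apply_coords(input_point.astype(np.float32), image.shape[:2]))"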
] }, { "cell_type": "code", "execution_count": null, "id": "4f69903e", "metadata": {}, "outputs": [], "source": [ "onnx_coord = np.concatenate([input_point, np.array([[0.0, 0.0]])], axis=0)[None, :, :]\n", "onnx_label = np.concatenate([input_label, np.array([-1])], axis=0)[None, :].astype(np.float32)\n", "\n", "onnx_coord = predictor.transform.apply_coords(onnx_coord, image.shape[:2]).astype(np.float32)\n" ] }, { "cell_type": "markdown", "id": "b188dc53", "metadata": {}, "source": [ "Create an empty mask input and an indicator for no mask." ] }, { "cell_type": "code", "execution_count": null, "id": "5cb52bcf", "metadata": {}, "outputs": [], "source": [ "onnx_mask_input = np.zeros((1, 1, 256, 256), dtype=np.float32)\n", "onnx_has_mask_input = np.zeros(1, dtype=np.float32)" ] }, { "cell_type": "markdown", "id": "a99c2cc5", "metadata": {}, "source": [ "Package the inputs to run in the onnx model" ] }, { "cell_type": "code", "execution_count": null, "id": "b1d7ea11", "metadata": {}, "outputs": [], "source": [ "ort_inputs = {\n", " \"image_embeddings\": image_embedding,\n", " \"point_coords\": onnx_coord,\n", " \"point_labels\": onnx_label,\n", " \"mask_input\": onnx_mask_input,\n", " \"has_mask_input\": onnx_has_mask_input,\n", " \"orig_im_size\": np.array(image.shape[:2], dtype=np.float32)\n", "}" ] }, { "cell_type": "markdown", "id": "4b6409c9", "metadata": {}, "source": [ "Predict a mask and threshold it." ] }, { "cell_type": "code", "execution_count": null, "id": "dc4cc082", "metadata": { "scrolled": false }, "outputs": [], "source": [ "masks, _, low_res_logits = ort_session.run(None, ort_inputs)\n", "masks = masks > predictor.model.mask_threshold" ] }, { "cell_type": "code", "execution_count": null, "id": "d778a8fb", "metadata": {}, "outputs": [], "source": [ "masks.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "badb1175", "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(10,10))\n", "plt.imshow(image)\n", "show_mask(masks, plt.gca())\n", "show_points(input_point, input_label, plt.gca())\n", "plt.axis('off')\n", "plt.show() " ] }, { "cell_type": "markdown", "id": "1f1d4d15", "metadata": {}, "source": [ "### Example mask input" ] }, { "cell_type": "code", "execution_count": null, "id": "b319da82", "metadata": {}, "outputs": [], "source": [ "input_point = np.array([[500, 375], [1125, 625]])\n", "input_label = np.array([1, 1])\n", "\n", "# Use the mask output from the previous run. It is already in the correct form for input to the ONNX model.\n", "onnx_mask_input = low_res_logits" ] }, { "cell_type": "markdown", "id": "b1823b37", "metadata": {}, "source": [ "Transform the points as in the previous example." ] }, { "cell_type": "code", "execution_count": null, "id": "8885130f", "metadata": {}, "outputs": [], "source": [ "onnx_coord = np.concatenate([input_point, np.array([[0.0, 0.0]])], axis=0)[None, :, :]\n", "onnx_label = np.concatenate([input_label, np.array([-1])], axis=0)[None, :].astype(np.float32)\n", "\n", "onnx_coord = predictor.transform.apply_coords(onnx_coord, image.shape[:2]).astype(np.float32)" ] }, { "cell_type": "markdown", "id": "28e47b69", "metadata": {}, "source": [ "The `has_mask_input` indicator is now 1." ] }, { "cell_type": "code", "execution_count": null, "id": "3ab4483a", "metadata": {}, "outputs": [], "source": [ "onnx_has_mask_input = np.ones(1, dtype=np.float32)" ] }, { "cell_type": "markdown", "id": "d3781955", "metadata": {}, "source": [ "Package inputs, then predict and threshold the mask." 
] }, { "cell_type": "code", "execution_count": null, "id": "0c1ec096", "metadata": {}, "outputs": [], "source": [ "ort_inputs = {\n", " \"image_embeddings\": image_embedding,\n", " \"point_coords\": onnx_coord,\n", " \"point_labels\": onnx_label,\n", " \"mask_input\": onnx_mask_input,\n", " \"has_mask_input\": onnx_has_mask_input,\n", " \"orig_im_size\": np.array(image.shape[:2], dtype=np.float32)\n", "}\n", "\n", "masks, _, _ = ort_session.run(None, ort_inputs)\n", "masks = masks > predictor.model.mask_threshold" ] }, { "cell_type": "code", "execution_count": null, "id": "1e36554b", "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(10,10))\n", "plt.imshow(image)\n", "show_mask(masks, plt.gca())\n", "show_points(input_point, input_label, plt.gca())\n", "plt.axis('off')\n", "plt.show() " ] }, { "cell_type": "markdown", "id": "2ef211d0", "metadata": {}, "source": [ "### Example box and point input" ] }, { "cell_type": "code", "execution_count": null, "id": "51e58d2e", "metadata": {}, "outputs": [], "source": [ "input_box = np.array([425, 600, 700, 875])\n", "input_point = np.array([[575, 750]])\n", "input_label = np.array([0])" ] }, { "cell_type": "markdown", "id": "6e119dcb", "metadata": {}, "source": [ "Add a batch index, concatenate a box and point inputs, add the appropriate labels for the box corners, and transform. There is no padding point since the input includes a box input." ] }, { "cell_type": "code", "execution_count": null, "id": "bfbe4911", "metadata": {}, "outputs": [], "source": [ "onnx_box_coords = input_box.reshape(2, 2)\n", "onnx_box_labels = np.array([2,3])\n", "\n", "onnx_coord = np.concatenate([input_point, onnx_box_coords], axis=0)[None, :, :]\n", "onnx_label = np.concatenate([input_label, onnx_box_labels], axis=0)[None, :].astype(np.float32)\n", "\n", "onnx_coord = predictor.transform.apply_coords(onnx_coord, image.shape[:2]).astype(np.float32)" ] }, { "cell_type": "markdown", "id": "65edabd2", "metadata": {}, "source": [ "Package inputs, then predict and threshold the mask." ] }, { "cell_type": "code", "execution_count": null, "id": "2abfba56", "metadata": {}, "outputs": [], "source": [ "onnx_mask_input = np.zeros((1, 1, 256, 256), dtype=np.float32)\n", "onnx_has_mask_input = np.zeros(1, dtype=np.float32)\n", "\n", "ort_inputs = {\n", " \"image_embeddings\": image_embedding,\n", " \"point_coords\": onnx_coord,\n", " \"point_labels\": onnx_label,\n", " \"mask_input\": onnx_mask_input,\n", " \"has_mask_input\": onnx_has_mask_input,\n", " \"orig_im_size\": np.array(image.shape[:2], dtype=np.float32)\n", "}\n", "\n", "masks, _, _ = ort_session.run(None, ort_inputs)\n", "masks = masks > predictor.model.mask_threshold" ] }, { "cell_type": "code", "execution_count": null, "id": "8301bf33", "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(10, 10))\n", "plt.imshow(image)\n", "show_mask(masks[0], plt.gca())\n", "show_box(input_box, plt.gca())\n", "show_points(input_point, input_label, plt.gca())\n", "plt.axis('off')\n", "plt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.10" } }, "nbformat": 4, "nbformat_minor": 5 }