---
license: mit
tags:
- sentence-embeddings
- endpoints-template
- optimum
library_name: generic
---

# Optimized and Quantized [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) with a custom pipeline.py

This repository implements a `custom` task for `sentence-embeddings` for 🤗 Inference Endpoints, enabling accelerated inference with [🤗 Optimum](https://huggingface.co/docs/optimum/index). The code for the customized pipeline is in [pipeline.py](https://huggingface.co/philschmid/all-MiniLM-L6-v2-optimum-embeddings/blob/main/pipeline.py).

In [How to create your own optimized and quantized model](#how-to-create-your-own-optimized-and-quantized-model) you will learn how the model was converted and optimized. It is based on the [Accelerate Sentence Transformers with Hugging Face Optimum](https://www.philschmid.de/optimize-sentence-transformers) blog post and also covers how to create the custom pipeline and test it. There is also a [notebook](https://huggingface.co/philschmid/all-MiniLM-L6-v2-optimum-embeddings/blob/main/convert.ipynb) included.

To deploy this model as an Inference Endpoint you have to select `Custom` as the task so that the `pipeline.py` file is used. -> _double check that it is selected_

### Expected Request payload

```json
{
  "inputs": "The sky is a blue today and not gray"
}
```

Below is an example of how to run a request using Python and `requests`.

## Run Request

```python
import requests as r

ENDPOINT_URL = ""
HF_TOKEN = ""


def predict(document_string: str = None):
    payload = {"inputs": document_string}
    response = r.post(
        ENDPOINT_URL,
        headers={"Authorization": f"Bearer {HF_TOKEN}"},
        json=payload
    )
    return response.json()


prediction = predict(
    document_string="The sky is a blue today and not gray"
)
```

expected output

```python
{'embeddings': [[-0.021580450236797333,
  0.021715054288506508,
  0.00979710929095745,
  -0.0005379787762649357,
  0.04682469740509987,
  -0.013600599952042103,
...
}
```

## How to create your own optimized and quantized model

Steps:
1. [Convert model to ONNX](#1-convert-model-to-onnx)
2. [Optimize & quantize model with Optimum](#2-optimize--quantize-model-with-optimum)
3. [Create Custom Handler for Inference Endpoints](#3-create-custom-handler-for-inference-endpoints)

Helpful links:
* [Accelerate Sentence Transformers with Hugging Face Optimum](https://www.philschmid.de/optimize-sentence-transformers)
* [Create Custom Handler Endpoints](https://link-to-docs)

## Setup & Installation

```python
%%writefile requirements.txt
optimum[onnxruntime]==1.3.0
mkl-include
mkl
```

install requirements

```python
!pip install -r requirements.txt
```

## 1. Convert model to ONNX

```python
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer
from pathlib import Path

model_id = "sentence-transformers/all-MiniLM-L6-v2"
onnx_path = Path(".")

# load vanilla transformers and convert to onnx
model = ORTModelForFeatureExtraction.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# save onnx checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)
```
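If you want to sanity-check the exported model before optimizing it, a minimal sketch like the following should work, assuming the files were saved to the current directory as above:

```python
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

# reload the exported ONNX checkpoint and tokenizer from the local directory
onnx_model = ORTModelForFeatureExtraction.from_pretrained(".")
onnx_tokenizer = AutoTokenizer.from_pretrained(".")

# encode an example sentence and run it through the ONNX model
encoded = onnx_tokenizer("The sky is a blue today and not gray", return_tensors="pt")
outputs = onnx_model(**encoded)

# the first output holds the token embeddings: (batch, seq_len, hidden_size)
print(outputs[0].shape)
```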
## 2. Optimize & quantize model with Optimum

```python
from optimum.onnxruntime import ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import OptimizationConfig, AutoQuantizationConfig

# create ORTOptimizer and define optimization configuration
optimizer = ORTOptimizer.from_pretrained(model_id, feature=model.pipeline_task)
optimization_config = OptimizationConfig(optimization_level=99)  # enable all optimizations

# apply the optimization configuration to the model
optimizer.export(
    onnx_model_path=onnx_path / "model.onnx",
    onnx_optimized_model_output_path=onnx_path / "model-optimized.onnx",
    optimization_config=optimization_config,
)

# create ORTQuantizer and define quantization configuration
dynamic_quantizer = ORTQuantizer.from_pretrained(model_id, feature=model.pipeline_task)
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# apply the quantization configuration to the model
model_quantized_path = dynamic_quantizer.export(
    onnx_model_path=onnx_path / "model-optimized.onnx",
    onnx_quantized_model_output_path=onnx_path / "model-quantized.onnx",
    quantization_config=dqconfig,
)
```

## 3. Create Custom Handler for Inference Endpoints

```python
%%writefile pipeline.py
from typing import Dict, List, Any
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer
import torch.nn.functional as F
import torch

# copied from the model card
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


class PreTrainedPipeline():
    def __init__(self, path=""):
        # load the optimized and quantized model
        self.model = ORTModelForFeatureExtraction.from_pretrained(path, file_name="model-quantized.onnx")
        self.tokenizer = AutoTokenizer.from_pretrained(path)

    def __call__(self, data: Any) -> List[List[Dict[str, float]]]:
        """
        Args:
            data (:obj:):
                includes the input data and the parameters for the inference.
        Return:
            A :obj:`list`. The list contains the embeddings of the inference inputs.
        """
        inputs = data.get("inputs", data)

        # tokenize the input
        encoded_inputs = self.tokenizer(inputs, padding=True, truncation=True, return_tensors='pt')
        # run the model
        outputs = self.model(**encoded_inputs)
        # perform pooling
        sentence_embeddings = mean_pooling(outputs, encoded_inputs['attention_mask'])
        # normalize embeddings
        sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
        # postprocess the prediction
        return {"embeddings": sentence_embeddings.tolist()}
```

test custom pipeline

```python
from pipeline import PreTrainedPipeline

# init handler
my_handler = PreTrainedPipeline(path=".")

# prepare sample payload
request = {"inputs": "I am quite excited how this will turn out"}

# test the handler
%timeit my_handler(request)
```

results

```
1.55 ms ± 2.04 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
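To put that latency in context, you can time the vanilla PyTorch model with the same pooling and payload. Below is a minimal sketch that reuses `mean_pooling` from `pipeline.py`; the numbers you get will depend on your hardware:

```python
from pipeline import mean_pooling  # reuse the pooling helper defined above
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

# load the unoptimized PyTorch checkpoint for comparison
vanilla_model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
vanilla_tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")


def vanilla_embed(sentence: str):
    encoded = vanilla_tokenizer(sentence, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = vanilla_model(**encoded)
    embeddings = mean_pooling(outputs, encoded["attention_mask"])
    return F.normalize(embeddings, p=2, dim=1)


# same payload as above, timed the same way
%timeit vanilla_embed("I am quite excited how this will turn out")
```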