---
license: mit
---

This model is the ONNX version of [https://huggingface.co/SamLowe/roberta-base-go_emotions](https://huggingface.co/SamLowe/roberta-base-go_emotions).

### Full precision ONNX version

`onnx/model.onnx` is the full precision ONNX version

- that has identical performance to the original transformers model
- and has the same model size (499MB)
- is faster at inference than the normal Transformers model, particularly for smaller batch sizes
  - in my tests, about 2x to 3x as fast for a batch size of 1 on an 8 core 11th gen i7 CPU using ONNXRuntime

### Quantized (INT8) ONNX version

`onnx/model_quantized.onnx` is the INT8 quantized version (a sketch of how such a file can be produced follows the list below)

- that is one quarter the size (125MB) of the full precision model (above)
- but delivers almost all of the accuracy
- is faster at inference
  - about 2x as fast for a batch size of 1 on an 8 core 11th gen i7 CPU using ONNXRuntime vs the full precision model above
  - which makes it circa 5x as fast as the full precision normal Transformers model (on the above mentioned CPU, for a batch of 1)
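
For reference, an INT8 file like this can be produced with ONNXRuntime's dynamic quantization tooling. The snippet below is only an illustrative sketch (the input and output paths are assumptions), not necessarily how this particular file was created:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# assumed paths - point these at the full precision model and the desired output file
quantize_dynamic(
    model_input="onnx/model.onnx",
    model_output="onnx/model_quantized.onnx",
    weight_type=QuantType.QInt8,  # quantize the weights to INT8
)
```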

### How to use

#### Using Optimum Library ONNX Classes

To follow.
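
In the meantime, a minimal sketch using Optimum's `ORTModelForSequenceClassification` with a `transformers` pipeline is below. The repository id is a placeholder (adjust it to wherever these ONNX files are hosted), and `file_name` is assumed to accept the `onnx/` subfolder path:

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

repo_id = "SamLowe/roberta-base-go_emotions"  # placeholder - use the id of this ONNX repo
model = ORTModelForSequenceClassification.from_pretrained(repo_id, file_name="onnx/model_quantized.onnx")
tokenizer = AutoTokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")

# go_emotions is multi-label, so return all label scores and apply a sigmoid to the logits
onnx_classifier = pipeline(
    task="text-classification",
    model=model,
    tokenizer=tokenizer,
    top_k=None,
    function_to_apply="sigmoid",
)

print(onnx_classifier(["hello world"]))
```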

#### Using ONNXRuntime

- Tokenization can be done beforehand with the `tokenizers` library,
- the resulting token ids and attention mask are then fed into ONNXRuntime as the dict of named inputs it expects,
- and the only postprocessing needed afterward is a sigmoid on the model output (which comes back as a numpy array of logits) to produce the per-label probabilities.

```python
from tokenizers import Tokenizer
import onnxruntime as ort

from os import cpu_count
import numpy as np  # used for the input arrays and the postprocessing sigmoid

sentences = ["hello world"]  # for example a batch of 1

tokenizer = Tokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")

# optional - set pad to only pad to longest in batch, not a fixed length. Without this, the model will run slower, esp for shorter input strings.
params = {**tokenizer.padding, "length": None}
tokenizer.enable_padding(**params)

tokens_obj = tokenizer.encode_batch(sentences)

def load_onnx_model(model_filepath):
    _options = ort.SessionOptions()
    _options.inter_op_num_threads, _options.intra_op_num_threads = cpu_count(), cpu_count()
    _providers = ["CPUExecutionProvider"]  # could use ort.get_available_providers()
    return ort.InferenceSession(path_or_bytes=model_filepath, sess_options=_options, providers=_providers)

model = load_onnx_model("path_to_model_dot_onnx_or_model_quantized_dot_onnx")
output_names = [model.get_outputs()[0].name]  # E.g. ["logits"]

# ONNXRuntime expects numpy arrays; this model's inputs are int64 token ids and attention mask
input_feed_dict = {
    "input_ids": np.array([t.ids for t in tokens_obj], dtype=np.int64),
    "attention_mask": np.array([t.attention_mask for t in tokens_obj], dtype=np.int64),
}

def sigmoid(_outputs):
    # go_emotions is multi-label, so a sigmoid (not softmax) maps logits to per-label probabilities
    return 1.0 / (1.0 + np.exp(-_outputs))

model_output = model.run(output_names=output_names, input_feed=input_feed_dict)[0]

probabilities = sigmoid(model_output)  # shape: (batch_size, num_labels)
print(probabilities)
```
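
To turn those per-label scores into named emotions, the `id2label` mapping from the original model's config can be used. A small, assumed follow-on (it relies on the original repo's config listing the 28 go_emotions labels in index order):

```python
from transformers import AutoConfig

# assumption: the original model's config carries id2label for the 28 go_emotions labels
config = AutoConfig.from_pretrained("SamLowe/roberta-base-go_emotions")

for sentence, scores in zip(sentences, probabilities):
    labelled = {config.id2label[i]: float(score) for i, score in enumerate(scores)}
    top_5 = sorted(labelled.items(), key=lambda kv: kv[1], reverse=True)[:5]
    print(sentence, top_5)
```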

### Example notebook: showing usage, accuracy & performance

Notebook with more details to follow.