SamLowe committed
Commit 3350320 · 1 Parent(s): f875c5b

Updated README.md

Files changed (1)
  1. README.md +80 -0
README.md CHANGED
@@ -1,3 +1,83 @@
---
license: mit
---

This model is the ONNX version of [https://huggingface.co/SamLowe/roberta-base-go_emotions](https://huggingface.co/SamLowe/roberta-base-go_emotions).

### Full precision ONNX version

`onnx/model.onnx` is the full precision ONNX version

- that has identical accuracy to the original Transformers model
- and has the same model size (499MB)
- but is faster at inference than the standard Transformers model, particularly for smaller batch sizes
  - in my tests, about 2x to 3x as fast for a batch size of 1 on an 8-core 11th gen i7 CPU using ONNX Runtime

### Quantized (INT8) ONNX version

`onnx/model_quantized.onnx` is the INT8 quantized version

- that is one quarter the size (125MB) of the full precision model (above)
- but delivers almost all of the accuracy
- and is faster at inference
  - about 2x as fast for a batch size of 1 on an 8-core 11th gen i7 CPU using ONNX Runtime, vs the full precision model above
  - which makes it circa 5x as fast as the full precision model running in normal Transformers (on the above-mentioned CPU, for a batch size of 1)
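
For reference, a dynamic INT8 quantization of this kind can be produced with the Optimum library's `ORTQuantizer`. The sketch below is for illustration only (it is not necessarily how `onnx/model_quantized.onnx` was generated), and the input/output paths are placeholders:

```python
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# load the full precision ONNX model from a local directory (placeholder path)
quantizer = ORTQuantizer.from_pretrained("path_to_dir_containing_onnx_model", file_name="model.onnx")

# dynamic (no calibration data needed) INT8 quantization config for AVX512-VNNI CPUs
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# writes model_quantized.onnx into save_dir (placeholder path)
quantizer.quantize(save_dir="path_to_output_dir", quantization_config=dqconfig)
```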

### How to use

#### Using Optimum Library ONNX Classes

To follow.
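
In the meantime, here is a minimal sketch of one way the Optimum ONNX classes can be used with these files. The repository id and `file_name` below are assumptions to illustrate the pattern; `top_k=None` simply returns the scores for all labels:

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification

# assumed repository id / file name - point these at this repo and the ONNX file you want to use
model = ORTModelForSequenceClassification.from_pretrained(
    "SamLowe/roberta-base-go_emotions-onnx", file_name="onnx/model_quantized.onnx"
)
tokenizer = AutoTokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")

# go_emotions is multi-label, so apply a sigmoid (not a softmax) and keep all label scores
onnx_classifier = pipeline(
    task="text-classification",
    model=model,
    tokenizer=tokenizer,
    top_k=None,
    function_to_apply="sigmoid",
)

print(onnx_classifier(["hello world", "I am not having a great day"]))
```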

#### Using ONNX Runtime

- Tokenization can be done beforehand with the `tokenizers` library,
- the token IDs and attention mask can then be fed into ONNX Runtime as the dict of inputs it expects,
- and only a sigmoid post-processing step is needed afterwards on the model output (which comes back as a numpy array) to turn the logits into per-label scores, as in the example below.

```python
from tokenizers import Tokenizer
import onnxruntime as ort

from os import cpu_count
import numpy as np  # used to build the int64 input arrays and for the postprocessing sigmoid

sentences = ["hello world"]  # for example a batch of 1

tokenizer = Tokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")

# optional - set padding to only pad to the longest sequence in the batch, not a fixed length.
# Without this, the model will run slower, especially for shorter input strings.
params = {**tokenizer.padding, "length": None}
tokenizer.enable_padding(**params)

tokens_obj = tokenizer.encode_batch(sentences)

def load_onnx_model(model_filepath):
    _options = ort.SessionOptions()
    _options.inter_op_num_threads, _options.intra_op_num_threads = cpu_count(), cpu_count()
    _providers = ["CPUExecutionProvider"]  # could use ort.get_available_providers()
    return ort.InferenceSession(path_or_bytes=model_filepath, sess_options=_options, providers=_providers)

model = load_onnx_model("path_to_model_dot_onnx_or_model_quantized_dot_onnx")
output_names = [model.get_outputs()[0].name]  # e.g. ["logits"]

# ONNX Runtime expects int64 numpy arrays for the input IDs and attention mask
model_input = {
    "input_ids": np.array([t.ids for t in tokens_obj], dtype=np.int64),
    "attention_mask": np.array([t.attention_mask for t in tokens_obj], dtype=np.int64),
}

def sigmoid(_outputs):
    return 1.0 / (1.0 + np.exp(-_outputs))

model_output = model.run(
    output_names=output_names,
    input_feed=model_input,
)[0]

scores = sigmoid(model_output)  # per-label probabilities, one row per input sentence
print(scores)
```
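
The `scores` array has one value per go_emotions label for each input sentence. As a follow-up sketch (not part of the original example), one way to turn those values into label names is to read the label mapping from the original model's `config.json`; the 0.5 threshold below is just an arbitrary illustration:

```python
from transformers import AutoConfig

# id2label comes from the original fine-tuned model's config.json
config = AutoConfig.from_pretrained("SamLowe/roberta-base-go_emotions")
labels = [config.id2label[i] for i in range(len(config.id2label))]

threshold = 0.5  # arbitrary example threshold for this multi-label task
for sentence, row in zip(sentences, scores):
    predicted = [(label, float(score)) for label, score in zip(labels, row) if score >= threshold]
    print(sentence, predicted)
```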

### Example notebook: showing usage, accuracy & performance

Notebook with more details to follow.