---
language: en
tags:
- text-classification
- onnx
- emotions
- multi-class-classification
- multi-label-classification
datasets:
- go_emotions
license: mit
inference: false
widget:
- text: ONNX is so much faster, its very handy!
---

### Overview

This is a multi-label, multi-class linear classifier for emotions, trained on the [go_emotions](https://huggingface.co/datasets/go_emotions) dataset and designed to work with embeddings from [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

### Labels

The 28 labels from the [go_emotions](https://huggingface.co/datasets/go_emotions) dataset are:
```
['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral']
```

### Metrics (exact match of labels per item)

This is a multi-label, multi-class dataset, so each label is effectively a separate binary classification. Evaluated across all labels per item on the go_emotions test split, the metrics are as follows.

Tuning the threshold per label to maximise F1, the metrics (evaluated on the go_emotions test split, unweighted by support) are:

- Precision: 0.384
- Recall: 0.438
- F1: 0.397

Weighted by the relative support of each label in the dataset, the metrics are:

- Precision: 0.443
- Recall: 0.552
- F1: 0.484

Using a fixed threshold of 0.5 to convert the scores to binary predictions for each label, the metrics (evaluated on the go_emotions test split, and unweighted by support) are:

- Precision: 0.551
- Recall: 0.211
- F1: 0.261
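
For reference, aggregate numbers like these can be computed with scikit-learn once per-label binary predictions are in hand. A minimal sketch (the random `y_true`/`y_pred` matrices below are placeholders, not the actual evaluation data):

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Placeholder 0/1 matrices of shape (n_items, n_labels); in practice these
# are the go_emotions test-split targets and the thresholded model outputs.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(1000, 28))
y_pred = rng.integers(0, 2, size=(1000, 28))

# "macro" treats all labels equally (unweighted by support);
# "weighted" weights each label by its support in y_true.
p_m, r_m, f1_m, _ = precision_recall_fscore_support(y_true, y_pred, average="macro", zero_division=0)
p_w, r_w, f1_w, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted", zero_division=0)
print(f"macro    P={p_m:.3f} R={r_m:.3f} F1={f1_m:.3f}")
print(f"weighted P={p_w:.3f} R={r_w:.3f} F1={f1_w:.3f}")
```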

### Metrics (per-label)

This is a multi-label, multi-class dataset, so each label is effectively a separate binary classification, and metrics are better measured per label.

Tuning the threshold per label to maximise F1, the per-label metrics (evaluated on the go_emotions test split) are:
|                |    f1 | precision | recall | support | threshold |
| -------------- | ----- | --------- | ------ | ------- | --------- |
| admiration     | 0.529 |     0.499 |  0.563 |     504 |      0.25 |
| amusement      | 0.733 |     0.672 |  0.807 |     264 |      0.20 |
| anger          | 0.394 |     0.363 |  0.429 |     198 |      0.15 |
| annoyance      | 0.293 |     0.252 |  0.350 |     320 |      0.15 |
| approval       | 0.292 |     0.345 |  0.254 |     351 |      0.20 |
| caring         | 0.320 |     0.270 |  0.393 |     135 |      0.15 |
| confusion      | 0.291 |     0.276 |  0.307 |     153 |      0.15 |
| curiosity      | 0.366 |     0.307 |  0.454 |     284 |      0.15 |
| desire         | 0.317 |     0.269 |  0.386 |      83 |      0.15 |
| disappointment | 0.159 |     0.127 |  0.212 |     151 |      0.10 |
| disapproval    | 0.306 |     0.341 |  0.277 |     267 |      0.20 |
| disgust        | 0.405 |     0.412 |  0.398 |     123 |      0.20 |
| embarrassment  | 0.364 |     0.414 |  0.324 |      37 |      0.35 |
| excitement     | 0.296 |     0.232 |  0.408 |     103 |      0.15 |
| fear           | 0.496 |     0.576 |  0.436 |      78 |      0.40 |
| gratitude      | 0.793 |     0.787 |  0.798 |     352 |      0.30 |
| grief          | 0.323 |     0.200 |  0.833 |       6 |      0.45 |
| joy            | 0.402 |     0.341 |  0.491 |     161 |      0.15 |
| love           | 0.640 |     0.679 |  0.605 |     238 |      0.30 |
| nervousness    | 0.263 |     0.333 |  0.217 |      23 |      0.70 |
| optimism       | 0.433 |     0.453 |  0.414 |     186 |      0.20 |
| pride          | 0.429 |     0.500 |  0.375 |      16 |      0.50 |
| realization    | 0.177 |     0.159 |  0.200 |     145 |      0.10 |
| relief         | 0.182 |     0.182 |  0.182 |      11 |      0.40 |
| remorse        | 0.541 |     0.500 |  0.589 |      56 |      0.30 |
| sadness        | 0.461 |     0.467 |  0.455 |     156 |      0.20 |
| surprise       | 0.302 |     0.299 |  0.305 |     141 |      0.15 |
| neutral        | 0.620 |     0.505 |  0.803 |    1787 |      0.30 |

The thresholds are stored in `thresholds.json`.
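
One way to derive per-label thresholds like these (not necessarily how the stored ones were produced) is to sweep the precision-recall curve for each label and keep the threshold that maximises F1, e.g. with scikit-learn. A sketch for a single label, with `y_true` and `y_score` as placeholders for that label's test-split targets and positive-case scores:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_score):
    """Return the threshold maximising F1 for one label's binary task."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    # The last precision/recall point has no associated threshold, so drop it
    return float(thresholds[np.argmax(f1[:-1])])
```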

### Use with ONNXRuntime

The model's input is named `logits`, and there is one output per label. Each output is a 2D array with one row per input row and two columns: the first is the probability of the negative case, the second the probability of the positive case.
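
For context, a minimal way to produce suitable embeddings with the sentence-transformers package (a sketch, assuming the package is installed; any pipeline that yields the 384-dimensional all-MiniLM-L6-v2 embeddings works):

```python
from sentence_transformers import SentenceTransformer

# The matching embedding model for this classifier
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = encoder.encode(["ONNX is so much faster, its very handy!"])
print(embeddings.shape)  # (1, 384)
```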

```python
# Assuming you have embeddings from all-MiniLM-L6-v2 for the input sentences,
# e.g. produced with sentence-transformers as sketched above
#      (huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
#      or with an ONNX version, e.g. huggingface.co/Xenova/all-MiniLM-L6-v2
import onnxruntime as ort

print(embeddings.shape)  # e.g. a batch of 1 sentence
# (1, 384)

sess = ort.InferenceSession("path_to_model_dot_onnx", providers=['CPUExecutionProvider'])

# The output names are the labels, in the order of the outputs
outputs = [o.name for o in sess.get_outputs()]
preds_onnx = sess.run(outputs, {'logits': embeddings})
# preds_onnx is a list with 28 entries, one per label,
# each a numpy array of shape (1, 2) given the input was a batch of 1

print(outputs[0])
# surprise
print(preds_onnx[0])
# array([[0.97136074, 0.02863926]], dtype=float32)

# Load thresholds.json and use it (per label) to convert the positive-case
# score to a binary prediction (see the sketch below)
```
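
A sketch of that last step, assuming `thresholds.json` maps each label name to its tuned threshold (the exact file structure is an assumption; adjust the parsing to match):

```python
import json

with open("thresholds.json") as f:
    thresholds = json.load(f)  # assumed format: {"admiration": 0.25, ...}

# Positive-case score (second column) for the first input row, per label
scores = {label: float(pred[0, 1]) for label, pred in zip(outputs, preds_onnx)}
predicted = [label for label, s in scores.items() if s >= thresholds[label]]
print(predicted)  # labels whose positive-case score cleared their threshold
```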

### Commentary on the dataset

Some labels (e.g. gratitude), when considered independently, perform very strongly, whilst others (e.g. relief) perform very poorly.

This is a challenging dataset. Labels such as relief have far fewer examples in the training data (fewer than 100 out of the 40k+, and only 11 in the test split).

But there is also some ambiguity and/or labelling error visible in the go_emotions training data that likely constrains performance. Cleaning the dataset to reduce mistakes, ambiguity, conflicts and duplication in the labelling would likely produce a higher-performing model.