---
language: en
tags:
- text-classification
- onnx
- bge-small-en-v1.5
- emotions
- multi-class-classification
- multi-label-classification
datasets:
- go_emotions
models:
- BAAI/bge-small-en-v1.5
license: mit
inference: false
widget:
- text: ONNX is so much faster, its very handy!
---

### Overview

This is a multi-label, multi-class linear classifier for emotions that works with [BGE-small-en-v1.5 embeddings](https://huggingface.co/BAAI/bge-small-en-v1.5), trained on the [go_emotions](https://huggingface.co/datasets/go_emotions) dataset.
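
For reference, compatible embeddings can be produced with [sentence-transformers](https://www.sbert.net). A minimal sketch, assuming the usual BGE convention of normalised embeddings:

```python
# A minimal sketch of producing compatible embeddings with sentence-transformers.
# Whether the classifier was trained on normalised embeddings is an assumption here,
# following the standard BGE usage recommendation.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
embeddings = model.encode(
    ["ONNX is so much faster, its very handy!"],
    normalize_embeddings=True,
)
print(embeddings.shape)  # (1, 384)
```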

### Labels

The 28 labels from the [go_emotions](https://huggingface.co/datasets/go_emotions) dataset are:
```
['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral']
```

### Metrics (exact match of labels per item)

This is a multi-label, multi-class dataset, so each label is effectively a separate binary classification. The metrics below are evaluated across all labels per item on the go_emotions test split.

Optimising the threshold per label to maximise F1, the metrics (evaluated on the go_emotions test split) are:

- Precision: 0.445
- Recall: 0.476
- F1: 0.449

Weighted by the relative support of each label in the dataset, this is:

- Precision: 0.472
- Recall: 0.582
- F1: 0.514

Using a fixed threshold of 0.5 to convert the scores to binary predictions for each label, the metrics (evaluated on the go_emotions test split, unweighted by support) are:

- Precision: 0.602
- Recall: 0.250
- F1: 0.303
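
A sketch of how the unweighted and support-weighted aggregates above can be computed with scikit-learn. The arrays here are hypothetical stand-ins; in practice they would be the gold labels and thresholded predictions for the test split:

```python
# Hypothetical stand-ins: binary gold labels and thresholded predictions,
# shape (n_items, n_labels). In practice these come from the go_emotions test split.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 28))
y_pred = rng.integers(0, 2, size=(100, 28))

# Unweighted (macro) average over the 28 labels
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro", zero_division=0)

# Average weighted by the relative support of each label
pw, rw, f1w, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted", zero_division=0)
```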

### Metrics (per-label)

This is a multi-label, multi-class dataset, so each label is effectively a separate binary classification and metrics are better measured per label.

Optimising the threshold per label to maximise F1, the metrics (evaluated on the go_emotions test split) are:
|                |   f1  | precision | recall | support | threshold |
| -------------- | ----- | --------- | ------ | ------- | --------- |
| admiration     | 0.583 | 0.574     | 0.593  |  504    |      0.30 |
| amusement      | 0.668 | 0.722     | 0.621  |  264    |      0.25 |
| anger          | 0.350 | 0.309     | 0.404  |  198    |      0.15 |
| annoyance      | 0.299 | 0.318     | 0.281  |  320    |      0.20 |
| approval       | 0.338 | 0.281     | 0.425  |  351    |      0.15 |
| caring         | 0.321 | 0.323     | 0.319  |  135    |      0.20 |
| confusion      | 0.384 | 0.313     | 0.497  |  153    |      0.15 |
| curiosity      | 0.467 | 0.432     | 0.507  |  284    |      0.20 |
| desire         | 0.426 | 0.381     | 0.482  |   83    |      0.20 |
| disappointment | 0.210 | 0.147     | 0.364  |  151    |      0.10 |
| disapproval    | 0.366 | 0.288     | 0.502  |  267    |      0.15 |
| disgust        | 0.416 | 0.409     | 0.423  |  123    |      0.20 |
| embarrassment  | 0.370 | 0.341     | 0.405  |   37    |      0.30 |
| excitement     | 0.313 | 0.368     | 0.272  |  103    |      0.25 |
| fear           | 0.615 | 0.677     | 0.564  |   78    |      0.40 |
| gratitude      | 0.828 | 0.810     | 0.847  |  352    |      0.25 |
| grief          | 0.545 | 0.600     | 0.500  |    6    |      0.85 |
| joy            | 0.455 | 0.429     | 0.484  |  161    |      0.20 |
| love           | 0.642 | 0.673     | 0.613  |  238    |      0.30 |
| nervousness    | 0.350 | 0.412     | 0.304  |   23    |      0.60 |
| optimism       | 0.439 | 0.417     | 0.462  |  186    |      0.20 |
| pride          | 0.480 | 0.667     | 0.375  |   16    |      0.70 |
| realization    | 0.232 | 0.191     | 0.297  |  145    |      0.10 |
| relief         | 0.353 | 0.500     | 0.273  |   11    |      0.50 |
| remorse        | 0.643 | 0.529     | 0.821  |   56    |      0.20 |
| sadness        | 0.526 | 0.497     | 0.558  |  156    |      0.20 |
| surprise       | 0.329 | 0.318     | 0.340  |  141    |      0.15 |
| neutral        | 0.634 | 0.528     | 0.794  | 1787    |      0.30 |

The thresholds are stored in `thresholds.json`.
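
A minimal sketch of the kind of per-label threshold search that produces such a table (the data here is hypothetical, and the candidate grid is an assumption):

```python
# A minimal sketch of per-label threshold selection maximising F1.
# scores and y_true are hypothetical stand-ins of shape (n_items, 28):
# positive-class probabilities and binary gold labels respectively.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
scores = rng.random((100, 28))
y_true = rng.integers(0, 2, size=(100, 28))
labels = ["admiration", "amusement"]  # in practice, the 28 names listed above

thresholds = {}
for i, label in enumerate(labels):
    candidates = np.arange(0.05, 0.95, 0.05)
    f1s = [f1_score(y_true[:, i], (scores[:, i] >= t).astype(int), zero_division=0)
           for t in candidates]
    thresholds[label] = float(candidates[int(np.argmax(f1s))])
```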

### Use with ONNXRuntime

The input to the model is called `logits`, and there is one output per label. Each output produces a 2D array with one row per input row; each row has two columns, the first being the probability of the negative case and the second the probability of the positive case.

```python
# Assuming you have embeddings from BAAI/bge-small-en-v1.5 for the input sentences,
# e.g. produced via sentence-transformers (huggingface.co/BAAI/bge-small-en-v1.5)
# or via an ONNX version (huggingface.co/Xenova/bge-small-en-v1.5)

import onnxruntime as ort

print(embeddings.shape)  # e.g. a batch of 1 sentence
> (1, 384)

sess = ort.InferenceSession("path_to_model_dot_onnx", providers=['CPUExecutionProvider'])

outputs = [o.name for o in sess.get_outputs()]  # list of labels, in the order of the outputs
preds_onnx = sess.run(outputs, {'logits': embeddings})
# preds_onnx is a list with 28 entries, one per label,
# each a numpy array of shape (1, 2) given the input was a batch of 1

print(outputs[0])
> surprise
print(preds_onnx[0])
> array([[0.97136074, 0.02863926]], dtype=float32)

# load thresholds.json and use it (per label) to convert the positive-case score
# to a binary prediction, as sketched below
```
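
Continuing the snippet above, a sketch of applying the stored thresholds. The exact format of `thresholds.json` is assumed here to be a simple label-to-threshold mapping:

```python
# Continuing the snippet above: convert each positive-case score to a binary
# prediction using the per-label thresholds. Assumes thresholds.json is a
# flat {label: threshold} mapping (format assumed, not documented above).
import json

with open("thresholds.json") as f:
    thresholds = json.load(f)

predictions = {
    label: bool(pred[0, 1] >= thresholds[label])
    for label, pred in zip(outputs, preds_onnx)
}
print(predictions["surprise"])  # False for the example scores above (0.0286 < 0.15)
```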

### Commentary on the dataset

Some labels (e.g. gratitude), when considered independently, perform very strongly, whilst others (e.g. relief) perform very poorly.

This is a challenging dataset. Labels such as relief have far fewer examples in the training data (fewer than 100 out of the 40k+ rows, and only 11 in the test split).

But there is also some ambiguity and/or labelling error visible in the training data of go_emotions, which likely constrains performance. Cleaning the dataset to reduce the mistakes, ambiguity, conflicts and duplication in the labelling would likely produce a higher-performing model.