---
datasets:
- librispeech_asr
- declare-lab/MELD
- PolyAI/minds14
- google/fleurs
language:
- en
metrics:
- accuracy
- f1
- mae
- pearsonr
- exact_match
tags:
- audio
- speech
- pre-training
- spoken language understanding
- music
license: apache-2.0
---

**Repository:** https://github.com/declare-lab/segue

**Paper:** https://arxiv.org/abs/2305.12301

SEGUE is a pre-training approach for sequence-level spoken language understanding (SLU) tasks.
We use knowledge distillation on a parallel speech-text corpus (e.g. an ASR corpus) to distil
language understanding knowledge from a textual sentence embedder to a pre-trained speech encoder.
SEGUE applied to Wav2Vec 2.0 improves performance for many SLU tasks, including
intent classification / slot-filling, spoken sentiment analysis, and spoken emotion classification.
These improvements were observed in both fine-tuned and non-fine-tuned settings, as well as few-shot settings.
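The core idea is to regress the speech encoder's utterance embedding onto the frozen text embedder's embedding of the parallel transcript. As a minimal illustration of that distillation objective (the actual SEGUE loss and pooling details are in the paper; the numbers below are toy values):

```python
import numpy as np

def distillation_loss(speech_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """MSE between the student (speech) and teacher (text) embeddings."""
    return float(np.mean((speech_emb - text_emb) ** 2))

# Toy parallel pair: the teacher embeds the transcript, the student embeds the audio.
teacher_text_emb = np.array([0.2, -0.1, 0.5])   # from the frozen sentence embedder
student_speech_emb = np.array([0.1, 0.0, 0.4])  # from the speech encoder being trained

loss = distillation_loss(student_speech_emb, teacher_text_emb)
print(round(loss, 4))  # 0.01
```

Minimizing this loss over a parallel speech-text corpus pulls the speech encoder's embedding space toward the sentence embedder's, which is where the language-understanding knowledge transfers.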

## How to Get Started with the Model

To use this model checkpoint, you need to use the model classes on [our GitHub repository](https://github.com/declare-lab/segue).

```python3
from segue.modeling_segue import SegueModel
import soundfile

# assumes 16 kHz mono audio
raw_audio_array, sampling_rate = soundfile.read('example.wav')

model = SegueModel.from_pretrained('declare-lab/segue-w2v2-base')
inputs = model.processor(audio=raw_audio_array, sampling_rate=sampling_rate)
outputs = model(**inputs)
```

You do not need to create the `Processor` yourself; it is already available as `model.processor`.
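The snippet above assumes the file is already 16 kHz mono, as Wav2Vec 2.0-based encoders expect. If your audio has a different rate or multiple channels, a minimal numpy-only conversion might look like the sketch below (illustrative only; a dedicated resampler such as `librosa` or `torchaudio` is preferable in practice):

```python
import numpy as np

TARGET_SR = 16_000  # Wav2Vec 2.0-based models expect 16 kHz mono input

def to_16khz_mono(audio: np.ndarray, sr: int) -> np.ndarray:
    """Down-mix to mono and linearly resample to 16 kHz."""
    if audio.ndim == 2:                # (samples, channels) -> mono
        audio = audio.mean(axis=1)
    if sr != TARGET_SR:
        duration = audio.shape[0] / sr
        n_out = int(round(duration * TARGET_SR))
        old_t = np.linspace(0.0, duration, num=audio.shape[0], endpoint=False)
        new_t = np.linspace(0.0, duration, num=n_out, endpoint=False)
        audio = np.interp(new_t, old_t, audio)
    return audio

# 1 second of 44.1 kHz stereo audio becomes 16,000 mono samples
stereo = np.random.randn(44_100, 2)
print(to_16khz_mono(stereo, 44_100).shape)  # (16000,)
```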

`SegueForRegression` and `SegueForClassification` are also available. For classification,
the number of classes can be specified through the `n_classes` field in the model config,
e.g. `SegueForClassification.from_pretrained('declare-lab/segue-w2v2-base', n_classes=7)`.
Multi-label classification is also supported, e.g. `n_classes=[3, 7]` for two labels with 3 and 7 classes respectively.
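Conceptually, `n_classes=[3, 7]` means one independent classification head per label, each predicted separately. A hedged sketch of that structure (not the actual SEGUE implementation; names and the random projection here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_heads(hidden_dim: int, n_classes):
    """One linear head per label; n_classes=[3, 7] yields two heads."""
    if isinstance(n_classes, int):
        n_classes = [n_classes]
    return [rng.normal(size=(hidden_dim, n)) for n in n_classes]

def predict(pooled_emb: np.ndarray, heads):
    """Independent argmax per head -> one predicted class per label."""
    return [int(np.argmax(pooled_emb @ W)) for W in heads]

heads = make_heads(hidden_dim=8, n_classes=[3, 7])
pooled = rng.normal(size=8)          # stand-in for a pooled utterance embedding
preds = predict(pooled, heads)
print(len(preds))  # 2 (one prediction per label)
```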

Pre-training and downstream task training scripts are available on [our GitHub repository](https://github.com/declare-lab/segue).

## Results

We show only simplified MInDS-14 and MELD results for brevity.
Please refer to the paper for full results.

### MInDS-14 (intent classification)

*Note: we used only the en-US subset of MInDS-14.*

#### Fine-tuning

|Model|Accuracy|
|-|-|
|w2v 2.0|89.4±2.3|
|SEGUE|**97.6±0.5**|

*Note: Wav2Vec 2.0 fine-tuning was unstable. Only 3 out of 6 runs converged; the results shown are from converged runs only.*

#### Frozen encoder

|Model|Accuracy|
|-|-|
|w2v 2.0|54.0|
|SEGUE|**77.9**|

### MELD (sentiment and emotion classification)

#### Fine-tuning

|Model|Sentiment F1|Emotion F1|
|-|-|-|
|w2v 2.0|47.3|39.3|
|SEGUE|53.2|41.1|
|SEGUE (higher LR)|**54.1**|**47.2**|

*Note: Wav2Vec 2.0 fine-tuning was unstable at the higher LR.*

#### Frozen encoder

|Model|Sentiment F1|Emotion F1|
|-|-|-|
|w2v 2.0|45.0±0.7|34.3±1.2|
|SEGUE|**45.8±0.1**|**35.7±0.3**|

## Limitations

In the paper, we hypothesized that SEGUE may perform worse on tasks that rely less on
understanding and more on word detection. This may explain why SEGUE did not manage to
improve upon Wav2Vec 2.0 on the Fluent Speech Commands (FSC) task. We also experimented with
an ASR task (FLEURS), which heavily relies on word detection, to further demonstrate this.

However, this does not mean that SEGUE performs worse on intent classification tasks
in general. MInDS-14 benefited greatly from SEGUE despite also being an intent
classification task, as its more free-form utterances may benefit more from
understanding.

## Citation

```bibtex
@inproceedings{segue2023,
  title={Sentence Embedder Guided Utterance Encoder (SEGUE) for Spoken Language Understanding},
  author={Tan, Yi Xuan and Majumder, Navonil and Poria, Soujanya},
  booktitle={Interspeech},
  year={2023}
}
```