---
language:
- en
library_name: nemo
datasets:
- SLURP
thumbnail: null
tags:
- spoken-language-understanding
- speech-intent-classification
- speech-slot-filling
- SLURP
- Conformer
- Transformer
- pytorch
- NeMo
license: cc-by-4.0
model-index:
- name: slu_conformer_transformer_large_slurp
  results:
  - task:
      name: Slot Filling
      type: slot-filling
    dataset:
      name: SLURP
      type: slurp
      split: test
    metrics:
    - name: F1
      type: f1
      value: 82.27
  - task:
      name: Intent Classification
      type: intent-classification
    dataset:
      name: SLURP
      type: slurp
      split: test
    metrics:
    - name: Accuracy
      type: acc
      value: 90.14
 
---
# NeMo End-to-End Speech Intent Classification and Slot Filling

## Model Overview

This model performs joint intent classification and slot filling, directly from audio input. The model treats the problem as an audio-to-text problem, where the output text is the flattened string representation of the semantics annotation. The model is trained on the SLURP dataset [1].
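
As a purely illustrative sketch of what such a flattened target string could look like (the field names follow the SLURP annotation style; this specific utterance and its labels are hypothetical, not taken from the dataset):

```python
# Illustrative only: a SLURP-style semantics annotation flattened into a single
# target string. The utterance and labels below are hypothetical examples.
semantics = {
    "scenario": "alarm",
    "action": "set_alarm",
    "entities": [{"type": "time", "filler": "seven am"}],
}

# The model is trained to emit one text sequence like this, so intent
# classification and slot filling reduce to an audio-to-text prediction problem.
target_text = str(semantics)
print(target_text)
```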

## Model Architecture

The model has an encoder-decoder architecture, where the encoder is a Conformer-Large model [2] and the decoder is a three-layer Transformer decoder [3]. We use a Conformer encoder pretrained on NeMo ASR-Set (details [here](https://ngc.nvidia.com/models/nvidia:nemo:stt_en_conformer_ctc_large)), while the decoder is trained from scratch. Start-of-sentence (BOS) and end-of-sentence (EOS) tokens are added to each sentence. The model is trained end-to-end by minimizing the negative log-likelihood loss with teacher forcing. During inference, the prediction is generated by beam search, where a BOS token is used to trigger the generation process.
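
The sketch below illustrates the BOS-triggered autoregressive decoding loop described above (greedy variant for brevity; the released model uses beam search). The toy scoring function is a stand-in for the real Conformer encoder and Transformer decoder, not the NeMo API:

```python
# Minimal sketch of BOS-triggered autoregressive decoding (greedy for brevity;
# the released model uses beam search). `next_token_scores` is a toy stand-in
# for the Conformer encoder + Transformer decoder; it is NOT the NeMo API.
BOS, EOS = 0, 1

def next_token_scores(prefix):
    # Toy behaviour: emit a few tokens, then strongly prefer EOS.
    return {EOS: 1.0} if len(prefix) > 3 else {2 + len(prefix): 1.0, EOS: 0.1}

def greedy_decode(max_len=16):
    tokens = [BOS]                      # generation is triggered by the BOS token
    while len(tokens) < max_len:
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)
        tokens.append(best)
        if best == EOS:                 # stop once EOS is produced
            break
    return tokens

print(greedy_decode())
```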

## Training

The NeMo toolkit [4] was used to train the model for around 100 epochs. The model is trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/slu/slurp/run_slurp_train.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/slu/slurp/configs/conformer_transformer_large_bpe.yaml).

The tokenizer for this model was built using the semantics annotations of the training set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py). We use a vocabulary size of 58, including the BOS, EOS, and padding tokens.

Details on how to train the model can be found [here](https://github.com/NVIDIA/NeMo/blob/main/examples/slu/speech_intent_slot/README.md).

### Datasets

The model is trained on the combined real and synthetic training sets of the SLURP dataset. 


## Performance

|       |                                                  |                |                          | **Intent (Scenario_Action)** |               | **Entity** |        |              | **SLURP Metrics** |                     |
|-------|--------------------------------------------------|----------------|--------------------------|------------------------------|---------------|------------|--------|--------------|-------------------|---------------------|
|**Version**|                     **Model**                    | **Params (M)** |      **Pretrained**      |         **Accuracy**         | **Precision** | **Recall** | **F1** | **Precision** |     **Recall**    |        **F1**       |
|1.13.0| Conformer-Transformer-Large | 127            | NeMo ASR-Set 3.0         |                        90.14 |         78.95 |      74.93 |  76.89 |        84.31 |             80.33 |               82.27 |
|Baseline| Conformer-Transformer-Large               | 127            | None                     |                        72.56 |         43.19 |       43.5 |  43.34 |        53.59 |             53.92 |               53.76 |

Note: during inference, we use a beam size of 32 and a temperature of 1.25.


## How to Use this Model

The model is available for use in the NeMo toolkit [4], and can be used on another dataset with the same annotation format.

### Automatically load the model from NGC

```python
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.SLUIntentSlotBPEModel.from_pretrained(model_name="slu_conformer_transformer_large_slurp")
```
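
Assuming the model follows the usual NeMo `transcribe()` interface (as other NeMo ASR models do; this is an assumption, and the audio path below is a placeholder), a quick usage sketch might look like:

```python
import nemo.collections.asr as nemo_asr

# Load the pretrained SLU model from NGC.
slu_model = nemo_asr.models.SLUIntentSlotBPEModel.from_pretrained(
    model_name="slu_conformer_transformer_large_slurp"
)

# Sketch only: assumes the standard NeMo transcribe() interface;
# "sample_16khz_mono.wav" is a placeholder path.
predictions = slu_model.transcribe(["sample_16khz_mono.wav"])
print(predictions[0])  # flattened semantics string (intent + slots)
```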

### Predict intents and slots with this model

```shell
python [NEMO_GIT_FOLDER]/examples/slu/speech_intent_slot/eval_utils/inference.py \
 pretrained_name="slu_conformer_transformer_large_slurp" \
 audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
 sequence_generator.type="<'beam' OR 'greedy' FOR BEAM/GREEDY SEARCH>" \
 sequence_generator.beam_size="<SIZE OF BEAM>" \
 sequence_generator.temperature="<TEMPERATURE FOR BEAM SEARCH>"
```

### Input

This model accepts 16 kHz single-channel (mono) audio in WAV format as input.
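
If your recordings are not already 16 kHz mono, a small preprocessing step can convert them. The sketch below uses the `librosa` and `soundfile` packages as example tools (an assumption, not a requirement stated by this card), and the file paths are placeholders:

```python
# Sketch: convert an arbitrary recording to 16 kHz mono WAV before inference.
# librosa/soundfile are example tools; paths are placeholders.
import librosa
import soundfile as sf

audio, sr = librosa.load("input_recording.wav", sr=16000, mono=True)
sf.write("sample_16khz_mono.wav", audio, 16000)
```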

### Output

This model provides the intent and slot annotations as a string for a given audio sample.
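
If the returned annotation string is a Python-literal dictionary in the SLURP style (an assumption about the exact formatting; the prediction below is a hypothetical example), it can be parsed back into structured intent and slot fields:

```python
import ast

# Sketch: parse a SLURP-style annotation string back into structured fields.
# The prediction string below is a hypothetical example.
prediction = "{'scenario': 'alarm', 'action': 'set_alarm', 'entities': [{'type': 'time', 'filler': 'seven am'}]}"

semantics = ast.literal_eval(prediction)
intent = f"{semantics['scenario']}_{semantics['action']}"  # scenario_action intent
slots = semantics["entities"]
print(intent, slots)
```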

## Limitations

Since this model was trained only on the SLURP dataset [1], its performance may degrade on other datasets.


## References


[1] [SLURP: A Spoken Language Understanding Resource Package](https://arxiv.org/abs/2011.13205)

[2] [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100)

[3] [Attention Is All You Need](https://arxiv.org/abs/1706.03762?context=cs)

[4] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)