---
language:
- en
library_name: nemo
datasets:
- SLURP
thumbnail: null
tags:
- spoken-language-understanding
- speech-intent-classification
- speech-slot-filling
- SLURP
- Conformer
- Transformer
- pytorch
- NeMo
license: cc-by-4.0
model-index:
- name: slu_conformer_transformer_large_slurp
  results:
  - task:
      name: Spoken Language Understanding
      type: spoken-language-understanding
    dataset:
      name: SLURP
      type: spoken-language-understanding
      split: test
    metrics:
    - name: Intent Accuracy
      type: acc
      value: 90.14
    - name: SLURP Precision
      type: precision
      value: 84.31
    - name: SLURP Recall
      type: recall
      value: 80.33
    - name: SLURP F1
      type: f1
      value: 82.27
---
# NeMo End-to-End Speech Intent Classification and Slot Filling

## Model Overview

This model performs joint intent classification and slot filling directly from audio input. It treats the problem as an audio-to-text task, where the output text is the flattened string representation of the semantic annotation. The model is trained on the SLURP dataset [1].
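As an illustration of the audio-to-text formulation, a structured semantics annotation and its flattened target string might look like the following (the field names and serialization here are assumptions for illustration, not NeMo's exact SLURP target format):

```python
# Illustrative only: field names and serialization are assumed, not
# necessarily NeMo's exact SLURP target format.
semantics = {
    "scenario": "alarm",
    "action": "set",
    "entities": [{"type": "time", "filler": "seven am"}],
}

# The decoder is trained to emit the whole annotation as one flat string.
flattened = str(semantics)
print(flattened)
```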

## Model Architecture

The model has an encoder-decoder architecture, where the encoder is a Conformer-Large model [2] and the decoder is a three-layer Transformer decoder [3]. We use a Conformer encoder pretrained on NeMo ASR-Set (details [here](https://ngc.nvidia.com/models/nvidia:nemo:stt_en_conformer_ctc_large)), while the decoder is trained from scratch. Start-of-sentence (BOS) and end-of-sentence (EOS) tokens are added to each sentence. The model is trained end-to-end by minimizing the negative log-likelihood loss with label smoothing and teacher forcing. During inference, the prediction is generated by beam search, where a BOS token triggers the generation process.
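The label-smoothed negative log-likelihood can be sketched as follows. This is a minimal standalone illustration of one common smoothing variant (uniform mass `smoothing / (V - 1)` over the non-target tokens), not NeMo's implementation:

```python
import math

def smoothed_nll(log_probs, target, smoothing=0.1):
    """Label-smoothed NLL for a single token: mix the one-hot target
    distribution with a uniform distribution over the other tokens."""
    vocab = len(log_probs)
    confidence = 1.0 - smoothing       # weight on the true token
    uniform = smoothing / (vocab - 1)  # weight spread over the rest
    loss = 0.0
    for i, lp in enumerate(log_probs):
        weight = confidence if i == target else uniform
        loss -= weight * lp
    return loss

# Toy next-token distribution over a 4-token vocabulary.
log_probs = [math.log(p) for p in [0.7, 0.1, 0.1, 0.1]]
print(round(smoothed_nll(log_probs, target=0), 4))  # → 0.5513
```

With `smoothing=0`, this reduces to the ordinary negative log-likelihood of the target token.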

## Training

The NeMo toolkit [4] was used to train the model for around 100 epochs, with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/slu/slurp/run_slurp_train.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/slu/slurp/configs/conformer_transformer_large_bpe.yaml).

The tokenizer for this model was built from the semantic annotations of the training set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py). We use a vocabulary size of 58, including the BOS, EOS, and padding tokens.

### Datasets

The model is trained on the combined real and synthetic training sets of the SLURP dataset.

## Performance

| **Version** | **Model** | **Params (M)** | **Pretrained** | **Intent Accuracy** | **Entity Precision** | **Entity Recall** | **Entity F1** | **SLURP Precision** | **SLURP Recall** | **SLURP F1** |
|---|---|---|---|---|---|---|---|---|---|---|
| 1.13.0 | Conformer-Transformer-Large | 127 | NeMo ASR-Set 3.0 | 90.14 | 78.95 | 74.93 | 76.89 | 84.31 | 80.33 | 82.27 |
| Baseline | Conformer-Transformer-Large | 127 | None | 72.56 | 43.19 | 43.50 | 43.34 | 53.59 | 53.92 | 53.76 |

Note: intent accuracy is over the joint `scenario_action` label. During inference, we use a beam size of 32 and a temperature of 1.25.
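The entity metrics above compare predicted entities against the gold annotation. A minimal sketch of a micro-averaged precision/recall/F1 over exact `(type, filler)` matches follows; this is our own simplification for illustration, not the official SLURP scorer:

```python
from collections import Counter

def entity_prf(predicted, gold):
    """Micro-averaged precision/recall/F1 over exact (type, filler) pairs.
    A simplified illustration, not the official SLURP evaluation."""
    pred_counts = Counter((e["type"], e["filler"]) for e in predicted)
    gold_counts = Counter((e["type"], e["filler"]) for e in gold)
    tp = sum((pred_counts & gold_counts).values())  # multiset intersection
    precision = tp / max(sum(pred_counts.values()), 1)
    recall = tp / max(sum(gold_counts.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1

pred = [{"type": "time", "filler": "seven am"}]
gold = [{"type": "time", "filler": "seven am"},
        {"type": "date", "filler": "tomorrow"}]
print(entity_prf(pred, gold))  # precision 1.0, recall 0.5
```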

## How to Use this Model

The model is available for use in the NeMo toolkit [4], and can be used on another dataset with the same annotation format.

### Automatically load the model from NGC

```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.SLUIntentSlotBPEModel.from_pretrained(model_name="slu_conformer_transformer_large_slurp")
```

### Predict intents and slots with this model

```shell
python [NEMO_GIT_FOLDER]/examples/slu/speech_intent_slot/eval_utils/inference.py \
    pretrained_name="slu_conformer_transformer_large_slurp" \
    audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
    sequence_generator.type="<'beam' OR 'greedy' FOR BEAM/GREEDY SEARCH>" \
    sequence_generator.beam_size="<SIZE OF BEAM>" \
    sequence_generator.temperature="<TEMPERATURE FOR BEAM SEARCH>"
```

### Input

This model accepts 16 kHz mono-channel audio (WAV files) as input.

### Output

This model outputs the intent and slot annotations as a string for a given audio sample.
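If the output string follows the flattened, Python-dict-like training-target format described above (an assumption for illustration; verify against the model's actual output), it can be parsed back into a structured object:

```python
import ast

# Hypothetical output string; the real serialization may differ.
prediction = ("{'scenario': 'alarm', 'action': 'set', "
              "'entities': [{'type': 'time', 'filler': 'seven am'}]}")

parsed = ast.literal_eval(prediction)  # safe literal parsing, no eval()
intent = f"{parsed['scenario']}_{parsed['action']}"  # joint intent label
slots = parsed["entities"]  # list of {type, filler} slot entries
print(intent, slots)
```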

## Limitations

Since this model was trained only on the SLURP dataset [1], its performance might degrade on other datasets.
109
+
110
+
111
+ ## References
112
+
113
+
114
+ [1] [SLURP: A Spoken Language Understanding Resource Package](https://arxiv.org/abs/2011.13205)
115
+
116
+ [2] [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100)
117
+
118
+ [3] [Attention Is All You Need](https://arxiv.org/abs/1706.03762?context=cs)
119
+
120
+ [4] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

## License

License to use this model is covered by the NGC [Terms of Use](https://ngc.nvidia.com/legal/terms) unless another license, terms of use, or EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC [Terms of Use](https://ngc.nvidia.com/legal/terms).