---
language:
  - en
library_name: nemo
datasets:
  - SLURP
thumbnail: null
tags:
  - spoken-language-understanding
  - speech-intent-classification
  - speech-slot-filling
  - SLURP
  - Conformer
  - Transformer
  - pytorch
  - NeMo
license: cc-by-4.0
model-index:
  - name: slu_conformer_transformer_large_slurp
    results:
      - task:
          name: Slot Filling
          type: slot-filling
        dataset:
          name: SLURP
          type: slurp
          split: test
        metrics:
          - name: F1
            type: f1
            value: 82.27
      - task:
          name: Intent Classification
          type: intent-classification
        dataset:
          name: SLURP
          type: slurp
          split: test
        metrics:
          - name: Accuracy
            type: acc
            value: 90.14
---

NeMo End-to-End Speech Intent Classification and Slot Filling

Model Overview

This model performs joint intent classification and slot filling directly from audio input. The model treats the task as an audio-to-text problem, where the output text is a flattened string representation of the semantic annotation. The model is trained on the SLURP dataset [1].
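
For illustration, an utterance like "wake me up at nine am on friday" carries SLURP-style semantics along these lines; the field names follow the SLURP annotation format, but this rendering is only a sketch of the structure the flattened decoder target represents:

# Illustrative SLURP-style semantics; the decoder target is a flattened
# string rendering of a structure like this.
{
    "scenario": "alarm",
    "action": "set",
    "entities": [
        {"type": "time", "filler": "nine am"},
        {"type": "date", "filler": "friday"},
    ],
}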

Model Architecture

The model has an encoder-decoder architecture, where the encoder is a Conformer-Large model [2] and the decoder is a three-layer Transformer decoder [3]. We use the Conformer encoder pretrained on NeMo ASR-Set (details here), while the decoder is trained from scratch. Start-of-sentence (BOS) and end-of-sentence (EOS) tokens are added to each sentence. The model is trained end-to-end by minimizing the negative log-likelihood loss with teacher forcing. During inference, the prediction is generated by beam search, where a BOS token triggers the generation process.
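
As a rough sketch of the objective (toy tensors and made-up token ids, not the NeMo implementation), teacher forcing feeds the BOS-shifted gold sequence to the decoder and minimizes the cross-entropy of the next token at each position:

import torch
import torch.nn.functional as F

# Hypothetical special-token ids and a toy target sequence.
vocab_size, bos_id, eos_id, pad_id = 58, 0, 1, 2
targets = torch.tensor([[17, 42, 23, eos_id]])               # gold token ids
decoder_input = torch.cat(                                   # BOS-shifted input fed to the decoder
    [torch.full((1, 1), bos_id, dtype=torch.long), targets[:, :-1]], dim=1)
logits = torch.randn(1, targets.size(1), vocab_size)         # stand-in for decoder output
loss = F.cross_entropy(logits.view(-1, vocab_size),          # negative log-likelihood
                       targets.view(-1), ignore_index=pad_id)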

Training

The NeMo toolkit [4] was used to train the models for around 100 epochs. These models are trained with this example script and this base config.

The tokenizers for these models were built using the semantic annotations of the training set with this script. We use a vocabulary size of 58, including the BOS, EOS, and padding tokens.

Details on how to train the model can be found here.
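
As a sketch only, a typical training launch with Hydra-style overrides might look like the following; the script name, config name, and manifest paths are assumptions based on the NeMo examples layout, not verbatim from this card:

python [NEMO_GIT_FOLDER]/examples/slu/speech_intent_slot/speech_intent_slot_train.py \
 --config-path=configs \
 --config-name=conformer_transformer_large_bpe \
 model.train_ds.manifest_filepath="<TRAIN MANIFEST>" \
 model.validation_ds.manifest_filepath="<VAL MANIFEST>" \
 trainer.max_epochs=100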

Datasets

The model is trained on the combined real and synthetic training sets of the SLURP dataset.

Performance

| Version | Model | Params (M) | Pretrained | Intent Accuracy (Scenario_Action) | Entity Precision | Entity Recall | Entity F1 | SLURP Precision | SLURP Recall | SLURP F1 |
|---------|-------|------------|------------|-----------------------------------|------------------|---------------|-----------|-----------------|--------------|----------|
| 1.13.0 | Conformer-Transformer-Large | 127 | NeMo ASR-Set 3.0 | 90.14 | 78.95 | 74.93 | 76.89 | 84.31 | 80.33 | 82.27 |
| Baseline | Conformer-Transformer-Large | 127 | None | 72.56 | 43.19 | 43.50 | 43.34 | 53.59 | 53.92 | 53.76 |

Note: during inference, we use a beam size of 32 and a temperature of 1.25.

How to Use this Model

The model is available for use in the NeMo toolkit [4] and can be used on another dataset with the same annotation format.

Automatically load the model from NGC

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.SLUIntentSlotBPEModel.from_pretrained(model_name="slu_conformer_transformer_large_slurp")
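
A minimal usage sketch follows; it assumes the generic NeMo transcribe() API applies to this model class, and the file name is hypothetical:

# Hedged sketch: "sample.wav" is a hypothetical 16 kHz mono file; the
# returned strings are the flattened intent/slot annotations.
predictions = asr_model.transcribe(["sample.wav"])
print(predictions[0])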

Predict intents and slots with this model

python [NEMO_GIT_FOLDER]/examples/slu/speech_intent_slot/eval_utils/inference.py \
 pretrained_name="slu_conformer_transformer_large_slurp" \
 audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
 sequence_generator.type="<'beam' OR 'greedy' FOR BEAM/GREEDY SEARCH>" \
 sequence_generator.beam_size="<SIZE OF BEAM>" \
 sequence_generator.temperature="<TEMPERATURE FOR BEAM SEARCH>"
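
For example, to reproduce the decoding settings used for the numbers reported above (beam search, beam size 32, temperature 1.25):

python [NEMO_GIT_FOLDER]/examples/slu/speech_intent_slot/eval_utils/inference.py \
 pretrained_name="slu_conformer_transformer_large_slurp" \
 audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
 sequence_generator.type="beam" \
 sequence_generator.beam_size=32 \
 sequence_generator.temperature=1.25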

Input

This model accepts 16000 Hz (16 kHz) mono-channel audio (WAV files) as input.
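
If your audio is not already in this format, a quick conversion sketch (librosa and soundfile are illustrative choices, not requirements of this model):

import librosa
import soundfile as sf

# Resample to 16 kHz mono and write back as WAV.
audio, sr = librosa.load("input.wav", sr=16000, mono=True)
sf.write("input_16k.wav", audio, sr)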

Output

This model provides the intent and slot annotations as a string for a given audio sample.
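
If the output string follows the Python-literal rendering of the SLURP semantics (an assumption; adjust to your annotation format), it can be parsed back into a structured form:

import ast

# Example output string; the exact serialization depends on the annotations.
prediction = "{'scenario': 'alarm', 'action': 'set', 'entities': [{'type': 'time', 'filler': 'nine am'}]}"
semantics = ast.literal_eval(prediction)
print(semantics["scenario"], semantics["action"])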

Limitations

Since this model was trained on only the SLURP dataset [1], the performance of this model might degrade on other datasets.

References

[1] SLURP: A Spoken Language Understanding Resource Package

[2] Conformer: Convolution-augmented Transformer for Speech Recognition

[3] Attention Is All You Need

[4] NVIDIA NeMo Toolkit