# End-to-end NLU
End-to-end spoken language understanding (SLU) predicts intent directly from audio with a single model. It promises to improve the performance of assistant systems by leveraging acoustic information that is lost in the intermediate textual representation and by preventing cascading errors from automatic speech recognition (ASR). Further, a single unified model has efficiency advantages when deploying assistant systems on-device.
This page provides the code for reproducing the results in *STOP: A Dataset for Spoken Task Oriented Semantic Parsing*.
The dataset can be downloaded here: download link
The low-resource splits can be downloaded here: download link
## Pretrained Models

### End-to-end NLU Models
| Speech Pretraining | ASR Pretraining | Test EM Accuracy | Test EM-Tree Accuracy | Link |
|---|---|---|---|---|
| None | None | 36.54 | 57.01 | link |
| Wav2Vec | None | 68.05 | 82.53 | link |
| HuBERT | None | 68.40 | 82.85 | link |
| Wav2Vec | STOP | 68.70 | 82.78 | link |
| HuBERT | STOP | 69.23 | 82.87 | link |
| Wav2Vec | Librispeech | 68.47 | 82.49 | link |
| HuBERT | Librispeech | 68.70 | 82.78 | link |
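EM accuracy above is exact match: the fraction of utterances whose predicted semantic parse is identical to the reference (EM-Tree is a more lenient variant that scores the parse-tree structure). A minimal sketch of the plain EM metric, as a hypothetical helper rather than the repo's evaluation code:

```python
def em_accuracy(predictions, references):
    """Exact-match accuracy: share of predicted parses identical to the reference."""
    assert len(predictions) == len(references)
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)

# Toy TOP-style parses for illustration (not real dataset examples)
preds = ["[IN:GET_WEATHER [SL:LOCATION Boston ] ]", "[IN:PLAY_MUSIC ]"]
refs  = ["[IN:GET_WEATHER [SL:LOCATION Boston ] ]", "[IN:CREATE_ALARM ]"]
print(em_accuracy(preds, refs))  # → 0.5
```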
### ASR Models
| Speech Pretraining | ASR Dataset | dev_other WER | dev_clean WER | test_clean WER | test_other WER | STOP Eval WER | STOP Test WER | Link |
|---|---|---|---|---|---|---|---|---|
| HuBERT | Librispeech | 8.47 | 2.99 | 3.25 | 8.06 | 25.68 | 26.19 | link |
| Wav2Vec | Librispeech | 9.215 | 3.204 | 3.334 | 9.006 | 27.257 | 27.588 | link |
| HuBERT | STOP | 46.31 | 31.30 | 31.52 | 47.16 | 4.29 | 4.26 | link |
| Wav2Vec | STOP | 43.103 | 27.833 | 28.479 | 28.479 | 4.679 | 4.667 | link |
| HuBERT | Librispeech + STOP | 9.015 | 3.211 | 3.372 | 8.635 | 5.133 | 5.056 | link |
| Wav2Vec | Librispeech + STOP | 9.549 | 3.537 | 3.625 | 9.514 | 5.59 | 5.562 | link |
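Word error rate (WER) is the word-level Levenshtein distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal self-contained sketch of the metric:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    r, h = reference.split(), hypothesis.split()
    # Single-row Levenshtein DP over words
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,           # deletion
                       d[j - 1] + 1,       # insertion
                       prev + (rw != hw))  # substitution (free if words match)
            prev = cur
    return d[-1] / len(r)

print(wer("set an alarm for six", "set the alarm for six"))  # → 0.2
```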
## Creating the fairseq datasets from STOP
First, create the audio file manifests and label files:

```shell
python examples/audio_nlp/nlu/generate_manifests.py --stop_root $STOP_DOWNLOAD_DIR/stop --output $FAIRSEQ_DATASET_OUTPUT/
```

Then generate the fairseq dictionaries:

```shell
./examples/audio_nlp/nlu/create_dict_stop.sh $FAIRSEQ_DATASET_OUTPUT
```
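The generated audio manifests follow fairseq's wav2vec TSV convention, assuming that format here: the audio root directory on the first line, then one `relative_path<TAB>num_frames` row per clip. If you ever need to build such a manifest by hand for your own audio, a minimal sketch (paths and frame counts below are hypothetical):

```python
from pathlib import Path

def write_manifest(audio_root, entries, out_path):
    """Write a fairseq wav2vec-style TSV manifest: the audio root directory
    on line 1, then one `relative_path\tnum_frames` row per audio clip."""
    lines = [str(audio_root)]
    lines.extend(f"{rel_path}\t{num_frames}" for rel_path, num_frames in entries)
    Path(out_path).write_text("\n".join(lines) + "\n")

# Illustrative only: one 3-second clip at 16 kHz
write_manifest("/data/stop/train", [("audio_00001.wav", 48000)], "train.tsv")
```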
## Training an End-to-end NLU Model
Download a pretrained wav2vec or HuBERT model from link or link, then fine-tune it on STOP:

```shell
python fairseq_cli/hydra_train.py --config-dir examples/audio_nlp/nlu/configs/ --config-name nlu_finetuning task.data=$FAIRSEQ_DATASET_OUTPUT model.w2v_path=$PRETRAINED_MODEL_PATH
```