# End-to-end NLU
End-to-end spoken language understanding (SLU) predicts intent directly from audio with a single model. It promises to improve the performance of assistant systems by leveraging acoustic information that is lost in the intermediate textual representation and by preventing cascading errors from automatic speech recognition (ASR). Further, a single unified model has efficiency advantages when deploying assistant systems on-device.
This page provides the code for reproducing the results in *STOP: A Dataset for Spoken Task Oriented Semantic Parsing*.
The dataset can be downloaded here: download link
The low-resource splits can be downloaded here: download link
## Pretrained Models

### End-to-end NLU Models
| Speech Pretraining | ASR Pretraining | Test EM Accuracy | Test EM-Tree Accuracy | Link |
|---|---|---|---|---|
| None | None | 36.54 | 57.01 | link |
| Wav2Vec | None | 68.05 | 82.53 | link |
| HuBERT | None | 68.40 | 82.85 | link |
| Wav2Vec | STOP | 68.70 | 82.78 | link |
| HuBERT | STOP | 69.23 | 82.87 | link |
| Wav2Vec | Librispeech | 68.47 | 82.49 | link |
| HuBERT | Librispeech | 68.70 | 82.78 | link |
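EM accuracy above is exact match: the fraction of utterances whose predicted semantic parse is identical to the reference (EM-Tree is a more lenient variant that scores the parse-tree structure). A minimal sketch of the plain EM metric, as a hypothetical helper rather than the repo's evaluation code:

```python
def em_accuracy(predictions, references):
    """Exact-match accuracy: share of predicted parses identical to the reference."""
    assert len(predictions) == len(references)
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)

# Toy TOP-style parses for illustration (not real dataset examples)
preds = ["[IN:GET_WEATHER [SL:LOCATION Boston ] ]", "[IN:PLAY_MUSIC ]"]
refs  = ["[IN:GET_WEATHER [SL:LOCATION Boston ] ]", "[IN:CREATE_ALARM ]"]
print(em_accuracy(preds, refs))  # → 0.5
```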
### ASR Models
| Speech Pretraining | ASR Dataset | dev_other WER | dev_clean WER | test_clean WER | test_other WER | STOP Eval WER | STOP Test WER | Link |
|---|---|---|---|---|---|---|---|---|
| HuBERT | Librispeech | 8.47 | 2.99 | 3.25 | 8.06 | 25.68 | 26.19 | link |
| Wav2Vec | Librispeech | 9.215 | 3.204 | 3.334 | 9.006 | 27.257 | 27.588 | link |
| HuBERT | STOP | 46.31 | 31.30 | 31.52 | 47.16 | 4.29 | 4.26 | link |
| Wav2Vec | STOP | 43.103 | 27.833 | 28.479 | 28.479 | 4.679 | 4.667 | link |
| HuBERT | Librispeech + STOP | 9.015 | 3.211 | 3.372 | 8.635 | 5.133 | 5.056 | link |
| Wav2Vec | Librispeech + STOP | 9.549 | 3.537 | 3.625 | 9.514 | 5.59 | 5.562 | link |
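Word error rate (WER) is the word-level Levenshtein distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal self-contained sketch of the metric:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    r, h = reference.split(), hypothesis.split()
    # Single-row Levenshtein DP over words
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,           # deletion
                       d[j - 1] + 1,       # insertion
                       prev + (rw != hw))  # substitution (free if words match)
            prev = cur
    return d[-1] / len(r)

print(wer("set an alarm for six", "set the alarm for six"))  # → 0.2
```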
## Creating the fairseq datasets from STOP
First, create the audio file manifests and label files:

```shell
python examples/audio_nlp/nlu/generate_manifests.py --stop_root $STOP_DOWNLOAD_DIR/stop --output $FAIRSEQ_DATASET_OUTPUT/
```

Then generate the fairseq dictionaries:

```shell
./examples/audio_nlp/nlu/create_dict_stop.sh $FAIRSEQ_DATASET_OUTPUT
```
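The generated audio manifests follow fairseq's wav2vec TSV convention, assuming that format here: the audio root directory on the first line, then one `relative_path<TAB>num_frames` row per clip. If you ever need to build such a manifest by hand for your own audio, a minimal sketch (paths and frame counts below are hypothetical):

```python
from pathlib import Path

def write_manifest(audio_root, entries, out_path):
    """Write a fairseq wav2vec-style TSV manifest: the audio root directory
    on line 1, then one `relative_path\tnum_frames` row per audio clip."""
    lines = [str(audio_root)]
    lines.extend(f"{rel_path}\t{num_frames}" for rel_path, num_frames in entries)
    Path(out_path).write_text("\n".join(lines) + "\n")

# Illustrative only: one 3-second clip at 16 kHz
write_manifest("/data/stop/train", [("audio_00001.wav", 48000)], "train.tsv")
```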
## Training an End-to-end NLU Model
Download a pretrained wav2vec or HuBERT model from link or link, then fine-tune it on STOP:

```shell
python fairseq_cli/hydra_train.py --config-dir examples/audio_nlp/nlu/configs/ --config-name nlu_finetuning task.data=$FAIRSEQ_DATASET_OUTPUT model.w2v_path=$PRETRAINED_MODEL_PATH
```