Spaces:

OFA-Sys
/

OFA-OCR

Runtime error

App Files Files Community

OFA-OCR / fairseq /examples /speech_to_text /docs /simulst_mustc_example.md

JustinLin610

first commit

ee21b96 over 1 year ago

preview code

raw

history blame

No virus

7.77 kB

	# Simultaneous Speech Translation (SimulST) on MuST-C

	This is a tutorial of training and evaluating a transformer wait-k simultaneous model on MUST-C English-Germen Dataset, from [SimulMT to SimulST: Adapting Simultaneous Text Translation to End-to-End Simultaneous Speech Translation](https://www.aclweb.org/anthology/2020.aacl-main.58.pdf).

	[MuST-C](https://www.aclweb.org/anthology/N19-1202) is multilingual speech-to-text translation corpus with 8-language translations on English TED talks.

	## Data Preparation
	This section introduces the data preparation for training and evaluation.
	If you only want to evaluate the model, please jump to [Inference & Evaluation](#inference--evaluation)

	[Download](https://ict.fbk.eu/must-c) and unpack MuST-C data to a path
	`${MUSTC_ROOT}/en-${TARGET_LANG_ID}`, then preprocess it with
	```bash
	# Additional Python packages for S2T data processing/model training
	pip install pandas torchaudio sentencepiece

	# Generate TSV manifests, features, vocabulary,
	# global cepstral and mean estimation,
	# and configuration for each language
	cd fairseq

	python examples/speech_to_text/prep_mustc_data.py \
	--data-root ${MUSTC_ROOT} --task asr \
	--vocab-type unigram --vocab-size 10000 \
	--cmvn-type global

	python examples/speech_to_text/prep_mustc_data.py \
	--data-root ${MUSTC_ROOT} --task st \
	--vocab-type unigram --vocab-size 10000 \
	--cmvn-type global
	```

	## ASR Pretraining
	We need a pretrained offline ASR model. Assuming the save directory of the ASR model is `${ASR_SAVE_DIR}`.
	The following command (and the subsequent training commands in this tutorial) assume training on 1 GPU (you can also train on 8 GPUs and remove the `--update-freq 8` option).
	```
	fairseq-train ${MUSTC_ROOT}/en-de \
	--config-yaml config_asr.yaml --train-subset train_asr --valid-subset dev_asr \
	--save-dir ${ASR_SAVE_DIR} --num-workers 4 --max-tokens 40000 --max-update 100000 \
	--task speech_to_text --criterion label_smoothed_cross_entropy --report-accuracy \
	--arch convtransformer_espnet --optimizer adam --lr 0.0005 --lr-scheduler inverse_sqrt \
	--warmup-updates 10000 --clip-norm 10.0 --seed 1 --update-freq 8
	```
	A pretrained ASR checkpoint can be downloaded [here](https://dl.fbaipublicfiles.com/simultaneous_translation/must_c_v1_en_de_pretrained_asr)

	## Simultaneous Speech Translation Training

	### Wait-K with fixed pre-decision module
	Fixed pre-decision indicates that the model operate simultaneous policy on the boundaries of fixed chunks.
	Here is a example of fixed pre-decision ratio 7 (the simultaneous decision is made every 7 encoder states) and
	a wait-3 policy model. Assuming the save directory is `${ST_SAVE_DIR}`
	```bash
	fairseq-train ${MUSTC_ROOT}/en-de \
	--config-yaml config_st.yaml --train-subset train_st --valid-subset dev_st \
	--save-dir ${ST_SAVE_DIR} --num-workers 8 \
	--optimizer adam --lr 0.0001 --lr-scheduler inverse_sqrt --clip-norm 10.0 \
	--criterion label_smoothed_cross_entropy \
	--warmup-updates 4000 --max-update 100000 --max-tokens 40000 --seed 2 \
	--load-pretrained-encoder-from ${ASR_SAVE_DIR}/checkpoint_best.pt \
	--task speech_to_text \
	--arch convtransformer_simul_trans_espnet \
	--simul-type waitk_fixed_pre_decision \
	--waitk-lagging 3 \
	--fixed-pre-decision-ratio 7 \
	--update-freq 8

	```
	### Monotonic multihead attention with fixed pre-decision module
	```
	fairseq-train ${MUSTC_ROOT}/en-de \
	--config-yaml config_st.yaml --train-subset train_st --valid-subset dev_st \
	--save-dir ${ST_SAVE_DIR} --num-workers 8 \
	--optimizer adam --lr 0.0001 --lr-scheduler inverse_sqrt --clip-norm 10.0 \
	--warmup-updates 4000 --max-update 100000 --max-tokens 40000 --seed 2 \
	--load-pretrained-encoder-from ${ASR_SAVE_DIR}/${CHECKPOINT_FILENAME} \
	--task speech_to_text \
	--criterion latency_augmented_label_smoothed_cross_entropy \
	--latency-weight-avg 0.1 \
	--arch convtransformer_simul_trans_espnet \
	--simul-type infinite_lookback_fixed_pre_decision \
	--fixed-pre-decision-ratio 7 \
	--update-freq 8
	```
	## Inference & Evaluation
	[SimulEval](https://github.com/facebookresearch/SimulEval) is used for evaluation.
	The following command is for evaluation.

	```
	git clone https://github.com/facebookresearch/SimulEval.git
	cd SimulEval
	pip install -e .

	simuleval \
	--agent ${FAIRSEQ}/examples/speech_to_text/simultaneous_translation/agents/fairseq_simul_st_agent.py
	--source ${SRC_LIST_OF_AUDIO}
	--target ${TGT_FILE}
	--data-bin ${MUSTC_ROOT}/en-de \
	--config config_st.yaml \
	--model-path ${ST_SAVE_DIR}/${CHECKPOINT_FILENAME} \
	--output ${OUTPUT} \
	--scores
	```

	The source file `${SRC_LIST_OF_AUDIO}` is a list of paths of audio files. Assuming your audio files stored at `/home/user/data`,
	it should look like this

	```bash
	/home/user/data/audio-1.wav
	/home/user/data/audio-2.wav
	```

	Each line of target file `${TGT_FILE}` is the translation for each audio file input.
	```bash
	Translation_1
	Translation_2
	```
	The evaluation runs on the original MUSTC segmentation.
	The following command will generate the wav list and text file for a evaluation set `${SPLIT}` (chose from `dev`, `tst-COMMON` and `tst-HE`) in MUSTC to `${EVAL_DATA}`.
	```bash
	python ${FAIRSEQ}/examples/speech_to_text/seg_mustc_data.py \
	--data-root ${MUSTC_ROOT} --lang de \
	--split ${SPLIT} --task st \
	--output ${EVAL_DATA}
	```

	The `--data-bin` and `--config` should be the same in previous section if you prepare the data from the scratch.
	If only for evaluation, a prepared data directory can be found [here](https://dl.fbaipublicfiles.com/simultaneous_translation/must_c_v1.0_en_de_databin.tgz). It contains
	- `spm_unigram10000_st.model`: a sentencepiece model binary.
	- `spm_unigram10000_st.txt`: the dictionary file generated by the sentencepiece model.
	- `gcmvn.npz`: the binary for global cepstral mean and variance.
	- `config_st.yaml`: the config yaml file. It looks like this.
	You will need to set the absolute paths for `sentencepiece_model` and `stats_npz_path` if the data directory is downloaded.
	```yaml
	bpe_tokenizer:
	bpe: sentencepiece
	sentencepiece_model: ABS_PATH_TO_SENTENCEPIECE_MODEL
	global_cmvn:
	stats_npz_path: ABS_PATH_TO_GCMVN_FILE
	input_channels: 1
	input_feat_per_channel: 80
	sampling_alpha: 1.0
	specaugment:
	freq_mask_F: 27
	freq_mask_N: 1
	time_mask_N: 1
	time_mask_T: 100
	time_mask_p: 1.0
	time_wrap_W: 0
	transforms:
	'*':
	- global_cmvn
	_train:
	- global_cmvn
	- specaugment
	vocab_filename: spm_unigram10000_st.txt
	```

	Notice that once a `--data-bin` is set, the `--config` is the base name of the config yaml, not the full path.

	Set `--model-path` to the model checkpoint.
	A pretrained checkpoint can be downloaded from [here](https://dl.fbaipublicfiles.com/simultaneous_translation/convtransformer_wait5_pre7), which is a wait-5 model with a pre-decision of 280 ms.

	The result of this model on `tst-COMMON` is:
	```bash
	{
	"Quality": {
	"BLEU": 13.94974229366959
	},
	"Latency": {
	"AL": 1751.8031870037803,
	"AL_CA": 2338.5911762796536,
	"AP": 0.7931395378788959,
	"AP_CA": 0.9405103863210942,
	"DAL": 1987.7811616943081,
	"DAL_CA": 2425.2751560926167
	}
	}
	```

	If `--output ${OUTPUT}` option is used, the detailed log and scores will be stored under the `${OUTPUT}` directory.


	The quality is measured by detokenized BLEU. So make sure that the predicted words sent to the server are detokenized.

	The latency metrics are
	* Average Proportion
	* Average Lagging
	* Differentiable Average Lagging

	Again they will also be evaluated on detokenized text.