
Simultaneous Speech Translation (SimulST) on MuST-C

This is a tutorial on training and evaluating a transformer wait-k simultaneous model on the MuST-C English-German dataset, as described in SimulMT to SimulST: Adapting Simultaneous Text Translation to End-to-End Simultaneous Speech Translation.

MuST-C is a multilingual speech-to-text translation corpus with translations of English TED talks into 8 languages.

Data Preparation

This section describes data preparation for training and evaluation. If you only want to evaluate the model, skip to Inference & Evaluation.

Download and unpack MuST-C data to a path ${MUSTC_ROOT}/en-${TARGET_LANG_ID}, then preprocess it with

# Additional Python packages for S2T data processing/model training
pip install pandas torchaudio sentencepiece

# Generate TSV manifests, features, vocabulary,
# global cepstral mean and variance statistics,
# and configuration for each language
cd fairseq

python examples/speech_to_text/prep_mustc_data.py \
  --data-root ${MUSTC_ROOT} --task asr \
  --vocab-type unigram --vocab-size 10000 \
  --cmvn-type global

python examples/speech_to_text/prep_mustc_data.py \
  --data-root ${MUSTC_ROOT} --task st \
  --vocab-type unigram --vocab-size 10000 \
  --cmvn-type global
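Once preprocessing finishes, it can help to confirm that the files the later commands rely on are present. The sketch below is a hypothetical helper, not part of fairseq. The file names are inferred from the subset, config and vocabulary names used elsewhere in this tutorial (in particular, it assumes the train_st subset is stored as train_st.tsv), so adjust the list if your layout differs.

```python
from pathlib import Path

# Files this tutorial refers to for the ST task. The ".tsv" extension for the
# subsets is an assumption based on the "TSV manifests" mentioned above.
EXPECTED_ST_FILES = [
    "train_st.tsv",
    "dev_st.tsv",
    "config_st.yaml",
    "spm_unigram10000_st.model",
    "spm_unigram10000_st.txt",
    "gcmvn.npz",
]


def missing_st_files(lang_dir):
    """Return the expected ST files that are absent from lang_dir."""
    lang_dir = Path(lang_dir)
    return [name for name in EXPECTED_ST_FILES if not (lang_dir / name).is_file()]
```

Calling missing_st_files on ${MUSTC_ROOT}/en-de should return an empty list if everything was generated under the names assumed here.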

ASR Pretraining

We need a pretrained offline ASR model. Assume the save directory of the ASR model is ${ASR_SAVE_DIR}. The following command (and the subsequent training commands in this tutorial) assumes training on 1 GPU; if you train on 8 GPUs instead, remove the --update-freq 8 option.

fairseq-train ${MUSTC_ROOT}/en-de \
  --config-yaml config_asr.yaml --train-subset train_asr --valid-subset dev_asr \
  --save-dir ${ASR_SAVE_DIR} --num-workers 4 --max-tokens 40000 --max-update 100000 \
  --task speech_to_text --criterion label_smoothed_cross_entropy --report-accuracy \
  --arch convtransformer_espnet --optimizer adam --lr 0.0005 --lr-scheduler inverse_sqrt \
  --warmup-updates 10000 --clip-norm 10.0 --seed 1 --update-freq 8
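The --update-freq option accumulates gradients over several batches before each optimizer step, which is why 1 GPU with --update-freq 8 is treated as equivalent to 8 GPUs without it. A quick check of the arithmetic, using a hypothetical helper and treating --max-tokens as a per-GPU upper bound on batch size:

```python
def tokens_per_update(max_tokens, num_gpus, update_freq):
    """Upper bound on the tokens that contribute to one optimizer step."""
    return max_tokens * num_gpus * update_freq


# 1 GPU with gradient accumulation vs. 8 GPUs without it.
single_gpu = tokens_per_update(max_tokens=40000, num_gpus=1, update_freq=8)
eight_gpus = tokens_per_update(max_tokens=40000, num_gpus=8, update_freq=1)
print(single_gpu, eight_gpus)  # 320000 320000
```

For other GPU counts, choose --update-freq so that num_gpus times update_freq stays at 8 if you want to keep the same effective batch size.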

A pretrained ASR checkpoint can be downloaded here

Simultaneous Speech Translation Training

Wait-K with fixed pre-decision module

Fixed pre-decision means that the model applies its simultaneous policy only at the boundaries of fixed-size chunks. Here is an example with a fixed pre-decision ratio of 7 (a simultaneous decision is made every 7 encoder states) and a wait-3 policy. Assume the save directory is ${ST_SAVE_DIR}.

 fairseq-train ${MUSTC_ROOT}/en-de \
        --config-yaml config_st.yaml --train-subset train_st --valid-subset dev_st \
        --save-dir ${ST_SAVE_DIR} --num-workers 8  \
        --optimizer adam --lr 0.0001 --lr-scheduler inverse_sqrt --clip-norm 10.0 \
        --criterion label_smoothed_cross_entropy \
        --warmup-updates 4000 --max-update 100000 --max-tokens 40000 --seed 2 \
        --load-pretrained-encoder-from ${ASR_SAVE_DIR}/checkpoint_best.pt \
        --task speech_to_text  \
        --arch convtransformer_simul_trans_espnet  \
        --simul-type waitk_fixed_pre_decision  \
        --waitk-lagging 3 \
        --fixed-pre-decision-ratio 7 \
        --update-freq 8
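To make the interaction between wait-k and the fixed pre-decision ratio concrete, here is a minimal sketch of an idealized read/write schedule. It is not the fairseq implementation; the function name and arguments are hypothetical. It assumes that encoder states are grouped into chunks of `ratio` states, that the model reads k chunks before writing the first token, and that it then reads one more chunk per written token until the source is exhausted.

```python
import math


def waitk_fixed_predecision_schedule(num_encoder_states, num_target_tokens, k=3, ratio=7):
    """For each target token, return how many encoder states have been read
    at the moment that token is written (idealized wait-k over chunks)."""
    num_chunks = math.ceil(num_encoder_states / ratio)
    schedule = []
    for t in range(num_target_tokens):
        # Token t (0-indexed) needs k + t chunks, capped at the full source.
        chunks_needed = min(t + k, num_chunks)
        # The last chunk may be shorter than `ratio` states.
        schedule.append(min(chunks_needed * ratio, num_encoder_states))
    return schedule


# Wait-3 with ratio 7 on a 30-state source and 5 target tokens.
print(waitk_fixed_predecision_schedule(30, 5, k=3, ratio=7))  # [21, 28, 30, 30, 30]
```

With ratio 1 the schedule reduces to plain wait-k over individual encoder states.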

Monotonic multihead attention with fixed pre-decision module

 fairseq-train ${MUSTC_ROOT}/en-de \
        --config-yaml config_st.yaml --train-subset train_st --valid-subset dev_st \
        --save-dir ${ST_SAVE_DIR} --num-workers 8  \
        --optimizer adam --lr 0.0001 --lr-scheduler inverse_sqrt --clip-norm 10.0 \
        --warmup-updates 4000 --max-update 100000 --max-tokens 40000 --seed 2 \
        --load-pretrained-encoder-from ${ASR_SAVE_DIR}/${CHECKPOINT_FILENAME} \
        --task speech_to_text  \
        --criterion latency_augmented_label_smoothed_cross_entropy \
        --latency-weight-avg 0.1 \
        --arch convtransformer_simul_trans_espnet  \
        --simul-type infinite_lookback_fixed_pre_decision  \
        --fixed-pre-decision-ratio 7 \
        --update-freq 8

Inference & Evaluation

SimulEval is used for evaluation. Install it and run the evaluation with the following commands.

git clone https://github.com/facebookresearch/SimulEval.git
cd SimulEval
pip install -e .

simuleval \
    --agent ${FAIRSEQ}/examples/speech_to_text/simultaneous_translation/agents/fairseq_simul_st_agent.py \
    --source ${SRC_LIST_OF_AUDIO} \
    --target ${TGT_FILE} \
    --data-bin ${MUSTC_ROOT}/en-de \
    --config config_st.yaml \
    --model-path ${ST_SAVE_DIR}/${CHECKPOINT_FILENAME} \
    --output ${OUTPUT} \
    --scores

The source file ${SRC_LIST_OF_AUDIO} is a list of paths to audio files. Assuming your audio files are stored in /home/user/data, it should look like this:

/home/user/data/audio-1.wav
/home/user/data/audio-2.wav

Each line of the target file ${TGT_FILE} is the reference translation for the audio file on the corresponding line of the source list.

Translation_1
Translation_2
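Because the two files are aligned line by line, a mismatch in line counts would pair audio with the wrong reference. The following sketch (a hypothetical helper, not part of SimulEval or fairseq) loads both files and fails early if they are not parallel.

```python
from pathlib import Path


def load_eval_pairs(src_list_path, tgt_path):
    """Pair each audio path with its reference translation, line by line."""
    sources = Path(src_list_path).read_text(encoding="utf-8").splitlines()
    targets = Path(tgt_path).read_text(encoding="utf-8").splitlines()
    if len(sources) != len(targets):
        raise ValueError(
            f"{len(sources)} audio paths but {len(targets)} translations"
        )
    return list(zip(sources, targets))
```

Running it on ${SRC_LIST_OF_AUDIO} and ${TGT_FILE} before calling simuleval gives a cheap sanity check of the inputs.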

The evaluation runs on the original MuST-C segmentation. The following command generates the wav list and text file for an evaluation set ${SPLIT} (chosen from dev, tst-COMMON and tst-HE) in MuST-C and writes them to ${EVAL_DATA}.

python ${FAIRSEQ}/examples/speech_to_text/seg_mustc_data.py \
  --data-root ${MUSTC_ROOT} --lang de \
  --split ${SPLIT} --task st \
  --output ${EVAL_DATA}

The --data-bin and --config values should be the same as in the previous sections if you prepared the data from scratch. If you only want to run evaluation, a prepared data directory can be found here. It contains:

  • spm_unigram10000_st.model: a sentencepiece model binary.
  • spm_unigram10000_st.txt: the dictionary file generated by the sentencepiece model.
  • gcmvn.npz: the binary for global cepstral mean and variance.
  • config_st.yaml: the config yaml file. If you downloaded the data directory, you will need to set the absolute paths for sentencepiece_model and stats_npz_path. It looks like this:
bpe_tokenizer:
  bpe: sentencepiece
  sentencepiece_model: ABS_PATH_TO_SENTENCEPIECE_MODEL
global_cmvn:
  stats_npz_path: ABS_PATH_TO_GCMVN_FILE
input_channels: 1
input_feat_per_channel: 80
sampling_alpha: 1.0
specaugment:
  freq_mask_F: 27
  freq_mask_N: 1
  time_mask_N: 1
  time_mask_T: 100
  time_mask_p: 1.0
  time_wrap_W: 0
transforms:
  '*':
  - global_cmvn
  _train:
  - global_cmvn
  - specaugment
vocab_filename: spm_unigram10000_st.txt
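Setting the two absolute paths by hand is easy to get wrong, so a small script can help. The sketch below rewrites the sentencepiece_model and stats_npz_path entries line by line using only the standard library. The function name is hypothetical, and it assumes both files sit directly in the downloaded data directory under the names listed above.

```python
import re
from pathlib import Path


def set_config_paths(config_text, data_dir):
    """Return config_text with the two path entries pointing into data_dir."""
    data_dir = Path(data_dir).resolve()
    replacements = {
        "sentencepiece_model": data_dir / "spm_unigram10000_st.model",
        "stats_npz_path": data_dir / "gcmvn.npz",
    }
    out = []
    for line in config_text.splitlines():
        match = re.match(r"^(\s*)(\w+):", line)
        if match and match.group(2) in replacements:
            indent, key = match.group(1), match.group(2)
            line = f"{indent}{key}: {replacements[key]}"  # keep indentation
        out.append(line)
    return "\n".join(out) + "\n"
```

A typical use would read config_st.yaml, pass its text through set_config_paths together with the data directory, and write the result back to the same file.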

Note that when --data-bin is set, --config takes the base name of the config yaml file, not its full path.

Set --model-path to the model checkpoint. A pretrained checkpoint can be downloaded from here, which is a wait-5 model with a pre-decision of 280 ms.
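The 280 ms figure can be related to the pre-decision ratio with a little arithmetic. The sketch below assumes a 10 ms feature frame shift and a 4x temporal subsampling in the encoder, so that one encoder state covers about 40 ms. Both constants are assumptions not stated in this tutorial, and the helper names are hypothetical.

```python
FRAME_SHIFT_MS = 10  # assumed shift between consecutive feature frames
SUBSAMPLING = 4      # assumed encoder downsampling factor


def pre_decision_ms(ratio):
    """Duration of speech covered by one pre-decision chunk."""
    return ratio * SUBSAMPLING * FRAME_SHIFT_MS


def initial_wait_ms(k, ratio):
    """Idealized speech duration read before the first token is written."""
    return k * pre_decision_ms(ratio)


print(pre_decision_ms(7))     # 280
print(initial_wait_ms(5, 7))  # 1400
```

Under these assumptions, a pre-decision of 280 ms corresponds to a fixed pre-decision ratio of 7, and a wait-5 policy reads roughly 1.4 seconds of speech before emitting its first token.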

The result of this model on tst-COMMON is:

{
    "Quality": {
        "BLEU": 13.94974229366959
    },
    "Latency": {
        "AL": 1751.8031870037803,
        "AL_CA": 2338.5911762796536,
        "AP": 0.7931395378788959,
        "AP_CA": 0.9405103863210942,
        "DAL": 1987.7811616943081,
        "DAL_CA": 2425.2751560926167
    }
}

If the --output ${OUTPUT} option is used, the detailed log and scores are stored under the ${OUTPUT} directory.

Quality is measured by detokenized BLEU, so make sure that the predicted words sent to the server are detokenized.

The latency metrics are

  • Average Proportion
  • Average Lagging
  • Differentiable Average Lagging

Like BLEU, the latency metrics are computed on detokenized text.
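For intuition about the first two metrics, here is a minimal sketch of Average Proportion and Average Lagging following their commonly cited text-translation definitions. It is not SimulEval's code, and SimulEval's speech variants may differ in detail (for example, delays are measured in milliseconds of source audio). Here `delays[t]` is the amount of source that had been read when target token t was written, and `src_len` is the total source length in the same unit.

```python
def average_proportion(delays, src_len):
    """Mean fraction of the source read when each target token is written."""
    return sum(delays) / (src_len * len(delays))


def average_lagging(delays, src_len):
    """Average lag behind an ideal policy that translates at a steady rate,
    computed up to the first token written after the full source was read."""
    gamma = len(delays) / src_len  # target-to-source length ratio
    total = 0.0
    tau = 0
    for t, delay in enumerate(delays):
        total += delay - t / gamma  # t is 0-indexed, i.e. (t - 1) in 1-indexed form
        tau = t + 1
        if delay >= src_len:
            break
    return total / tau


# A wait-3 policy on a 6-token source with a 6-token translation.
delays = [3, 4, 5, 6, 6, 6]
print(average_lagging(delays, 6))               # 3.0
print(round(average_proportion(delays, 6), 3))  # 0.833
```

In this toy case the Average Lagging equals k, which matches the intuition that a wait-k policy stays about k source units behind the speaker.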