# Simultaneous Speech Translation (SimulST) on MuST-C

This tutorial covers training and evaluating a Transformer *wait-k* simultaneous translation model on the MuST-C English-German dataset, following [SimulMT to SimulST: Adapting Simultaneous Text Translation to End-to-End Simultaneous Speech Translation](https://www.aclweb.org/anthology/2020.aacl-main.58.pdf).

[MuST-C](https://www.aclweb.org/anthology/N19-1202) is a multilingual speech-to-text translation corpus with translations of English TED talks into 8 languages.

## Data Preparation
This section describes the data preparation for training and evaluation.
If you only want to evaluate the model, skip ahead to [Inference & Evaluation](#inference--evaluation).

[Download](https://ict.fbk.eu/must-c) and unpack MuST-C data to a path
`${MUSTC_ROOT}/en-${TARGET_LANG_ID}`, then preprocess it with
```bash
# Additional Python packages for S2T data processing/model training
pip install pandas torchaudio sentencepiece

# Generate TSV manifests, features, vocabulary,
# global cepstral mean and variance (CMVN) statistics,
# and configuration for each language
cd fairseq

python examples/speech_to_text/prep_mustc_data.py \
  --data-root ${MUSTC_ROOT} --task asr \
  --vocab-type unigram --vocab-size 10000 \
  --cmvn-type global

python examples/speech_to_text/prep_mustc_data.py \
  --data-root ${MUSTC_ROOT} --task st \
  --vocab-type unigram --vocab-size 10000 \
  --cmvn-type global
```
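After preprocessing, `${MUSTC_ROOT}/en-de` should contain the generated manifests, vocabularies, features and configs. The listing below is only indicative; exact filenames may differ between fairseq versions:
```bash
ls ${MUSTC_ROOT}/en-de
# config_asr.yaml   config_st.yaml
# spm_unigram10000_asr.model  spm_unigram10000_asr.txt  spm_unigram10000_asr.vocab
# spm_unigram10000_st.model   spm_unigram10000_st.txt   spm_unigram10000_st.vocab
# gcmvn.npz  fbank80.zip
# train_asr.tsv  dev_asr.tsv  tst-COMMON_asr.tsv  tst-HE_asr.tsv
# train_st.tsv   dev_st.tsv   tst-COMMON_st.tsv   tst-HE_st.tsv
```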

## ASR Pretraining
We first need a pretrained offline ASR model; let `${ASR_SAVE_DIR}` be its save directory.
The following command (like the subsequent training commands in this tutorial) assumes training on 1 GPU; if you train on 8 GPUs, remove the `--update-freq 8` option.
```bash
fairseq-train ${MUSTC_ROOT}/en-de \
  --config-yaml config_asr.yaml --train-subset train_asr --valid-subset dev_asr \
  --save-dir ${ASR_SAVE_DIR} --num-workers 4 --max-tokens 40000 --max-update 100000 \
  --task speech_to_text --criterion label_smoothed_cross_entropy --report-accuracy \
  --arch convtransformer_espnet --optimizer adam --lr 0.0005 --lr-scheduler inverse_sqrt \
  --warmup-updates 10000 --clip-norm 10.0 --seed 1 --update-freq 8
```
A pretrained ASR checkpoint can be downloaded [here](https://dl.fbaipublicfiles.com/simultaneous_translation/must_c_v1_en_de_pretrained_asr).
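If you would rather skip ASR pretraining, a minimal way to use the downloaded checkpoint is to save it under the filename that the ST training command below expects (the target name `checkpoint_best.pt` is just the convention used in this tutorial):
```bash
# Download the pretrained offline ASR checkpoint into ${ASR_SAVE_DIR}
mkdir -p ${ASR_SAVE_DIR}
wget https://dl.fbaipublicfiles.com/simultaneous_translation/must_c_v1_en_de_pretrained_asr \
  -O ${ASR_SAVE_DIR}/checkpoint_best.pt
```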

## Simultaneous Speech Translation Training

### Wait-K with fixed pre-decision module
A fixed pre-decision module means that the model applies its simultaneous policy at the boundaries of fixed-size chunks.
Here is an example with a fixed pre-decision ratio of 7 (a simultaneous read/write decision is made every 7 encoder states) and
a wait-3 policy. Assuming the save directory is `${ST_SAVE_DIR}`:
```bash
fairseq-train ${MUSTC_ROOT}/en-de \
  --config-yaml config_st.yaml --train-subset train_st --valid-subset dev_st \
  --save-dir ${ST_SAVE_DIR} --num-workers 8 \
  --optimizer adam --lr 0.0001 --lr-scheduler inverse_sqrt --clip-norm 10.0 \
  --criterion label_smoothed_cross_entropy \
  --warmup-updates 4000 --max-update 100000 --max-tokens 40000 --seed 2 \
  --load-pretrained-encoder-from ${ASR_SAVE_DIR}/checkpoint_best.pt \
  --task speech_to_text \
  --arch convtransformer_simul_trans_espnet \
  --simul-type waitk_fixed_pre_decision \
  --waitk-lagging 3 \
  --fixed-pre-decision-ratio 7 \
  --update-freq 8
```
### Monotonic multihead attention with fixed pre-decision module
```bash
fairseq-train ${MUSTC_ROOT}/en-de \
  --config-yaml config_st.yaml --train-subset train_st --valid-subset dev_st \
  --save-dir ${ST_SAVE_DIR} --num-workers 8 \
  --optimizer adam --lr 0.0001 --lr-scheduler inverse_sqrt --clip-norm 10.0 \
  --warmup-updates 4000 --max-update 100000 --max-tokens 40000 --seed 2 \
  --load-pretrained-encoder-from ${ASR_SAVE_DIR}/checkpoint_best.pt \
  --task speech_to_text \
  --criterion latency_augmented_label_smoothed_cross_entropy \
  --latency-weight-avg 0.1 \
  --arch convtransformer_simul_trans_espnet \
  --simul-type infinite_lookback_fixed_pre_decision \
  --fixed-pre-decision-ratio 7 \
  --update-freq 8
```
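The evaluation command in the next section refers to a checkpoint `${CHECKPOINT_FILENAME}` under `${ST_SAVE_DIR}`. You can simply point it at `checkpoint_best.pt`, or average the last few checkpoints first as in the offline speech-to-text recipes; a sketch of the latter (assuming `${FAIRSEQ}` is the fairseq repository root):
```bash
CHECKPOINT_FILENAME=avg_last_10_checkpoint.pt
# Average the last 10 epoch checkpoints into a single file
python ${FAIRSEQ}/scripts/average_checkpoints.py \
  --inputs ${ST_SAVE_DIR} --num-epoch-checkpoints 10 \
  --output ${ST_SAVE_DIR}/${CHECKPOINT_FILENAME}
```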
## Inference & Evaluation
[SimulEval](https://github.com/facebookresearch/SimulEval) is used for evaluation.
Install it and run the evaluation as follows:

```bash
git clone https://github.com/facebookresearch/SimulEval.git
cd SimulEval
pip install -e .

simuleval \
    --agent ${FAIRSEQ}/examples/speech_to_text/simultaneous_translation/agents/fairseq_simul_st_agent.py \
    --source ${SRC_LIST_OF_AUDIO} \
    --target ${TGT_FILE} \
    --data-bin ${MUSTC_ROOT}/en-de \
    --config config_st.yaml \
    --model-path ${ST_SAVE_DIR}/${CHECKPOINT_FILENAME} \
    --output ${OUTPUT} \
    --scores
```

The source file `${SRC_LIST_OF_AUDIO}` is a list of paths to audio files, one per line. Assuming your audio files are stored at `/home/user/data`,
it should look like this:

```bash
/home/user/data/audio-1.wav
/home/user/data/audio-2.wav
```

Each line of the target file `${TGT_FILE}` is the reference translation for the corresponding audio file:
```bash
Translation_1
Translation_2
```
The evaluation runs on the original MuST-C segmentation.
The following command generates the wav list and text file for an evaluation set `${SPLIT}` (chosen from `dev`, `tst-COMMON` and `tst-HE`) in MuST-C and writes them to `${EVAL_DATA}`.
```bash
python ${FAIRSEQ}/examples/speech_to_text/seg_mustc_data.py \
  --data-root ${MUSTC_ROOT} --lang de \
  --split ${SPLIT} --task st \
  --output ${EVAL_DATA}
```
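Assuming the script writes a wav list named `${SPLIT}.wav_list` and a reference file named `${SPLIT}.de` into `${EVAL_DATA}` (check the actual filenames it produces), the evaluation command can then be pointed at them, for example:
```bash
simuleval \
    --agent ${FAIRSEQ}/examples/speech_to_text/simultaneous_translation/agents/fairseq_simul_st_agent.py \
    --source ${EVAL_DATA}/${SPLIT}.wav_list \
    --target ${EVAL_DATA}/${SPLIT}.de \
    --data-bin ${MUSTC_ROOT}/en-de \
    --config config_st.yaml \
    --model-path ${ST_SAVE_DIR}/${CHECKPOINT_FILENAME} \
    --output ${OUTPUT} \
    --scores
```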

The `--data-bin` and `--config` should be the same as in the previous section if you prepared the data from scratch.
If you only want to run evaluation, a prepared data directory can be downloaded [here](https://dl.fbaipublicfiles.com/simultaneous_translation/must_c_v1.0_en_de_databin.tgz). It contains
- `spm_unigram10000_st.model`: a sentencepiece model binary.
- `spm_unigram10000_st.txt`: the dictionary file generated from the sentencepiece model.
- `gcmvn.npz`: the global cepstral mean and variance statistics.
- `config_st.yaml`: the config yaml file, shown below.

If you use the downloaded data directory, you will need to set absolute paths for `sentencepiece_model` and `stats_npz_path` in `config_st.yaml` (see the snippet after the config).
```yaml
bpe_tokenizer:
  bpe: sentencepiece
  sentencepiece_model: ABS_PATH_TO_SENTENCEPIECE_MODEL
global_cmvn:
  stats_npz_path: ABS_PATH_TO_GCMVN_FILE
input_channels: 1
input_feat_per_channel: 80
sampling_alpha: 1.0
specaugment:
  freq_mask_F: 27
  freq_mask_N: 1
  time_mask_N: 1
  time_mask_T: 100
  time_mask_p: 1.0
  time_wrap_W: 0
transforms:
  '*':
  - global_cmvn
  _train:
  - global_cmvn
  - specaugment
vocab_filename: spm_unigram10000_st.txt
```
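One way to fill in the two absolute paths in the downloaded `config_st.yaml` is to rewrite the corresponding lines in place. This is just a sketch, with `${DATA_BIN}` standing for wherever you extracted the archive:
```bash
DATA_BIN=/absolute/path/to/must_c_v1.0_en_de_databin
# Point sentencepiece_model and stats_npz_path at the files shipped in the archive
sed -i "s|sentencepiece_model:.*|sentencepiece_model: ${DATA_BIN}/spm_unigram10000_st.model|" ${DATA_BIN}/config_st.yaml
sed -i "s|stats_npz_path:.*|stats_npz_path: ${DATA_BIN}/gcmvn.npz|" ${DATA_BIN}/config_st.yaml
```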

Note that once `--data-bin` is set, `--config` should be the base name of the config yaml, not its full path.

Set `--model-path` to the model checkpoint.
A pretrained checkpoint (a wait-5 model with a pre-decision of 280 ms) can be downloaded from [here](https://dl.fbaipublicfiles.com/simultaneous_translation/convtransformer_wait5_pre7).
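For an evaluation-only setup, both the prepared data directory and the checkpoint can be fetched directly; the local filenames below are arbitrary:
```bash
# Prepared data directory (used as --data-bin after extraction)
wget https://dl.fbaipublicfiles.com/simultaneous_translation/must_c_v1.0_en_de_databin.tgz
tar -xzvf must_c_v1.0_en_de_databin.tgz
# Pretrained wait-5 checkpoint (used as --model-path)
wget https://dl.fbaipublicfiles.com/simultaneous_translation/convtransformer_wait5_pre7 \
  -O convtransformer_wait5_pre7.pt
```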

The result of this model on `tst-COMMON` is:
```json
{
    "Quality": {
        "BLEU": 13.94974229366959
    },
    "Latency": {
        "AL": 1751.8031870037803,
        "AL_CA": 2338.5911762796536,
        "AP": 0.7931395378788959,
        "AP_CA": 0.9405103863210942,
        "DAL": 1987.7811616943081,
        "DAL_CA": 2425.2751560926167
    }
}
```

If the `--output ${OUTPUT}` option is used, the detailed log and scores will be stored under the `${OUTPUT}` directory.


Quality is measured by detokenized BLEU, so make sure that the predicted words sent to the server are detokenized.

The latency metrics are
* Average Proportion
* Average Lagging
* Differentiable Average Lagging

Like BLEU, they are computed on detokenized text.