# Understanding Back-Translation at Scale (Edunov et al., 2018)

This page includes pre-trained models from the paper [Understanding Back-Translation at Scale (Edunov et al., 2018)](https://arxiv.org/abs/1808.09381).

## Pre-trained models

Model | Description | Dataset | Download
---|---|---|---
`transformer.wmt18.en-de` | Transformer <br> ([Edunov et al., 2018](https://arxiv.org/abs/1808.09381)) <br> WMT'18 winner | [WMT'18 English-German](http://www.statmt.org/wmt18/translation-task.html) | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt18.en-de.ensemble.tar.gz) <br> See NOTE in the archive
## Example usage (torch.hub)

We require a few additional Python dependencies for preprocessing:
```bash
pip install subword_nmt sacremoses
```

Then to generate translations from the full model ensemble:
```python
import torch

# List available models
torch.hub.list('pytorch/fairseq')  # [..., 'transformer.wmt18.en-de', ... ]

# Load the WMT'18 En-De ensemble
en2de_ensemble = torch.hub.load(
    'pytorch/fairseq', 'transformer.wmt18.en-de',
    checkpoint_file='wmt18.model1.pt:wmt18.model2.pt:wmt18.model3.pt:wmt18.model4.pt:wmt18.model5.pt',
    tokenizer='moses', bpe='subword_nmt')

# The ensemble contains 5 models
len(en2de_ensemble.models)
# 5

# Translate
en2de_ensemble.translate('Hello world!')
# 'Hallo Welt!'
```

## Training your own model (WMT'18 English-German)

The following instructions can be adapted to reproduce the models from the paper.

#### Step 1. Prepare parallel data and optionally train a baseline (English-German) model

First download and preprocess the data:
```bash
# Download and prepare the data
cd examples/backtranslation/
bash prepare-wmt18en2de.sh
cd ../..

# Binarize the data
TEXT=examples/backtranslation/wmt18_en_de
fairseq-preprocess \
    --joined-dictionary \
    --source-lang en --target-lang de \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/wmt18_en_de --thresholdtgt 0 --thresholdsrc 0 \
    --workers 20

# Copy the BPE code into the data-bin directory for future use
cp examples/backtranslation/wmt18_en_de/code data-bin/wmt18_en_de/code
```

(Optionally) Train a baseline model (English-German) using just the parallel data:
```bash
CHECKPOINT_DIR=checkpoints_en_de_parallel
fairseq-train --fp16 \
    data-bin/wmt18_en_de \
    --source-lang en --target-lang de \
    --arch transformer_wmt_en_de_big --share-all-embeddings \
    --dropout 0.3 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.001 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --max-tokens 3584 --update-freq 16 \
    --max-update 30000 \
    --save-dir $CHECKPOINT_DIR
# Note: the above command assumes 8 GPUs. Adjust `--update-freq` if you have a
# different number of GPUs.
```

Average the last 10 checkpoints:
```bash
python scripts/average_checkpoints.py \
    --inputs $CHECKPOINT_DIR \
    --num-epoch-checkpoints 10 \
    --output $CHECKPOINT_DIR/checkpoint.avg10.pt
```

Evaluate BLEU:
```bash
# tokenized BLEU on newstest2017:
bash examples/backtranslation/tokenized_bleu.sh \
    wmt17 \
    en-de \
    data-bin/wmt18_en_de \
    data-bin/wmt18_en_de/code \
    $CHECKPOINT_DIR/checkpoint.avg10.pt
# BLEU4 = 29.57, 60.9/35.4/22.9/15.5 (BP=1.000, ratio=1.014, syslen=63049, reflen=62152)
# compare to 29.46 in Table 1, which is also for tokenized BLEU

# generally it's better to report (detokenized) sacrebleu though:
bash examples/backtranslation/sacrebleu.sh \
    wmt17 \
    en-de \
    data-bin/wmt18_en_de \
    data-bin/wmt18_en_de/code \
    $CHECKPOINT_DIR/checkpoint.avg10.pt
# BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt17+tok.13a+version.1.4.3 = 29.0 60.6/34.7/22.4/14.9 (BP = 1.000 ratio = 1.013 hyp_len = 62099 ref_len = 61287)
```
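If you want to try the baseline checkpoint interactively, something along the following lines should work. This is only a sketch: the exact flags can vary slightly across fairseq versions, and the data/checkpoint/BPE paths are simply the ones produced by the commands above.

```bash
# Translate a sentence from stdin with the averaged baseline checkpoint (sketch)
echo "Hello world!" | fairseq-interactive data-bin/wmt18_en_de \
    --path $CHECKPOINT_DIR/checkpoint.avg10.pt \
    --source-lang en --target-lang de \
    --tokenizer moses \
    --bpe subword_nmt --bpe-codes data-bin/wmt18_en_de/code \
    --beam 5 --remove-bpe
```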
#### Step 2. Back-translate monolingual German data

Train a reverse model (German-English) to do the back-translation:
```bash
CHECKPOINT_DIR=checkpoints_de_en_parallel
fairseq-train --fp16 \
    data-bin/wmt18_en_de \
    --source-lang de --target-lang en \
    --arch transformer_wmt_en_de_big --share-all-embeddings \
    --dropout 0.3 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.001 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --max-tokens 3584 --update-freq 16 \
    --max-update 30000 \
    --save-dir $CHECKPOINT_DIR
# Note: the above command assumes 8 GPUs. Adjust `--update-freq` if you have a
# different number of GPUs.
```

Let's evaluate the back-translation (BT) model to make sure it is well trained:
```bash
bash examples/backtranslation/sacrebleu.sh \
    wmt17 \
    de-en \
    data-bin/wmt18_en_de \
    data-bin/wmt18_en_de/code \
    $CHECKPOINT_DIR/checkpoint_best.pt
# BLEU+case.mixed+lang.de-en+numrefs.1+smooth.exp+test.wmt17+tok.13a+version.1.4.3 = 34.9 66.9/41.8/28.5/19.9 (BP = 0.983 ratio = 0.984 hyp_len = 63342 ref_len = 64399)
# compare to the best system from WMT'17 which scored 35.1: http://matrix.statmt.org/matrix/systems_list/1868
```

Next prepare the monolingual data:
```bash
# Download and prepare the monolingual data
# By default the script samples 25M monolingual sentences, which after
# deduplication should be just over 24M sentences. These are split into 25
# shards, each with 1M sentences (except for the last shard).
cd examples/backtranslation/
bash prepare-de-monolingual.sh
cd ../..

# Binarize each shard of the monolingual data
TEXT=examples/backtranslation/wmt18_de_mono
for SHARD in $(seq -f "%02g" 0 24); do \
    fairseq-preprocess \
        --only-source \
        --source-lang de --target-lang en \
        --joined-dictionary \
        --srcdict data-bin/wmt18_en_de/dict.de.txt \
        --testpref $TEXT/bpe.monolingual.dedup.${SHARD} \
        --destdir data-bin/wmt18_de_mono/shard${SHARD} \
        --workers 20; \
    cp data-bin/wmt18_en_de/dict.en.txt data-bin/wmt18_de_mono/shard${SHARD}/; \
done
```
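As an optional sanity check (not part of the original recipe, just the paths produced by the loop above), you can confirm that all 25 shards were binarized before starting the expensive generation step:

```bash
# Expect one directory per shard (00-24)
ls -d data-bin/wmt18_de_mono/shard* | wc -l
# 25
```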
Now we're ready to perform back-translation over the monolingual data. The following command generates via sampling, but it's possible to use greedy decoding (`--beam 1`), beam search (`--beam 5`), top-k sampling (`--sampling --beam 1 --sampling-topk 10`), etc.:
```bash
mkdir backtranslation_output
for SHARD in $(seq -f "%02g" 0 24); do \
    fairseq-generate --fp16 \
        data-bin/wmt18_de_mono/shard${SHARD} \
        --path $CHECKPOINT_DIR/checkpoint_best.pt \
        --skip-invalid-size-inputs-valid-test \
        --max-tokens 4096 \
        --sampling --beam 1 \
    > backtranslation_output/sampling.shard${SHARD}.out; \
done
```

After BT, use the `extract_bt_data.py` script to re-combine the shards, extract the back-translations and apply length ratio filters:
```bash
python examples/backtranslation/extract_bt_data.py \
    --minlen 1 --maxlen 250 --ratio 1.5 \
    --output backtranslation_output/bt_data --srclang en --tgtlang de \
    backtranslation_output/sampling.shard*.out

# Ensure lengths are the same:
# wc -l backtranslation_output/bt_data.{en,de}
#   21795614 backtranslation_output/bt_data.en
#   21795614 backtranslation_output/bt_data.de
#   43591228 total
```

Binarize the filtered BT data and combine it with the parallel data:
```bash
TEXT=backtranslation_output
fairseq-preprocess \
    --source-lang en --target-lang de \
    --joined-dictionary \
    --srcdict data-bin/wmt18_en_de/dict.en.txt \
    --trainpref $TEXT/bt_data \
    --destdir data-bin/wmt18_en_de_bt \
    --workers 20

# We want to train on the combined data, so we'll symlink the parallel + BT data
# in the wmt18_en_de_para_plus_bt directory. We link the parallel data as "train"
# and the BT data as "train1", so that fairseq will combine them automatically
# and so that we can use the `--upsample-primary` option to upsample the
# parallel data (if desired).
PARA_DATA=$(readlink -f data-bin/wmt18_en_de)
BT_DATA=$(readlink -f data-bin/wmt18_en_de_bt)
COMB_DATA=data-bin/wmt18_en_de_para_plus_bt
mkdir -p $COMB_DATA
for LANG in en de; do \
    ln -s ${PARA_DATA}/dict.$LANG.txt ${COMB_DATA}/dict.$LANG.txt; \
    for EXT in bin idx; do \
        ln -s ${PARA_DATA}/train.en-de.$LANG.$EXT ${COMB_DATA}/train.en-de.$LANG.$EXT; \
        ln -s ${BT_DATA}/train.en-de.$LANG.$EXT ${COMB_DATA}/train1.en-de.$LANG.$EXT; \
        ln -s ${PARA_DATA}/valid.en-de.$LANG.$EXT ${COMB_DATA}/valid.en-de.$LANG.$EXT; \
        ln -s ${PARA_DATA}/test.en-de.$LANG.$EXT ${COMB_DATA}/test.en-de.$LANG.$EXT; \
    done; \
done
```

#### Step 3. Train an English-German model over the combined parallel + BT data

Finally we can train a model over the parallel + BT data:
```bash
CHECKPOINT_DIR=checkpoints_en_de_parallel_plus_bt
fairseq-train --fp16 \
    data-bin/wmt18_en_de_para_plus_bt \
    --upsample-primary 16 \
    --source-lang en --target-lang de \
    --arch transformer_wmt_en_de_big --share-all-embeddings \
    --dropout 0.3 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.0007 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --max-tokens 3584 --update-freq 16 \
    --max-update 100000 \
    --save-dir $CHECKPOINT_DIR
# Note: the above command assumes 8 GPUs. Adjust `--update-freq` if you have a
# different number of GPUs.
```
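The `--update-freq` note above is about keeping the effective batch size roughly constant. As a rough rule of thumb (derived only from the flags used in these commands, not anything prescribed by the paper), scale `--update-freq` inversely with the number of GPUs:

```bash
# Effective batch size ≈ num_gpus * max_tokens * update_freq
#   8 GPUs * 3584 tokens * 16 ≈ 459k tokens per update
# Hypothetical adjustment for a 4-GPU machine:
NUM_GPUS=4
UPDATE_FREQ=$(( 128 / NUM_GPUS ))   # 32 on 4 GPUs, 16 on 8 GPUs
```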
Average the last 10 checkpoints:
```bash
python scripts/average_checkpoints.py \
    --inputs $CHECKPOINT_DIR \
    --num-epoch-checkpoints 10 \
    --output $CHECKPOINT_DIR/checkpoint.avg10.pt
```

Evaluate BLEU:
```bash
# tokenized BLEU on newstest2017:
bash examples/backtranslation/tokenized_bleu.sh \
    wmt17 \
    en-de \
    data-bin/wmt18_en_de \
    data-bin/wmt18_en_de/code \
    $CHECKPOINT_DIR/checkpoint.avg10.pt
# BLEU4 = 32.35, 64.4/38.9/26.2/18.3 (BP=0.977, ratio=0.977, syslen=60729, reflen=62152)
# compare to 32.35 in Table 1, which is also for tokenized BLEU

# generally it's better to report (detokenized) sacrebleu:
bash examples/backtranslation/sacrebleu.sh \
    wmt17 \
    en-de \
    data-bin/wmt18_en_de \
    data-bin/wmt18_en_de/code \
    $CHECKPOINT_DIR/checkpoint.avg10.pt
# BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt17+tok.13a+version.1.4.3 = 31.5 64.3/38.2/25.6/17.6 (BP = 0.971 ratio = 0.971 hyp_len = 59515 ref_len = 61287)
```

## Citation

```bibtex
@inproceedings{edunov2018backtranslation,
  title = {Understanding Back-Translation at Scale},
  author = {Edunov, Sergey and Ott, Myle and Auli, Michael and Grangier, David},
  booktitle = {Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year = 2018,
}
```