Flores101: Large-Scale Multilingual Machine Translation

Introduction

Baseline pretrained models for the small and large tracks of the WMT 21 Large-Scale Multilingual Machine Translation competition.

Flores Task at WMT 21: http://www.statmt.org/wmt21/large-scale-multilingual-translation-task.html

Flores announcement blog post: https://ai.facebook.com/blog/flores-researchers-kick-off-multilingual-translation-challenge-at-wmt-and-call-for-compute-grants/

Pretrained models

Model | Num layers | Embed dimension | FFN dimension | Vocab size | #params | Download
flores101_mm100_615M | 12 | 1024 | 4096 | 256,000 | 615M | https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_615M.tar.gz
flores101_mm100_175M | 6 | 512 | 2048 | 256,000 | 175M | https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_175M.tar.gz

These models are trained similarly to M2M-100, with additional support for the languages that are part of the WMT Large-Scale Multilingual Machine Translation track. The full list of languages is at the bottom.

Example Generation code

Download the model and SentencePiece vocab

fairseq=/path/to/fairseq
cd $fairseq

# Download 615M param model.
wget https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_615M.tar.gz

# Extract the archive.
tar -xvzf flores101_mm100_615M.tar.gz
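
Before moving on, it can help to confirm that the extracted directory contains the files the later steps refer to (model.pt, dict.txt, sentencepiece.bpe.model, language_pairs.txt). A minimal sketch; check_model_dir is a hypothetical helper written for this README, not part of fairseq:

```shell
# Hypothetical helper: report any file the later steps expect but cannot find.
# The file names are the ones referenced by the commands in this README.
check_model_dir() {
    dir="$1"
    missing=0
    for f in model.pt dict.txt sentencepiece.bpe.model language_pairs.txt; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    # Returns 0 when every file is present, 1 otherwise.
    return "$missing"
}

# Usage (after extraction):
#   check_model_dir flores101_mm100_615M && echo "all files present"
```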

Encode using our SentencePiece Model

Note: Install SentencePiece first (https://github.com/google/sentencepiece).

fairseq=/path/to/fairseq
cd $fairseq

# Download an example dataset (German to French)
sacrebleu --echo src -l de-fr -t wmt19 | head -n 20 > raw_input.de-fr.de
sacrebleu --echo ref -l de-fr -t wmt19 | head -n 20 > raw_input.de-fr.fr

for lang in de fr ; do
    python scripts/spm_encode.py \
        --model flores101_mm100_615M/sentencepiece.bpe.model \
        --output_format=piece \
        --inputs=raw_input.de-fr.${lang} \
        --outputs=spm.de-fr.${lang}
done
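
Binarization expects the source and target files to stay line-aligned, and the encoding script may skip lines it cannot keep (empty lines, for example). The sketch below compares line counts before and after encoding; same_line_count is a hypothetical helper, not something shipped with fairseq:

```shell
# Hypothetical helper: succeed only if both files have the same number of lines.
same_line_count() {
    # $(( ... )) normalizes any padding that wc adds on some platforms.
    a=$(($(wc -l < "$1")))
    b=$(($(wc -l < "$2")))
    [ "$a" -eq "$b" ]
}

# Usage (after the encoding loop above):
#   for lang in de fr; do
#       same_line_count raw_input.de-fr.$lang spm.de-fr.$lang \
#           || echo "line count changed for $lang" >&2
#   done
```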

Binarization

fairseq-preprocess \
    --source-lang de --target-lang fr \
    --testpref spm.de-fr \
    --thresholdsrc 0 --thresholdtgt 0 \
    --destdir data_bin \
    --srcdict flores101_mm100_615M/dict.txt --tgtdict flores101_mm100_615M/dict.txt

Generation

fairseq-generate \
    data_bin \
    --batch-size 1 \
    --path flores101_mm100_615M/model.pt \
    --fixed-dictionary flores101_mm100_615M/dict.txt \
    -s de -t fr \
    --remove-bpe 'sentencepiece' \
    --beam 5 \
    --task translation_multi_simple_epoch \
    --lang-pairs flores101_mm100_615M/language_pairs.txt \
    --decoder-langtok --encoder-langtok src \
    --gen-subset test \
    --fp16 \
    --dataset-impl mmap \
    --distributed-world-size 1 --distributed-no-spawn
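
fairseq-generate prints sources, references and hypotheses interleaved on stdout. To score the output, the hypotheses need to be pulled out and put back in input order. The sketch below assumes the usual layout in which each hypothesis line looks like H-<id>, a tab, a score, a tab, then the text; extract_hypotheses and the file names gen_out and hyp.de-fr.fr are illustrative choices, not part of fairseq:

```shell
# Hypothetical helper: print hypothesis text from a saved fairseq-generate log,
# ordered by sentence id (assumes lines of the form "H-<id>\t<score>\t<text>").
extract_hypotheses() {
    tab=$(printf '\t')
    grep '^H-' "$1" \
        | sed 's/^H-//' \
        | sort -t "$tab" -k1,1n \
        | cut -f 3-
}

# Usage, with the generation command above redirected to a file:
#   fairseq-generate data_bin ... > gen_out
#   extract_hypotheses gen_out > hyp.de-fr.fr
#   sacrebleu raw_input.de-fr.fr < hyp.de-fr.fr
```

Sorting numerically on the id matters because batching can reorder sentences, and a plain text sort would place H-10 before H-2.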

Supported Languages and lang code

Language lang code
Afrikaans af
Amharic am
Arabic ar
Assamese as
Asturian ast
Aymara ay
Azerbaijani az
Bashkir ba
Belarusian be
Bulgarian bg
Bengali bn
Breton br
Bosnian bs
Catalan ca
Cebuano ceb
Chokwe cjk
Czech cs
Welsh cy
Danish da
German de
Dyula dyu
Greek el
English en
Spanish es
Estonian et
Persian fa
Fulah ff
Finnish fi
French fr
Western Frisian fy
Irish ga
Scottish Gaelic gd
Galician gl
Gujarati gu
Hausa ha
Hebrew he
Hindi hi
Croatian hr
Haitian Creole ht
Hungarian hu
Armenian hy
Indonesian id
Igbo ig
Iloko ilo
Icelandic is
Italian it
Japanese ja
Javanese jv
Georgian ka
Kachin kac
Kamba kam
Kabuverdianu kea
Kongo kg
Kazakh kk
Central Khmer km
Kimbundu kmb
Northern Kurdish kmr
Kannada kn
Korean ko
Kurdish ku
Kyrgyz ky
Luxembourgish lb
Ganda lg
Lingala ln
Lao lo
Lithuanian lt
Luo luo
Latvian lv
Malagasy mg
Maori mi
Macedonian mk
Malayalam ml
Mongolian mn
Marathi mr
Malay ms
Maltese mt
Burmese my
Nepali ne
Dutch nl
Norwegian no
Northern Sotho ns
Nyanja ny
Occitan oc
Oromo om
Oriya or
Punjabi pa
Polish pl
Pashto ps
Portuguese pt
Quechua qu
Romanian ro
Russian ru
Sindhi sd
Shan shn
Sinhala si
Slovak sk
Slovenian sl
Shona sn
Somali so
Albanian sq
Serbian sr
Swati ss
Sundanese su
Swedish sv
Swahili sw
Tamil ta
Telugu te
Tajik tg
Thai th
Tigrinya ti
Tagalog tl
Tswana tn
Turkish tr
Ukrainian uk
Umbundu umb
Urdu ur
Uzbek uz
Vietnamese vi
Wolof wo
Xhosa xh
Yiddish yi
Yoruba yo
Chinese zh
Zulu zu
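
To translate a different direction, the lang codes above take the place of de and fr in the encoding, binarization and generation commands. The sketch below checks whether a given direction is listed in language_pairs.txt before running anything. It assumes the file holds src-tgt pairs separated by commas and/or newlines, which has not been verified here; pair_supported is a hypothetical helper:

```shell
# Hypothetical helper: succeed if the pair (e.g. "de-fr") is listed in the file.
# Assumes pairs are written as src-tgt and separated by commas and/or newlines.
pair_supported() {
    tr ',' '\n' < "$1" | tr -d ' \r' | grep -xF -- "$2" > /dev/null
}

# Usage:
#   pair_supported flores101_mm100_615M/language_pairs.txt de-fr \
#       && echo "de-fr is listed"
```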