---
license: cc-by-4.0
language:
- et
- fi
- kv
- hu
- lv
- 'no'
library_name: fairseq
metrics:
- bleu
- chrf
pipeline_tag: translation
---

# smugri3_14

The TartuNLP multilingual neural machine translation model for low-resource Finno-Ugric languages. The model translates between 27 languages, covering all 702 translation directions.

### Languages Supported

- **High- and mid-resource languages:** Estonian, English, Finnish, Hungarian, Latvian, Norwegian, Russian
- **Low-resource Finno-Ugric languages:** Komi, Komi Permyak, Udmurt, Hill Mari, Meadow Mari, Erzya, Moksha, Proper Karelian, Livvi Karelian, Ludian, Võro, Veps, Livonian, Northern Sami, Southern Sami, Inari Sami, Lule Sami, Skolt Sami, Mansi, Khanty

### Usage

The model can be tested in our [web demo](https://translate.ut.ee/). Running it locally requires [**Fairseq v0.12.2**](https://pypi.org/project/fairseq/0.12.2/).

Example Bash script (a minimal end-to-end invocation is sketched at the end of this card):

```bash
# Source and target languages
src_lang="eng_Latn"
tgt_lang="kpv_Cyrl"

# Directories and paths
model_path=./smugri3_14-finno-ugric-nmt
checkpoint_path=${model_path}/smugri3_14.pt
sp_path=${model_path}/flores200_sacrebleu_tokenizer_spm.ext.model
dictionary_path=${model_path}/nllb_model_dict.ext.txt

# Language settings for fairseq
nllb_langs="eng_Latn,est_Latn,fin_Latn,hun_Latn,lvs_Latn,nob_Latn,rus_Cyrl"
new_langs="kca_Cyrl,koi_Cyrl,kpv_Cyrl,krl_Latn,liv_Latn,lud_Latn,mdf_Cyrl,mhr_Cyrl,mns_Cyrl,mrj_Cyrl,myv_Cyrl,olo_Latn,sma_Latn,sme_Latn,smj_Latn,smn_Latn,sms_Latn,udm_Cyrl,vep_Latn,vro_Latn"

# Run fairseq-interactive; source sentences are read from stdin
fairseq-interactive ${model_path} \
    -s ${src_lang} -t ${tgt_lang} \
    --path ${checkpoint_path} --max-tokens 20000 --buffer-size 1 \
    --beam 4 --lenpen 1.0 \
    --bpe sentencepiece \
    --remove-bpe \
    --lang-tok-style multilingual \
    --sentencepiece-model ${sp_path} \
    --fixed-dictionary ${dictionary_path} \
    --task translation_multi_simple_epoch \
    --decoder-langtok --encoder-langtok src \
    --lang-pairs ${src_lang}-${tgt_lang} \
    --langs "${nllb_langs},${new_langs}" \
    --cpu
```

### Scores

Average scores per target language:

| Target language | BLEU | chrF | chrF++ |
| --------------- | ----- | ----- | ------ |
| ru | 24.82 | 51.81 | 49.08 |
| en | 28.24 | 55.91 | 53.73 |
| et | 18.66 | 51.72 | 47.69 |
| fi | 15.45 | 50.04 | 45.38 |
| hun | 16.73 | 47.38 | 44.19 |
| lv | 18.15 | 49.04 | 45.54 |
| nob | 14.43 | 45.64 | 42.29 |
| kpv | 10.73 | 42.34 | 38.50 |
| liv | 5.16 | 29.95 | 27.28 |
| mdf | 5.27 | 37.66 | 32.99 |
| mhr | 8.51 | 43.42 | 38.76 |
| mns | 2.45 | 27.75 | 24.03 |
| mrj | 7.30 | 40.81 | 36.40 |
| myv | 4.72 | 38.74 | 33.80 |
| olo | 4.63 | 34.43 | 30.00 |
| udm | 7.50 | 40.07 | 35.72 |
| krl | 9.39 | 42.74 | 38.24 |
| vro | 8.64 | 39.89 | 35.97 |
| vep | 6.73 | 38.15 | 33.91 |
| lud | 3.11 | 31.50 | 27.30 |

Scores for all translation directions are listed in [this spreadsheet](https://docs.google.com/spreadsheets/d/1H-hLAvIxJ5TbMmECZqza6G5jfAjh90pmJdszwajwHiI/). Evaluated on the [Smugri Flores testset](https://huggingface.co/datasets/tartuNLP/smugri-flores-testset).
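
### Example

A minimal end-to-end sketch, assuming the script above has been saved as `translate.sh` (a hypothetical name) and the model files are already in `./smugri3_14-finno-ugric-nmt`. `fairseq-interactive` reads source sentences from standard input and prints each hypothesis on a tab-separated line prefixed with `H-`, so the translation text can be extracted with standard shell tools:

```bash
# Install the fairseq version pinned in this card.
pip install fairseq==0.12.2

# Translate one English sentence into Komi (kpv_Cyrl). fairseq-interactive
# prints hypotheses as "H-<id>\t<score>\t<text>", so keep the third field.
echo "Hello, how are you?" | bash translate.sh | grep "^H-" | cut -f3
```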