smugri3_14

The TartuNLP Multilingual Neural Machine Translation model for low-resource Finno-Ugric languages. The model can translate in 702 directions, between 27 languages.

Languages Supported

  • High and Mid-Resource Languages: Estonian, English, Finnish, Hungarian, Latvian, Norwegian, Russian
  • Low-Resource Finno-Ugric Languages: Komi, Komi Permyak, Udmurt, Hill Mari, Meadow Mari, Erzya, Moksha, Proper Karelian, Livvi Karelian, Ludian, Võro, Veps, Livonian, Northern Sami, Southern Sami, Inari Sami, Lule Sami, Skolt Sami, Mansi, Khanty
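
The direction count stated above follows from the two language lists: with 7 high- and mid-resource languages and 20 low-resource languages, every ordered pair of distinct languages is one translation direction. The following is a minimal sketch of that arithmetic; the variable names and the `count_directions` helper are illustrative and are not part of the model or of Fairseq.

```shell
# Number of languages in each group, counted from the lists above.
high_mid=7
low_resource=20

# Illustrative helper: each ordered (source, target) pair with
# source != target counts as one translation direction.
count_directions() {
  echo $(( $1 * ($1 - 1) ))
}

total=$(( high_mid + low_resource ))
echo "${total} languages, $(count_directions "${total}") directions"
# prints "27 languages, 702 directions"
```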

Usage

The model can be tested in our web demo.

To use this model for translation, you will need Fairseq v0.12.2.

Bash script example:

# Define target and source languages
src_lang="eng_Latn"
tgt_lang="kpv_Cyrl"

# Directories and paths
model_path=./smugri3_14-finno-ugric-nmt
checkpoint_path=${model_path}/smugri3_14.pt
sp_path=${model_path}/flores200_sacrebleu_tokenizer_spm.ext.model
dictionary_path=${model_path}/nllb_model_dict.ext.txt

# Language settings for fairseq
nllb_langs="eng_Latn,est_Latn,fin_Latn,hun_Latn,lvs_Latn,nob_Latn,rus_Cyrl"
new_langs="kca_Cyrl,koi_Cyrl,kpv_Cyrl,krl_Latn,liv_Latn,lud_Latn,mdf_Cyrl,mhr_Cyrl,mns_Cyrl,mrj_Cyrl,myv_Cyrl,olo_Latn,sma_Latn,sme_Latn,smj_Latn,smn_Latn,sms_Latn,udm_Cyrl,vep_Latn,vro_Latn"

# Start an interactive session (source sentences are read from standard input)
fairseq-interactive ${model_path} \
  -s ${src_lang} -t ${tgt_lang} \
  --path ${checkpoint_path} --max-tokens 20000 --buffer-size 1 \
  --beam 4 --lenpen 1.0 \
  --bpe sentencepiece \
  --remove-bpe \
  --lang-tok-style multilingual \
  --sentencepiece-model ${sp_path} \
  --fixed-dictionary ${dictionary_path} \
  --task translation_multi_simple_epoch \
  --decoder-langtok --encoder-langtok src \
  --lang-pairs ${src_lang}-${tgt_lang} \
  --langs "${nllb_langs},${new_langs}" \
  --cpu
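
Since the script takes `src_lang` and `tgt_lang` as free-form strings, it may help to check them against the supported language codes before loading the checkpoint. The following is a minimal sketch under that assumption; `is_supported_lang` and `check_lang_pair` are hypothetical helpers written for illustration and are not part of Fairseq or of this model's distribution.

```shell
# Same language lists as in the script above.
nllb_langs="eng_Latn,est_Latn,fin_Latn,hun_Latn,lvs_Latn,nob_Latn,rus_Cyrl"
new_langs="kca_Cyrl,koi_Cyrl,kpv_Cyrl,krl_Latn,liv_Latn,lud_Latn,mdf_Cyrl,mhr_Cyrl,mns_Cyrl,mrj_Cyrl,myv_Cyrl,olo_Latn,sma_Latn,sme_Latn,smj_Latn,smn_Latn,sms_Latn,udm_Cyrl,vep_Latn,vro_Latn"

# Hypothetical helper: succeeds if code $1 occurs in the
# comma-separated list $2 (matched as a whole list element).
is_supported_lang() {
  case ",$2," in
    *",$1,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: checks a source/target pair before Fairseq is started.
check_lang_pair() {
  all_langs="${nllb_langs},${new_langs}"
  if ! is_supported_lang "$1" "${all_langs}"; then
    echo "unsupported source language: $1" >&2
    return 1
  fi
  if ! is_supported_lang "$2" "${all_langs}"; then
    echo "unsupported target language: $2" >&2
    return 1
  fi
  if [ "$1" = "$2" ]; then
    echo "source and target languages must differ" >&2
    return 1
  fi
}

check_lang_pair "eng_Latn" "kpv_Cyrl" && echo "eng_Latn-kpv_Cyrl: ok"
```

In a wrapper script, one possible use is `check_lang_pair "${src_lang}" "${tgt_lang}" || exit 1` placed just before the `fairseq-interactive` call, so that a mistyped code fails immediately rather than after the model has been loaded.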

Scores

Average scores per target language:

to-lang   BLEU   chrF  chrF++
ru       24.82  51.81   49.08
en       28.24  55.91   53.73
et       18.66  51.72   47.69
fi       15.45  50.04   45.38
hun      16.73  47.38   44.19
lv       18.15  49.04   45.54
nob      14.43  45.64   42.29
kpv      10.73  42.34   38.50
liv       5.16  29.95   27.28
mdf       5.27  37.66   32.99
mhr       8.51  43.42   38.76
mns       2.45  27.75   24.03
mrj       7.30  40.81   36.40
myv       4.72  38.74   33.80
olo       4.63  34.43   30.00
udm       7.50  40.07   35.72
krl       9.39  42.74   38.24
vro       8.64  39.89   35.97
vep       6.73  38.15   33.91
lud       3.11  31.50   27.30

All direction scores.

Evaluated on the Smugri Flores test set.
