Smugri-tuned NLLB-1.3b, v0.01

This is a fine-tune of NLLB-1.3b on parallel data for 29 Finno-Ugric languages. For some of the languages it can generate output in a specific dialect/variety; the supported languages and dialects are listed below.

Details on the training data and other information will be added soon. Training of this model is still in progress: there are several known problems, and the overall quality has not been tested yet. So far only parallel data has been used for training; more dialects will be added once monolingual/synthetic data is included.

Usage in Python, to translate from English to Veps (New written Veps dialect/variety):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("tartuNLP/nllb1.3-smugri4-v0.01")
tokenizer = AutoTokenizer.from_pretrained("tartuNLP/nllb1.3-smugri4-v0.01")

# The tag in angle brackets selects the target dialect/variety.
input_text = "<New written Veps> This is a short example sentence."
source_lang = "eng_Latn"
target_lang = "vep_Latn"

tokenizer.src_lang = source_lang

input_tokenized = tokenizer(input_text, return_tensors="pt")

# Force the target-language code as the first generated token.
output_raw = model.generate(**input_tokenized, forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_lang))

output = tokenizer.decode(output_raw[0], skip_special_tokens=True)

print(output) # should be 'Nece om lühüd ozutezsana.'

# for '<Central Eastern Veps>' the output becomes 'Nece om lühüd naverz’ sanond.'
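
The steps above can be wrapped into a small helper so that the variety tag and language codes are passed as arguments. This is only a minimal sketch: the function names `add_variety_tag` and `translate` are hypothetical, and `max_new_tokens=128` is an arbitrary limit chosen for illustration, not a recommended setting for this model. It assumes `model` and `tokenizer` were loaded as in the example above.

```python
def add_variety_tag(text, variety=None):
    """Prefix text with a '<variety>' tag, mirroring the input format shown above."""
    if variety is None:
        return text
    return f"<{variety}> {text}"


def translate(model, tokenizer, text, source_lang, target_lang,
              variety=None, max_new_tokens=128):
    """Translate one sentence (illustrative helper, not part of the model's API)."""
    tokenizer.src_lang = source_lang
    inputs = tokenizer(add_variety_tag(text, variety), return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        # Force the target-language code as the first generated token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_lang),
        max_new_tokens=max_new_tokens,  # assumed limit, adjust as needed
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With the model loaded, a call might look like `translate(model, tokenizer, "This is a short example sentence.", "eng_Latn", "vep_Latn", variety="Central Eastern Veps")`.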

Supported languages

  • est_Latn (Estonian), fin_Latn (Finnish), fkv_Latn (Kven), izh_Latn (Izhorian*), krl_Latn (Proper Karelian*), liv_Latn (Livonian), lud_Latn (Ludian*), olo_Latn (Livvi-Karelian*), vep_Latn (Veps*), vot_Latn (Votic*), vro_Latn (Võro)
  • sje_Latn (Pite Sami), sju_Latn (Ume Sami), sma_Latn (Southern Sami), sme_Latn (Northern Sami), smj_Latn (Lule Sami), smn_Latn (Inari Sami), sms_Latn (Skolt Sami), sjd_Cyrl (Kildin Sami*)
  • kpv_Cyrl (Komi-Zyrian), koi_Cyrl (Komi-Permyak), udm_Cyrl (Udmurt)
  • mdf_Cyrl (Moksha), myv_Cyrl (Erzya)
  • mhr_Cyrl (Meadow Mari), mrj_Cyrl (Hill Mari)
  • hun_Latn (Hungarian), kca_Cyrl (Khanty*), mns_Cyrl (Mansi)
  • eng_Latn (English), lvs_Latn (Latvian), rus_Cyrl (Russian), nor_Latn (Norwegian)

Supported dialects

  • for Izhorian: alal (Lower Luga), soik (Soikkola)
  • for Votic: I, J, Ja, K, , Ke, Ko, L, Li, Lu, M, P, Po, R, Ra, S, U, V (explanation: https://arhiiv.eki.ee/dict/vadja/lisad/v_lyhendid.pdf)
  • for Proper Karelian: Dyorzha, Ilomantsi, Keret, Kestenga, Kontokki, Korbiselga, Maslozero, Myandyselga, New written Tver, New written karelian, Oulanga, Padany, Panozero, Poduzhemye, Porosozero, Reboly, Rugozero, Suistamo, Suoyarvi, Tikhtozero, Tikhvin, Tolmachi, Tunguda, Uhta, Valdai, Vesyegonsk, Voknavolok, Vychetaibola, Yushkozero
  • for Ludian: Central Ludian (Munozero), Mikhailovskoye, New written Ludian, Northern Ludian (Kondopoga), Southern Ludian (Svjatozero), Miikul (Central Ludian)
  • for Livvi-Karelian: Impilahti, Kondushi, Kotkozero, Nekkula, New written Livvic, Rypushkalitsa, Salmi, Suoyarvi, Syamozero, Tulmozero, Vedlozero, Vidlitsa
  • for Veps: Central Eastern Veps, Central Western Veps, New written Veps, Northern Veps, Southern Veps
  • for Kildin Sami: orth1
  • for Khanty: kazym (Kazym), shuryshkary (Shuryshkar)