|
--- |
|
license: mit |
|
|
|
language: |
|
- en |
|
|
|
widget: |
|
- text: "Let us translate some text from Livonian to Võro!" |
|
--- |
|
|
|
# NMT for Finno-Ugric Languages |
|
|
|
This is an NMT system for translating between Võro, Livonian, North Sami and South Sami, as well as Estonian, Finnish, Latvian and English. It was created by fine-tuning Facebook's m2m100-418M model on the liv4ever and smugri datasets.
|
|
|
## Tokenizer |
|
Four language codes were added to the tokenizer: `__liv__`, `__vro__`, `__sma__` and `__sme__`. The m2m100 tokenizer currently builds its language mappings from a hard-coded list, so these mappings have to be patched after loading; see the code example below.
|
|
|
## Usage example |
|
Install the transformers and sentencepiece libraries: `pip install sentencepiece transformers` |
|
|
|
```
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tartuNLP/m2m100_418M_smugri")

# Fix the language codes in the tokenizer: merge the newly added language
# tokens into the mappings built from the hard-coded language list
tokenizer.id_to_lang_token = dict(list(tokenizer.id_to_lang_token.items()) + list(tokenizer.added_tokens_decoder.items()))
tokenizer.lang_token_to_id = dict(list(tokenizer.lang_token_to_id.items()) + list(tokenizer.added_tokens_encoder.items()))
tokenizer.lang_code_to_token = { k.replace("_", ""): k for k in tokenizer.additional_special_tokens }
tokenizer.lang_code_to_id = { k.replace("_", ""): v for k, v in tokenizer.lang_token_to_id.items() }

model = AutoModelForSeq2SeqLM.from_pretrained("tartuNLP/m2m100_418M_smugri")

# Source language: Livonian
tokenizer.src_lang = 'liv'
encoded_src = tokenizer("Līvõ kēļ jelāb!", return_tensors="pt")

# Target language: North Sami; the decoder is forced to start with its language token
encoded_out = model.generate(**encoded_src, forced_bos_token_id=tokenizer.get_lang_id("sme"))
print(tokenizer.batch_decode(encoded_out, skip_special_tokens=True))
```
|
|
|
The output is `Livčča giella eallá.` |
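

The same loaded model and tokenizer can translate in the other supported directions by changing the source and target codes. A minimal sketch, wrapping the steps above into a helper function (the `translate` name is our own, not part of the library):

```
def translate(text, src_lang, tgt_lang):
    # Encode the input under the chosen source language code
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt")
    # Force the decoder to start with the target language token
    generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id(tgt_lang))
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

# For example, Livonian to Võro:
print(translate("Līvõ kēļ jelāb!", "liv", "vro"))
```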