PaLe-MADLAD

PaLe-MADLAD is a MADLAD-400 model fine-tuned to translate from Proper Karelian, Livvi, Ludian, and Veps to Russian and vice versa. We call this model paragraph-level because we trained it on paragraphs comprising multiple sentences. The model demonstrates the capacity to handle gender-neutral pronouns, which are a major obstacle when translating from Finno-Ugric languages, as well as other discourse-level phenomena.

Example Usage for Inference

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained('tartuNLP/pale-madlad-mt')
tokenizer = AutoTokenizer.from_pretrained('tartuNLP/pale-madlad-mt')

# You need to explicitly prepend a target language tag to the input string in the format <2xx>, where xx stands for the language code.
# Language codes: 'krl' for Proper Karelian, 'lud' for Ludian, 'olo' for Livvi, 'vep' for Veps, 'ru' for Russian, 'en' for English.
text = '<2krl>' + 'Здравствуйте!'

inputs = tokenizer(text, return_tensors='pt').input_ids
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: Terveh!
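
Since the target language tag has to be prepended by hand, a small helper can guard against typos in the language code. The sketch below is only an illustration: `LANGUAGE_CODES` and `add_target_tag` are hypothetical names that are not part of the model or of `transformers`, and the set of codes is simply copied from the comments in the usage example.

```python
# Language codes as listed in the usage example (copied from its comments).
LANGUAGE_CODES = {
    'krl': 'Proper Karelian',
    'lud': 'Ludian',
    'olo': 'Livvi',
    'vep': 'Veps',
    'ru': 'Russian',
    'en': 'English',
}


def add_target_tag(text, target_lang):
    """Prepend the <2xx> target language tag to a source text.

    Hypothetical helper: it only builds the input string and
    does not call the model or the tokenizer.
    """
    if target_lang not in LANGUAGE_CODES:
        raise ValueError(f'Unknown language code: {target_lang!r}')
    return f'<2{target_lang}>' + text


# Tag several source texts for translation into Proper Karelian.
tagged = [add_target_tag(p, 'krl') for p in ['Здравствуйте!', 'Спасибо!']]
print(tagged)
```

Each tagged string can be passed to the tokenizer in place of `text` in the usage example. When tokenizing several strings at once, the tokenizer call will most likely also need `padding=True` so that the inputs form a rectangular batch.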

Please cite the following paper if you use this model in your work:

@inproceedings{
pashchenko2024paragraphlevel,
title={Paragraph-Level Machine Translation for Low-Resource Finno-Ugric Languages},
author={Dmytro Pashchenko and Lisa Yankovskaya and Mark Fishel},
booktitle={The Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies},
year={2024},
url={https://openreview.net/forum?id=uTFJsQpNZk}
}