dumitrescustefan/mt5-large-romanian

This is a pretrained MT5 large model (973M parameters).

Training was performed with the span corruption task on a clean 80GB Romanian text corpus for 4M total steps with these scripts, starting from the public 1M-step mt5x-large checkpoint. The model was trained with an encoder and decoder sequence length of 512, and uses the same mt5x vocabulary as that multilingual checkpoint.

!! IMPORTANT !! This model was pretrained on the span corruption MLM task, meaning this model is not usable in any downstream task without finetuning first!
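
To make the note above concrete, here is a minimal sketch of what a span corruption training pair looks like in the T5 sentinel format. It is only an illustration: `span_corrupt` is a hypothetical helper, the masked spans are picked by hand, and it splits on whitespace, whereas real pretraining samples spans randomly and works on SentencePiece tokens. The exact corruption settings used for this model are not stated here.

```python
def span_corrupt(tokens, spans):
    """Build a (corrupted input, target) pair in the T5 sentinel format.

    tokens: list of word-level tokens (illustration only).
    spans:  sorted, non-overlapping (start, end) index pairs to mask, end exclusive.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])   # keep text before the span
        inputs.append(sentinel)               # replace the span by one sentinel
        targets.append(sentinel)              # target repeats the sentinel...
        targets.extend(tokens[start:end])     # ...followed by the masked text
        cursor = end
    inputs.extend(tokens[cursor:])            # keep text after the last span
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inputs), " ".join(targets)


tokens = "Acesta este un test pentru modelul nostru .".split()
corrupted, target = span_corrupt(tokens, [(1, 3), (5, 6)])
print(corrupted)  # Acesta <extra_id_0> test pentru <extra_id_1> nostru .
print(target)     # <extra_id_0> este un <extra_id_1> modelul <extra_id_2>
```

A model trained only on pairs like these has learned to fill in sentinel spans, which is why it needs finetuning before it can do summarization, translation or any other downstream task.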

How to load an mt5x model

from transformers import MT5Model, T5Tokenizer

# Load the pretrained encoder-decoder (no LM head) and its tokenizer
model = MT5Model.from_pretrained('dumitrescustefan/mt5-large-romanian')
tokenizer = T5Tokenizer.from_pretrained('dumitrescustefan/mt5-large-romanian')
input_text = "Acesta este un test."
target_text = "Acesta este"
# Tokenize the encoder input and the decoder target separately
inputs = tokenizer(input_text, return_tensors="pt")
labels = tokenizer(text_target=target_text, return_tensors="pt")

# Forward pass: the decoder is fed the target token ids directly
outputs = model(input_ids=inputs["input_ids"], decoder_input_ids=labels["input_ids"])
hidden_states = outputs.last_hidden_state  # (batch, target length, hidden size)
print(hidden_states.shape)  # this will print torch.Size([1, 4, 1024])

Remember to always sanitize your text! Replace the cedilla letters ş and ţ with their comma-below counterparts ș and ț:

text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")

The model was not trained on the cedilla letters ş and ţ. If you skip this step, performance will drop because of <UNK> tokens and a higher number of tokens per word.
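
The replacement above can also be wrapped in a small reusable helper. The sketch below does the same four substitutions in a single pass with `str.translate`. The names `CEDILLA_TO_COMMA` and `sanitize` are hypothetical, not part of the model or of any library, and the code points are written as escapes so the mapping stays unambiguous regardless of editor or font.

```python
# Map the cedilla letters to the comma-below letters the model was trained on.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
})


def sanitize(text: str) -> str:
    """Return text with the four cedilla letters replaced by comma-below letters."""
    return text.translate(CEDILLA_TO_COMMA)


raw = "\u015fi \u0163ar\u0103"   # cedilla spelling
clean = sanitize(raw)            # comma-below spelling
print(clean)
```

Calling `sanitize` on every string before it reaches the tokenizer, for both finetuning data and inference inputs, keeps the text consistent with what the model saw during pretraining.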

Acknowledgements

We'd like to thank TPU Research Cloud for providing the TPUv4 cores we used to train these models!

Authors

Yours truly,

Stefan Dumitrescu, Mihai Ilie and Per Egil Kummervold