---
language: ro
inference: false
license: apache-2.0
---

This is a **T5v1.1 large** model (**783M** parameters), pretrained from scratch with [t5x](https://github.com/google-research/t5x). It was trained on a clean 80GB Romanian text corpus for 4M steps with these [scripts](https://github.com/dumitrescustefan/t5x_models), using an encoder and decoder sequence length of 512.

**!! IMPORTANT !!** This model was pretrained only on the span corruption MLM task, so it is **not usable** for any downstream task **without finetuning** first!

### How to load a t5x model

```python
from transformers import T5Tokenizer, T5Model

tokenizer = T5Tokenizer.from_pretrained('dumitrescustefan/t5-v1_1-large-romanian')
model = T5Model.from_pretrained('dumitrescustefan/t5-v1_1-large-romanian')

input_ids = tokenizer("Acesta este un test", return_tensors="pt").input_ids  # Batch size 1
decoder_input_ids = tokenizer("Acesta este", return_tensors="pt").input_ids  # Batch size 1

# Preprocess: prepend decoder_input_ids with the start token, which is the pad token for T5Model.
# This is not needed for torch's T5ForConditionalGeneration, which does it internally from the labels arg.
decoder_input_ids = model._shift_right(decoder_input_ids)

# Forward pass
outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
last_hidden_states = outputs.last_hidden_state

print(last_hidden_states.shape)  # this will print [1, 3, 1024]
```

Remember to always sanitize your text! Replace the cedilla letters ``ş`` and ``ţ`` with their comma-below equivalents:

```python
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
```

The model was **not** trained on the cedilla forms ``ş`` and ``ţ``. If you skip this step, performance will drop because of ``<UNK>`` tokens and a higher number of tokens per word.
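
The replacement above can be wrapped in a small reusable helper so that it is applied the same way to every input. This is only a minimal sketch: the names `sanitize_romanian` and `sanitize_batch` are hypothetical and not part of this repository or of `transformers`; only the four character substitutions come from the text above. The characters are written as Unicode escapes so the mapping stays unambiguous regardless of editor or font.

```python
# Hypothetical helper, not part of the model repository.
# Maps cedilla letters to their comma-below equivalents.
_CEDILLA_TO_COMMA = str.maketrans({
    "\u0163": "\u021b",  # t with cedilla (small)   -> t with comma below
    "\u015f": "\u0219",  # s with cedilla (small)   -> s with comma below
    "\u0162": "\u021a",  # T with cedilla (capital) -> T with comma below
    "\u015e": "\u0218",  # S with cedilla (capital) -> S with comma below
})


def sanitize_romanian(text: str) -> str:
    """Return `text` with cedilla letters replaced by comma-below letters."""
    return text.translate(_CEDILLA_TO_COMMA)


def sanitize_batch(texts):
    """Apply `sanitize_romanian` to every string in an iterable."""
    return [sanitize_romanian(t) for t in texts]
```

One way to use it is to call `sanitize_romanian(text)` on each string right before passing it to the tokenizer, both during finetuning and at inference time, so the model always sees the same character forms.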

### Acknowledgements

We'd like to thank [TPU Research Cloud](https://sites.research.google/trc/about/) for providing the TPUv4 cores we used to train these models!

### Authors

Yours truly,

_[Stefan Dumitrescu](https://github.com/dumitrescustefan), [Mihai Ilie](https://github.com/iliemihai) and [Per Egil Kummervold](https://huggingface.co/north)_