dumitrescustefan/mt5-base-romanian

	---
	language: ro
	inference: false
	license: apache-2.0
	---

	This is a pretrained [MT5](https://github.com/google-research/multilingual-t5) base model (390M parameters).

	Training was performed with the span corruption task on a clean 80GB Romanian text corpus for 4M total steps with these [scripts](https://github.com/dumitrescustefan/t5x_models), starting from the 1M public mt5x-base checkpoint. The model was trained with an encoder sequence length of 512 and a decoder sequence length of 256; it has the same mt5x vocabulary as the 1M multilingual checkpoint.

	!! IMPORTANT !! This model was pretrained only on the span corruption MLM task, so it is not usable for any downstream task without finetuning first!
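To make the note above concrete, here is a minimal sketch of what a span corruption training pair looks like. It is an illustration only, not the preprocessing used for this model: the helper `span_corrupt` is hypothetical, the masked spans are picked by hand (a real pipeline samples them randomly), and it splits on whitespace instead of using the model's subword vocabulary. The `<extra_id_N>` sentinel names follow the convention of the Hugging Face T5 tokenizers.

```python
def span_corrupt(tokens, spans):
    """Build a span corruption (input, target) pair.

    tokens: list of string tokens.
    spans:  sorted, non-overlapping (start, end) index pairs to mask out
            (end is exclusive). Hand-picked here; an assumption for the demo.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        # Keep the text before the span, then replace the span with a sentinel.
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        # The target lists each sentinel followed by the tokens it hides.
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    # A final sentinel closes the target sequence.
    targets.append(f"<extra_id_{len(spans)}>")
    return inputs, targets


tokens = "Acesta este un test simplu de corupere".split()
corrupted, target = span_corrupt(tokens, [(1, 3), (5, 6)])
print(" ".join(corrupted))  # Acesta <extra_id_0> test simplu <extra_id_1> corupere
print(" ".join(target))     # <extra_id_0> este un <extra_id_1> de <extra_id_2>
```

Since the checkpoint has only ever been asked to fill in sentinel-marked gaps like these, it has to be finetuned before it can produce summaries, translations, answers or any other task-specific output.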

	### How to load an mt5x model

	```python
	from transformers import MT5Model, T5Tokenizer

	# Load the pretrained encoder-decoder (without a language modeling head) and its tokenizer
	model = MT5Model.from_pretrained('dumitrescustefan/mt5-base-romanian')
	tokenizer = T5Tokenizer.from_pretrained('dumitrescustefan/mt5-base-romanian')
	input_text = "Acesta este un test."
	target_text = "Acesta este"
	inputs = tokenizer(input_text, return_tensors="pt")
	labels = tokenizer(text_target=target_text, return_tensors="pt")

	# Encode the input and run the decoder over the target token ids
	outputs = model(input_ids=inputs["input_ids"], decoder_input_ids=labels["input_ids"])
	hidden_states = outputs.last_hidden_state
	print(hidden_states.shape) # torch.Size([1, 4, 768]): (batch, target tokens, hidden size)
	```

	Remember to always sanitize your text! Replace the cedilla letters ``ş`` and ``ţ`` with their comma-below counterparts ``ș`` and ``ț``:
	```python
	text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
	```
	The model was not trained on the cedilla letters ``ş`` and ``ţ``. If you skip this step, performance will drop because of ``<UNK>`` tokens and a higher number of tokens per word.
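The replacement above can be wrapped in a small reusable helper. The sketch below is one possible way to do it: the name `sanitize` is hypothetical, and the NFC normalization step is an addition that goes beyond the card. That step is there on the assumption that some inputs may spell the cedilla letters as a base letter plus a combining cedilla, which the plain `replace` chain would miss.

```python
import unicodedata

# Legacy cedilla letters -> comma-below letters (code points spelled out
# explicitly so the two visually similar forms cannot be confused).
_CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # ş -> ș
    "\u0163": "\u021b",  # ţ -> ț
    "\u015e": "\u0218",  # Ş -> Ș
    "\u0162": "\u021a",  # Ţ -> Ț
})


def sanitize(text: str) -> str:
    """Return text with cedilla s/t replaced by comma-below s/t."""
    # Assumption beyond the card: compose "s" + combining cedilla (U+0327)
    # into the single precomposed letter first, so the mapping catches it.
    text = unicodedata.normalize("NFC", text)
    return text.translate(_CEDILLA_TO_COMMA)


print(sanitize("\u015ftiin\u0163\u0103"))  # prints "știință" with comma-below letters
```

Call `sanitize` on every string before it reaches the tokenizer, both when finetuning and at inference time.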

	### Acknowledgements

	We'd like to thank [TPU Research Cloud](https://sites.research.google/trc/about/) for providing the TPUv4 cores we used to train these models!

	### Authors

	Yours truly,

	_[Stefan Dumitrescu](https://github.com/dumitrescustefan), [Mihai Ilie](https://github.com/iliemihai) and [Per Egil Kummervold](https://huggingface.co/north)_