---
language:
- af
- am
- ar
- az
- be
- bg
- bn
- ca
- ceb
- co
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fil
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- haw
- he
- hi
- hmn
- ht
- hu
- hy
- id
- ig
- is
- it
- iw
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lb
- lo
- lt
- lv
- mg
- mi
- mk
- ml
- mn
- mr
- ms
- mt
- my
- ne
- nl
- 'no'
- ny
- pa
- pl
- ps
- pt
- ro
- ru
- sd
- si
- sk
- sl
- sm
- sn
- so
- sq
- sr
- st
- su
- sv
- sw
- ta
- te
- tg
- th
- tr
- uk
- und
- ur
- uz
- vi
- xh
- yi
- yo
- zh
- zu
license: mit
datasets:
- mc4
---

# MyT5

## Model Details

MyT5 (**My**te **T5**) is a multilingual language model based on the T5 architecture. The model uses a **m**orphologically-driven **byte** (**MYTE**) representation described in our paper [Limisiewicz et al., 2024](https://arxiv.org/pdf/2403.10691.pdf).

### Model Description

- **Developed by:** Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer
- **Funded by:** University of Washington Fellowship, Charles University Grant Agency
- **Model type:** T5
- **Language(s) (NLP):** Multilingual
- **License:** MIT

### Model Sizes

- **[Small](https://huggingface.co/Tomlim/myt5-small)**: 300M parameters
- **[Base](https://huggingface.co/Tomlim/myt5-base)**: 582M parameters
- **[Large](https://huggingface.co/Tomlim/myt5-large)**: 1.2B parameters

### Model Sources

- **[Repository](https://github.com/tomlimi/MYTE)**
- **[Paper](https://arxiv.org/pdf/2403.10691.pdf)**

## How to Get Started with the Model

The snippet below shows basic usage of the model for multilingual language modeling. The custom tokenizer is available in the [GitHub repository](https://github.com/tomlimi/MYTE), in `src/myt5/myt5_tokenizer.py`. We also plan to release it on Hugging Face in the future.

```python
from transformers import T5ForConditionalGeneration
from src.myt5.myt5_tokenizer import MyT5Tokenizer
import torch

MODEL_SIZE = "large"  # small, base, or large

# Load the pretrained model and the custom MYTE tokenizer.
model = T5ForConditionalGeneration.from_pretrained(f"Tomlim/myt5-{MODEL_SIZE}", use_safetensors=True)
tokenizer = MyT5Tokenizer()

pre_texts = ['"We now have',
             '„Mamy teraz myszy w wieku',
             '"""எங்களிடம் இப்போது']
post_texts = ['4-month-old mice that are non-diabetic that used to be diabetic," he added.',
              '4 miesięcy, które miały cukrzycę, ale zostały z niej wyleczone” – dodał.',
              '4-மாத-வயதுடைய எலி ஒன்று உள்ளது, முன்னர் அதற்கு நீரிழிவு இருந்தது தற்போது இல்லை"" என்று அவர் மேலும் கூறினார்."']

# Encode the prefixes as encoder inputs and the continuations as labels.
inputs = tokenizer(pre_texts, padding="longest", return_tensors="pt")
targets = tokenizer(post_texts, padding="longest", return_tensors="pt")

# Forward pass; the logits can be turned into next-byte probabilities.
outputs = model(**inputs, labels=targets.input_ids)
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
```
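The same objects can also be used for generation. The short sketch below is illustrative rather than taken from the paper: it reuses `model`, `tokenizer`, and `inputs` from the snippet above and assumes the custom tokenizer exposes the standard `batch_decode` interface; the decoding settings are arbitrary.

```python
# Greedy decoding of continuations for the encoded prefixes.
# Assumption: MyT5Tokenizer provides the usual decode/batch_decode methods.
generated_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
```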
## Training Details

### Training Data

The model was trained on the standard T5 task of restoring corrupted spans, using the multilingual mC4 dataset.

### Preprocessing

Instead of UTF-8 bytes, we used the morphologically-driven byte (MYTE) representation. See the description in our [paper](https://arxiv.org/pdf/2403.10691.pdf) for more details.

### Training Hyperparameters

We used the same hyperparameters as in the original ByT5 paper. The only difference is that we decreased the number of training steps to 250,000 to avoid overfitting.

### Computational Infrastructure

Models were trained on TPUs available through the TPU Research Cloud (TRC). We used a v3-8 TPU to train the small and base models and a v3-32 TPU for the large model. Training each model took:

- **Small**: 90h
- **Base**: 230h
- **Large**: 190h

## Evaluation

MyT5 models are compared with our reimplementation of [ByT5](https://huggingface.co/docs/transformers/model_doc/byt5) models, trained for the same 250,000 steps.

### Language Modeling

We evaluated language modeling performance on the multi-parallel [FLORES 200](https://arxiv.org/pdf/2207.04672v3.pdf) corpus. To make scores comparable across languages and models, we used a normalized metric: Bits-per-English-Byte (BPEB).

#### Results

| Model size | Languages | ByT5 BPEB | ByT5 T (ms) | MyT5 BPEB | MyT5 T (ms) |
|------------|-----------|-----------|-------------|-----------|-------------|
| small      | All       | 10.1      | 7.0         | 4.6       | 6.7         |
| small      | Latin     | 4.6       | 5.9         | 4.2       | 6.6         |
| small      | Non-Latin | 18.1      | 8.5         | 5.1       | 6.8         |
| base       | All       | 8.2       | 11.5        | 5.8       | 8.9         |
| base       | Latin     | 4.9       | 9.4         | 5.0       | 8.7         |
| base       | Non-Latin | 13.0      | 14.6        | 6.9       | 9.1         |
| large      | All       | 13.4      | 31.8        | 4.6       | 26.7        |
| large      | Latin     | 10.1      | 28.1        | 4.0       | 26.6        |
| large      | Non-Latin | 18.2      | 37.3        | 5.4       | 27.0        |

Bits-per-English-Byte (BPEB) and inference time T (milliseconds, averaged per FLORES 200 sentence) for three language groupings: all languages, Latin-script languages, and non-Latin-script languages. Inference was run on an A40 GPU core.

## Citation

```bibtex
@misc{limisiewicz2024myte,
      title={MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling},
      author={Tomasz Limisiewicz and Terra Blevins and Hila Gonen and Orevaoghene Ahia and Luke Zettlemoyer},
      year={2024},
      eprint={2403.10691},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Model Card Author

[Tomasz Limisiewicz](mailto:limisewicz@ufal.mff.cuni.cz)