---
language:
- af
- am
- ar
- az
- be
- bg
- bn
- ca
- ceb
- co
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fil
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- haw
- he
- hi
- hmn
- ht
- hu
- hy
- id
- ig
- is
- it
- iw
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lb
- lo
- lt
- lv
- mg
- mi
- mk
- ml
- mn
- mr
- ms
- mt
- my
- ne
- nl
- 'no'
- ny
- pa
- pl
- ps
- pt
- ro
- ru
- sd
- si
- sk
- sl
- sm
- sn
- so
- sq
- sr
- st
- su
- sv
- sw
- ta
- te
- tg
- th
- tr
- uk
- und
- ur
- uz
- vi
- xh
- yi
- yo
- zh
- zu
license: mit
datasets:
- mc4
---

# MyT5



## Model Details

MyT5 (**My**te **T5**) is a multilingual language model based on the T5 architecture.
The model uses a **m**orphologically-driven **byte** (**MYTE**) representation described in our paper [Limisiewicz et al., 2024](https://arxiv.org/pdf/2403.10691.pdf).

### Model Description


- **Developed by:** Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer
- **Funded by:** University of Washington Fellowship, Charles University Grant Agency
- **Model type:** T5
- **Language(s) (NLP):** Multilingual
- **License:** MIT

### Model Sizes

- **[Small](https://huggingface.co/Tomlim/myt5-small)**: 300M parameters
- **[Base](https://huggingface.co/Tomlim/myt5-base)**: 582M parameters
- **[Large](https://huggingface.co/Tomlim/myt5-large)**: 1.2B parameters
  
### Model Sources 


- **[Repository](https://github.com/tomlimi/MYTE)** 
- **[Paper](https://arxiv.org/pdf/2403.10691.pdf)** 

## How to Get Started with the Model

The snippet below shows basic usage of the model for multilingual language modeling.
The custom tokenizer is available in the [GitHub](https://github.com/tomlimi/MYTE) repository, in `src/myt5/myt5_tokenizer.py`.
We also plan to release it on HuggingFace in the future.

```python
from transformers import T5ForConditionalGeneration
from src.myt5.myt5_tokenizer import MyT5Tokenizer
import torch

MODEL_SIZE = "large" # small, base, or large

# Repository IDs follow the links in "Model Sizes", e.g. Tomlim/myt5-large
model = T5ForConditionalGeneration.from_pretrained(f"Tomlim/myt5-{MODEL_SIZE}", use_safetensors=True)
tokenizer = MyT5Tokenizer()

pre_texts = ['"We now have',
            '„Mamy teraz myszy w wieku',
            '"எங்களிடம் இப்போது']
post_texts = ['4-month-old mice that are non-diabetic that used to be diabetic," he added.',
              '4 miesięcy, które miały cukrzycę, ale zostały z niej wyleczone” – dodał.',
              '4-மாத-வயதுடைய எலி ஒன்று உள்ளது, முன்னர் அதற்கு நீரிழிவு இருந்தது தற்போது இல்லை" என்று அவர் மேலும் கூறினார்.']

inputs = tokenizer(pre_texts, padding="longest", return_tensors="pt")
targets = tokenizer(post_texts, padding="longest", return_tensors="pt")


outputs = model(**inputs, labels=targets.input_ids)
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
```

## Training Details

### Training Data

The model was trained on the standard T5 task of restoring corrupted spans, using the multilingual mC4 dataset.
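
As an illustration of the span-restoration objective, the sketch below masks hand-picked spans with T5-style sentinel markers. It is a toy example on whitespace-separated words, not the actual pre-training pipeline: the function name `corrupt_spans` and the fixed spans are assumptions made for illustration, whereas the real task operates on byte-level sequences and samples the spans randomly.

```python
def corrupt_spans(tokens, spans):
    """Replace each span with a sentinel and build the matching target.

    `spans` is a sorted list of non-overlapping (start, end) index pairs,
    with `end` exclusive. Hypothetical helper, for illustration only.
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[cursor:start])  # keep the uncorrupted prefix
        inputs.append(sentinel)              # mask the span in the input
        targets.append(sentinel)             # announce the span in the target
        targets.extend(tokens[start:end])    # the model must restore this
        cursor = end
    inputs.extend(tokens[cursor:])
    # T5 convention: a final sentinel closes the target sequence.
    targets.append(f"<extra_id_{len(spans)}>")
    return inputs, targets


tokens = "Thank you for inviting me to your party last week".split()
inputs, targets = corrupt_spans(tokens, [(2, 4), (8, 9)])
print(" ".join(inputs))
print(" ".join(targets))
```

The encoder receives the first sequence and the decoder is trained to produce the second one.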

### Preprocessing

Instead of UTF-8 bytes, we used a morphologically-driven byte (MYTE) representation.
See the description in our [paper](https://arxiv.org/pdf/2403.10691.pdf) for more details.
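
To give a sense of the UTF-8 baseline that MYTE replaces, the sketch below counts how many UTF-8 bytes each character needs in three short phrases taken from the usage example. It does not implement MYTE itself; the helper name `utf8_bytes_per_char` is an assumption made for illustration.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point."""
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


samples = {
    "en": "We now have",
    "pl": "Mamy teraz myszy w wieku",
    "ta": "எங்களிடம் இப்போது",
}

for lang, text in samples.items():
    n_bytes = len(text.encode("utf-8"))
    print(f"{lang}: {n_bytes} bytes, {utf8_bytes_per_char(text):.2f} bytes/char")
```

ASCII-only text uses one byte per character, while Tamil script needs three bytes per code point, so a plain byte-level model sees much longer sequences for the same content.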


### Training Hyperparameters

We used the same hyperparameters as in the original ByT5 paper.
The only difference is that we decreased the number of training steps to 250,000 to avoid overfitting.

### Computational Infrastructure

Models were trained on TPUs available through the TPU Research Cloud (TRC).
We used a v3-8 TPU to train the small and base models and a v3-32 TPU for the large model.
Training each model took:

- **Small**: 90h
- **Base**: 230h
- **Large**: 190h

# Evaluation


MyT5 models are compared with a reimplementation of [ByT5](https://huggingface.co/docs/transformers/model_doc/byt5) models trained for 250,000 steps.

## Language Modeling

We evaluated LM performance on the multi-parallel [FLORES 200](https://arxiv.org/pdf/2207.04672v3.pdf) corpus.
To compare scores across languages and models, we used a normalized metric, Bit-per-English-Byte (BPEB).
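
The normalization can be sketched as follows, assuming BPEB is the negative log-likelihood of a sentence, expressed in bits, divided by the UTF-8 byte length of its parallel English sentence (see the paper for the exact definition). The function name and the toy probabilities are hypothetical.

```python
import math


def bits_per_english_byte(token_log_probs, english_sentence):
    """Normalize a sentence's negative log-likelihood by English byte length.

    `token_log_probs` holds natural-log probabilities that the model assigns
    to each target token of the sentence in any language.
    """
    total_bits = -sum(token_log_probs) / math.log(2)  # nats -> bits
    n_english_bytes = len(english_sentence.encode("utf-8"))
    return total_bits / n_english_bytes


# Toy case: 8 tokens, each predicted with probability 0.5 (1 bit per token),
# scored against a 4-byte English reference.
score = bits_per_english_byte([math.log(0.5)] * 8, "mice")
print(score)
```

Because the denominator is always the English byte count, scores for different languages and segmentations of the same parallel sentence stay comparable.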

### Results

| Size  | Languages | ByT5 BPEB | ByT5 T (ms) | MyT5 BPEB | MyT5 T (ms) |
|-------|-----------|-----------|-------------|-----------|-------------|
| small | All       | 10.1 | 7.0    | 4.6  | 6.7    |
|       | Latin     | 4.6  | 5.9    | 4.2  | 6.6    |
|       | Non Latin | 18.1 | 8.5    | 5.1  | 6.8    |
| base  | All       | 8.2  | 11.5   | 5.8  | 8.9    |
|       | Latin     | 4.9  | 9.4    | 5.0  | 8.7    |
|       | Non Latin | 13.0 | 14.6   | 6.9  | 9.1    |
| large | All       | 13.4 | 31.8   | 4.6  | 26.7   |
|       | Latin     | 10.1 | 28.1   | 4.0  | 26.6   |
|       | Non Latin | 18.2 | 37.3   | 5.4  | 27.0   |

Bit-per-English-Byte (BPEB) and inference times (T, average per FLORES 200 sentence), averaged over three language groupings.
The inference was run on an A40 GPU core.

## Downstream Tasks

We tested the large model on four downstream tasks: question answering, NER, semantic parsing, and machine translation.
The test data comes from the XTREME-UP benchmark ([Ruder, Clark et al., 2023](https://arxiv.org/pdf/2305.11938.pdf)), which covers mainly low-resource languages.

### Fine-tuning

For each task, we fine-tuned on all languages jointly.
We used a learning rate of 1e-3 with square root decay and a dropout of 0.1.
The batch size and number of training steps varied across tasks:

- **NER**: 128 examples per batch, 6000 steps
- **QA**: 64 examples per batch, 6500 steps
- **Semantic Parsing**: 64 examples per batch, 1000 steps
- **MT**: 64 examples per batch, 10000 steps
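
The settings above can be collected into a small configuration sketch. The per-task values come from the list; the schedule is one common reading of "square root decay" (inverse square root after a constant phase), and the `WARMUP_STEPS` value and all names are assumptions, not details stated in this card.

```python
import math

BASE_LR = 1e-3
DROPOUT = 0.1
WARMUP_STEPS = 1000  # hypothetical value, not stated in the model card

# Batch size and number of fine-tuning steps per task.
FINETUNE_CONFIG = {
    "NER": {"batch_size": 128, "steps": 6000},
    "QA": {"batch_size": 64, "steps": 6500},
    "Semantic Parsing": {"batch_size": 64, "steps": 1000},
    "MT": {"batch_size": 64, "steps": 10000},
}


def sqrt_decay_lr(step, base_lr=BASE_LR, warmup_steps=WARMUP_STEPS):
    """Constant rate for `warmup_steps`, then decay proportional to 1/sqrt(step)."""
    return base_lr * math.sqrt(warmup_steps / max(step, warmup_steps))


for task, cfg in FINETUNE_CONFIG.items():
    final_lr = sqrt_decay_lr(cfg["steps"])
    print(f"{task}: batch={cfg['batch_size']}, steps={cfg['steps']}, final lr={final_lr:.2e}")
```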


### Results

| Task       | QA (F1) | NER (F1) | Semantic Parsing (EM) | MT (chrF) |
|------------|---------|----------|-----------------------|-----------|
| Flan-PaLM* | 22.9    | 12.0     | 0.1                   | ---       |
| mT5*       | 59.7    | 74.0     | 21.8                  | ---       |
| ByT5       | 73.2    | 81.5     | 25.1                  | 20.1      |
| MyT5       | 75.3    | 80.8     | 19.6                  | 20.4      |
| **Inference time per example (ms)** | | | | |
| ByT5       | 36.2    | 13.8     | 13.2                  | 15.9      |
| MyT5       | 35.6    | 12.6     | 12.4                  | 12.6      |

Average results on XTREME-UP tasks across low-resource languages.
The baseline results of mT5 and Flan-PaLM (in-context-learning evaluation), marked with *, are reported in [Ruder, Clark et al., 2023](https://arxiv.org/pdf/2305.11938.pdf).
The reported inference time is an average across evaluation examples; the inference was run on an A40 GPU core.

## Citation

```bibtex
@misc{limisiewicz2024myte,
      title={MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling}, 
      author={Tomasz Limisiewicz and Terra Blevins and Hila Gonen and Orevaoghene Ahia and Luke Zettlemoyer},
      year={2024},
      eprint={2403.10691},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```


## Model Card Author

[Tomasz Limisiewicz](mailto:limisewicz@ufal.mff.cuni.cz)