metadata

language:
  - af
  - am
  - ar
  - az
  - be
  - bg
  - bn
  - ca
  - ceb
  - co
  - cs
  - cy
  - da
  - de
  - el
  - en
  - eo
  - es
  - et
  - eu
  - fa
  - fi
  - fil
  - fr
  - fy
  - ga
  - gd
  - gl
  - gu
  - ha
  - haw
  - he
  - hi
  - hmn
  - ht
  - hu
  - hy
  - id
  - ig
  - is
  - it
  - iw
  - ja
  - jv
  - ka
  - kk
  - km
  - kn
  - ko
  - ku
  - ky
  - la
  - lb
  - lo
  - lt
  - lv
  - mg
  - mi
  - mk
  - ml
  - mn
  - mr
  - ms
  - mt
  - my
  - ne
  - nl
  - 'no'
  - ny
  - pa
  - pl
  - ps
  - pt
  - ro
  - ru
  - sd
  - si
  - sk
  - sl
  - sm
  - sn
  - so
  - sq
  - sr
  - st
  - su
  - sv
  - sw
  - ta
  - te
  - tg
  - th
  - tr
  - uk
  - und
  - ur
  - uz
  - vi
  - xh
  - yi
  - yo
  - zh
  - zu
license: mit
datasets:
  - mc4

MyT5

Model Details

MyT5 (Myte T5) is a multilingual language model based on the T5 architecture. The model uses a morphologically-driven byte (MYTE) representation described in our paper (Limisiewicz et al., 2024).

Model Description

Developed by: Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer
Funded by: University of Washington Fellowship, Charles University Grant Agency
Model type: T5
Language(s) (NLP): Multilingual
License: MIT

Model Sizes

Small: 300M parameters
Base: 582M parameters
Large: 1.2B parameters

Model Sources

Repository
Paper

How to Get Started with the Model

The snippet below shows basic usage of the model for multilingual language modeling. The custom tokenizer is available in the GitHub repository, in src/myt5/myt5_tokenizer.py. We also plan to release it on Hugging Face in the future.

import torch
from transformers import T5ForConditionalGeneration

# The custom tokenizer lives in the GitHub repository (src/myt5/myt5_tokenizer.py)
from src.myt5.myt5_tokenizer import MyT5Tokenizer

MODEL_SIZE = "large"  # small, base, or large

model = T5ForConditionalGeneration.from_pretrained(f"Tomlim/MyT5_{MODEL_SIZE}", use_safetensors=True)
tokenizer = MyT5Tokenizer()

# Sentence prefixes (encoder input) in English, Polish, and Tamil
pre_texts = ['"We now have',
             '„Mamy teraz myszy w wieku',
             '"""எங்களிடம் இப்போது']
# Matching continuations (decoder targets)
post_texts = ['4-month-old mice that are non-diabetic that used to be diabetic," he added.',
              '4 miesięcy, które miały cukrzycę, ale zostały z niej wyleczone” – dodał.',
              '4-மாத-வயதுடைய எலி ஒன்று உள்ளது, முன்னர் அதற்கு நீரிழிவு இருந்தது தற்போது இல்லை"" என்று அவர் மேலும் கூறினார்."']

# Pad each batch to its longest sequence and return PyTorch tensors
inputs = tokenizer(pre_texts, padding="longest", return_tensors="pt")
targets = tokenizer(post_texts, padding="longest", return_tensors="pt")

# Teacher-forced forward pass: one row of logits per target position
outputs = model(**inputs, labels=targets.input_ids)
# Turn the logits into a probability distribution over the vocabulary
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)

Training Details

Training Data

The model was trained on the standard T5 task of restoring corrupted spans, using the multilingual mC4 dataset.
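As a rough illustration of the span-corruption objective, the sketch below replaces given spans of a token sequence with sentinel tokens and builds the matching target. The function name corrupt_spans, the word-level tokens, and the hand-picked spans are assumptions made for readability; actual training samples spans randomly over byte-level sequences, and the <extra_id_N> sentinel naming follows the convention of T5 tokenizers in transformers.

```python
from typing import List, Tuple


def corrupt_spans(tokens: List[str], spans: List[Tuple[int, int]]) -> Tuple[List[str], List[str]]:
    """Replace each [start, end) span with a sentinel; return (inputs, targets)."""
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[cursor:start])  # keep the uncorrupted text
        inputs.append(sentinel)              # one sentinel stands in for the whole span
        targets.append(sentinel)             # the target repeats the sentinel ...
        targets.extend(tokens[start:end])    # ... followed by the tokens it hides
        cursor = end
    inputs.extend(tokens[cursor:])           # remaining uncorrupted tail
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel ends the target
    return inputs, targets


# Hypothetical example: word-level tokens with two non-overlapping spans
tokens = "We now have 4-month-old mice that are non-diabetic".split()
inputs, targets = corrupt_spans(tokens, [(1, 3), (5, 7)])
print(" ".join(inputs))
print(" ".join(targets))
```

The model sees the corrupted input and is trained to produce the target, that is, the hidden spans in order, each introduced by its sentinel.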

Preprocessing

Instead of UTF-8 bytes, we used a morphologically-driven byte (MYTE) representation. See the description in our paper for more details.
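To see why plain UTF-8 bytes treat scripts unequally, the sketch below counts characters and UTF-8 bytes for three short words taken from the usage example in this card. It relies only on Python's built-in str.encode and does not reproduce the MYTE encoding itself, which is defined in the paper and the repository; the helper name utf8_stats is an assumption.

```python
from typing import Dict, Tuple


def utf8_stats(text: str) -> Tuple[int, int]:
    """Return (number of Unicode code points, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


# One word per language: English, Polish, Tamil
samples: Dict[str, str] = {"en": "added", "pl": "dodał", "ta": "எலி"}
stats = {lang: utf8_stats(word) for lang, word in samples.items()}

for lang, (n_chars, n_bytes) in stats.items():
    print(f"{lang}: {n_chars} code points, {n_bytes} bytes")
```

ASCII letters take one byte each, the Polish letter ł takes two, and each Tamil code point takes three, so a byte-level model processes much longer sequences for some scripts than for others.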

Training Hyperparameters

We used the same hyperparameters as in the original ByT5 paper. The only difference is that we decreased the number of training steps to 250,000 to avoid overfitting.

Computational Infrastructure

Models were trained on TPUs available through the TPU Research Cloud (TRC). We used a v3-8 TPU for training the small and base models and a v3-32 TPU for the large model. Training took:

Small: 90h
Base: 230h
Large: 190h
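The wall-clock times above can be turned into a rough compute estimate. The sketch below assumes that a v3-8 slice has 8 TPU cores and a v3-32 slice has 32, and that the whole slice was busy for the full duration, so the totals are only an approximation; the variable names are illustrative.

```python
# Assumed number of TPU cores per slice type
TPU_CORES = {"v3-8": 8, "v3-32": 32}

# (slice type, wall-clock hours) for each model size, as listed above
TRAINING_RUNS = {
    "small": ("v3-8", 90),
    "base": ("v3-8", 230),
    "large": ("v3-32", 190),
}

# Core-hours = cores in the slice * wall-clock hours
core_hours = {
    size: TPU_CORES[slice_name] * hours
    for size, (slice_name, hours) in TRAINING_RUNS.items()
}

for size, total in core_hours.items():
    print(f"{size}: {total} TPU core-hours")
```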

Evaluation

MyT5 models are compared with a reimplementation of ByT5 models trained for 250,000 steps.

Language Modeling

We evaluated LM performance on the multi-parallel FLORES 200 corpus. To compare scores across languages and models, we used a normalized metric, Bit-per-English-Byte (BPEB).
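A minimal sketch of how such a normalized score could be computed, assuming BPEB is the total negative log2-likelihood of a sentence divided by the number of UTF-8 bytes in its English parallel sentence (see the paper for the exact definition). The function name and the toy log-probabilities are hypothetical.

```python
import math
from typing import Sequence


def bits_per_english_byte(token_log_probs: Sequence[float], english_sentence: str) -> float:
    """Normalize a sentence's information content by its English byte length.

    token_log_probs: natural-log probabilities the model assigns to each
    target token of the sentence (in any language).
    english_sentence: the parallel English sentence used for normalization.
    """
    total_bits = -sum(token_log_probs) / math.log(2)  # convert nats to bits
    n_english_bytes = len(english_sentence.encode("utf-8"))
    return total_bits / n_english_bytes


# Toy example: 8 tokens, each predicted with probability 0.5 (1 bit per token),
# normalized by a 4-byte English string
log_probs = [math.log(0.5)] * 8
score = bits_per_english_byte(log_probs, "mice")
print(f"BPEB = {score:.2f}")
```

Because the denominator is always the English byte count, the score does not depend on how many bytes or tokens the evaluated language needs for the same content.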

Results

Size	Script	ByT5 BPEB	ByT5 T (ms)	MyT5 BPEB	MyT5 T (ms)
small	All	10.1	7.0	4.6	6.7
small	Latin	4.6	5.9	4.2	6.6
small	Non-Latin	18.1	8.5	5.1	6.8
base	All	8.2	11.5	5.8	8.9
base	Latin	4.9	9.4	5.0	8.7
base	Non-Latin	13.0	14.6	6.9	9.1
large	All	13.4	31.8	4.6	26.7
large	Latin	10.1	28.1	4.0	26.6
large	Non-Latin	18.2	37.3	5.4	27.0

Bit-per-English-Byte (BPEB) and inference time T (per FLORES 200 sentence), averaged over three language groupings. Inference was run on an A40 GPU core.
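The gap between the two models is easier to read as a ratio. The sketch below copies the "All" rows from the table and divides the ByT5 value by the MyT5 value, so a ratio above 1 means MyT5 has the lower (better) score; the dictionary layout and names are illustrative.

```python
# "All" rows from the table: (BPEB, inference time in ms) per model
RESULTS_ALL = {
    "small": {"ByT5": (10.1, 7.0), "MyT5": (4.6, 6.7)},
    "base": {"ByT5": (8.2, 11.5), "MyT5": (5.8, 8.9)},
    "large": {"ByT5": (13.4, 31.8), "MyT5": (4.6, 26.7)},
}

ratios = {}
for size, row in RESULTS_ALL.items():
    byt5_bpeb, byt5_time = row["ByT5"]
    myt5_bpeb, myt5_time = row["MyT5"]
    # ByT5 / MyT5: values above 1 favor MyT5 (lower BPEB, faster inference)
    ratios[size] = (byt5_bpeb / myt5_bpeb, byt5_time / myt5_time)

for size, (bpeb_ratio, time_ratio) in ratios.items():
    print(f"{size}: BPEB ratio {bpeb_ratio:.2f}, time ratio {time_ratio:.2f}")
```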

Downstream Tasks

We tested the large model on four end tasks: question answering, NER, semantic parsing, and machine translation. The test data come from the XTREME-UP benchmark (Ruder, Clark et al., 2023), which covers mainly low-resource languages.

Fine-tuning

For each task, we fine-tuned on all languages jointly. We used a learning rate of 1e-3 with square root decay and a dropout of 0.1. The batch size and the number of training steps varied across tasks:

NER: 128 examples per batch, 6000 steps
QA: 64 examples per batch, 6500 steps
Semantic Parsing: 64 examples per batch, 1000 steps
MT: 64 examples per batch, 10000 steps
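The settings above can be collected into a small configuration. The sketch below uses a hypothetical layout (the dictionary and field names are not taken from the authors' code) and derives how many training examples each task processes, counting repeats, as batch size times number of steps.

```python
# Settings shared by all four tasks
SHARED = {"learning_rate": 1e-3, "lr_decay": "square root", "dropout": 0.1}

# Task-specific batch size and number of fine-tuning steps
TASKS = {
    "NER": {"batch_size": 128, "steps": 6000},
    "QA": {"batch_size": 64, "steps": 6500},
    "Semantic Parsing": {"batch_size": 64, "steps": 1000},
    "MT": {"batch_size": 64, "steps": 10000},
}

# Merge shared and task-specific settings into one config per task
configs = {task: {**SHARED, **cfg} for task, cfg in TASKS.items()}

# Examples processed during fine-tuning = batch size * steps
examples_seen = {
    task: cfg["batch_size"] * cfg["steps"] for task, cfg in TASKS.items()
}

for task, total in examples_seen.items():
    print(f"{task}: {total:,} examples")
```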

Results

Task	QA (F1)	NER (F1)	Semantic Parsing (EM)	MT (chrF)
Flan-PaLM*	22.9	12.0	0.1	---
mT5*	59.7	74.0	21.8	---
ByT5	73.2	81.5	25.1	20.1
MyT5	75.3	80.8	19.6	20.4
Inference Times per example (ms)
ByT5	36.2	13.8	13.2	15.9
MyT5	35.6	12.6	12.4	12.6

Average results on XTREME-UP tasks across low-resource languages. The baseline results of mT5 and Flan-PaLM (in-context-learning evaluation), marked with *, are reported in Ruder, Clark et al., 2023. The reported inference time is an average across evaluation examples; inference was run on an A40 GPU core.

Citation

@misc{limisiewicz2024myte,
      title={MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling}, 
      author={Tomasz Limisiewicz and Terra Blevins and Hila Gonen and Orevaoghene Ahia and Luke Zettlemoyer},
      year={2024},
      eprint={2403.10691},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Model Card Author

Tomasz Limisiewicz