language:
  - af
  - am
  - ar
  - az
  - be
  - bg
  - bn
  - ca
  - ceb
  - co
  - cs
  - cy
  - da
  - de
  - el
  - en
  - eo
  - es
  - et
  - eu
  - fa
  - fi
  - fil
  - fr
  - fy
  - ga
  - gd
  - gl
  - gu
  - ha
  - haw
  - he
  - hi
  - hmn
  - ht
  - hu
  - hy
  - id
  - ig
  - is
  - it
  - iw
  - ja
  - jv
  - ka
  - kk
  - km
  - kn
  - ko
  - ku
  - ky
  - la
  - lb
  - lo
  - lt
  - lv
  - mg
  - mi
  - mk
  - ml
  - mn
  - mr
  - ms
  - mt
  - my
  - ne
  - nl
  - 'no'
  - ny
  - pa
  - pl
  - ps
  - pt
  - ro
  - ru
  - sd
  - si
  - sk
  - sl
  - sm
  - sn
  - so
  - sq
  - sr
  - st
  - su
  - sv
  - sw
  - ta
  - te
  - tg
  - th
  - tr
  - uk
  - und
  - ur
  - uz
  - vi
  - xh
  - yi
  - yo
  - zh
  - zu
license: mit
datasets:
  - mc4

MyT5

Model Details

MyT5 (Myte T5) is a multilingual language model based on the T5 architecture. The model uses a morphologically driven byte (MYTE) representation described in our paper, Limisiewicz et al., 2024.

Model Description

  • Developed by: Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer
  • Funded by: University of Washington Fellowship, Charles University Grant Agency
  • Model type: T5
  • Language(s) (NLP): Multilingual
  • License: MIT

Model Sizes

Model Sources

How to Get Started with the Model

The snippet below shows the basic usage of the model for multilingual language modeling. The custom tokenizer is available in the GitHub repository, in src/myt5/myt5_tokenizer.py. We also plan to release it on HuggingFace in the future.

from transformers import T5ForConditionalGeneration
from src.myt5.myt5_tokenizer import MyT5Tokenizer
import torch

MODEL_SIZE = "large" # small, base, or large

model = T5ForConditionalGeneration.from_pretrained(f"Tomlim/MyT5_{MODEL_SIZE}", use_safetensors=True)
tokenizer = MyT5Tokenizer()

pre_texts = ['"We now have',
            '„Mamy teraz myszy w wieku',
            '"""எங்களிடம் இப்போது']
post_texts = ['4-month-old mice that are non-diabetic that used to be diabetic," he added.',
              '4 miesięcy, które miały cukrzycę, ale zostały z niej wyleczone” – dodał.',
              '4-மாத-வயதுடைய எலி ஒன்று உள்ளது, முன்னர் அதற்கு நீரிழிவு இருந்தது தற்போது இல்லை"" என்று அவர் மேலும் கூறினார்."']

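# Encode the prefixes (encoder input) and their continuations (labels).
inputs = tokenizer(pre_texts, padding="longest", return_tensors="pt")
targets = tokenizer(post_texts, padding="longest", return_tensors="pt")

# Teacher-forced forward pass: the logits score every target position.
outputs = model(**inputs, labels=targets.input_ids)
# Probability distribution over the vocabulary at each target position.
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)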
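
The last line of the snippet yields a probability distribution over the vocabulary at every target position. To score a continuation, one would typically read off the probability of each label and skip padding. The following is a minimal pure-Python sketch of that step on toy values; the helper names and the padding id of 0 are illustrative assumptions, not part of the MyT5 code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sequence_log_prob(logits, labels, pad_id=0):
    """Sum log p(label) over non-padding positions (natural log).

    logits: per-position logit lists, shape [seq_len][vocab_size]
    labels: target ids, length seq_len
    pad_id: id treated as padding (assumed to be 0 in this sketch)
    """
    total = 0.0
    for position_logits, label in zip(logits, labels):
        if label == pad_id:
            continue  # padded positions do not contribute to the score
        total += log_softmax(position_logits)[label]
    return total

# Toy example: vocabulary of 3 ids, two real target positions and one pad.
toy_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
toy_labels = [1, 2, 0]
toy_score = sequence_log_prob(toy_logits, toy_labels)
```

With uniform logits, each real position contributes log(1/3), and the padded position is ignored regardless of its logits.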

Training Details

Training Data

The model was trained on the standard T5 task of restoring corrupted spans, using the multilingual mC4 dataset.
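
In the standard T5 objective, contiguous spans of the input are replaced by sentinel tokens, and the target lists the dropped spans, each preceded by its sentinel. The sketch below illustrates the idea on characters with hand-picked spans. The function name, the `<extra_id_N>` sentinel strings (borrowed from the Hugging Face T5 tokenizer convention), and the fixed span positions are assumptions for illustration; actual pre-training samples spans randomly and operates on token ids.

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, end) span with a sentinel; return (inputs, targets).

    tokens: list of tokens (characters here; bytes or ids in practice)
    spans:  sorted, non-overlapping (start, end) index pairs, end exclusive
    """
    inputs, targets = [], []
    cursor = 0
    for n, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{n}>"
        inputs.extend(tokens[cursor:start])   # keep text before the span
        inputs.append(sentinel)               # mark where the span was
        targets.append(sentinel)              # announce the span in the target
        targets.extend(tokens[start:end])     # the content to be restored
        cursor = end
    inputs.extend(tokens[cursor:])
    # The target is closed with one final sentinel.
    targets.append(f"<extra_id_{len(spans)}>")
    return inputs, targets

tokens = list("We now have mice")
inputs, targets = corrupt_spans(tokens, [(3, 6), (12, 16)])
print("".join(inputs))
print("".join(targets))
```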

Preprocessing

Instead of UTF-8 bytes, we used a morphologically driven byte (MYTE) representation. See the description in our paper for more details.
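
For context, the sketch below measures only the UTF-8 baseline that MYTE replaces; it does not implement MYTE. It counts how many bytes a few short phrases occupy under plain UTF-8. The sample phrases are taken from the usage snippet; the helper name is an illustrative assumption.

```python
samples = {
    "English": "We now have",
    "Polish": "Mamy teraz myszy w wieku",
    "Tamil": "எங்களிடம் இப்போது",
}

def utf8_stats(text):
    """Return (characters, UTF-8 bytes, bytes per character)."""
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    return n_chars, n_bytes, n_bytes / n_chars

stats = {lang: utf8_stats(text) for lang, text in samples.items()}
for lang, (n_chars, n_bytes, ratio) in stats.items():
    print(f"{lang}: {n_chars} chars, {n_bytes} bytes, {ratio:.2f} bytes/char")
```

ASCII text takes one byte per character, whereas every Tamil code point takes three, so a byte-level model sees much longer sequences for the same amount of Tamil text.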

Training Hyperparameters

We used the same hyperparameters as in the original ByT5 paper. The only difference is that we decreased the number of training steps to 250,000 to avoid overfitting.

Computational Infrastructure

Models were trained on TPUs available through the TPU Research Cloud (TRC). We used a v3-8 TPU for training the small and base models and a v3-32 TPU for the large model. Training each model took:

  • Small: 90h
  • Base: 230h
  • Large: 190h
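
The wall-clock times above are not directly comparable because the large model ran on a bigger slice. A rough way to compare them is in TPU core-hours, assuming that v3-8 and v3-32 denote slices with 8 and 32 TPU cores respectively (the usual Cloud TPU naming):

```python
# (TPU cores in the slice, wall-clock hours) per model size, from the list above.
runs = {"small": (8, 90), "base": (8, 230), "large": (32, 190)}

# Core-hours = cores * hours; a coarse proxy for total compute.
core_hours = {size: cores * hours for size, (cores, hours) in runs.items()}

for size, value in core_hours.items():
    print(f"{size}: {value} TPU core-hours")
```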

Evaluation

MyT5 models are compared with a reimplementation of ByT5 models trained for 250,000 steps.

Language Modeling

We evaluated LM performance on the multi-parallel FLORES 200 corpus. To compare the scores across languages and models, we used a normalized metric, Bit-per-English-Byte (BPEB).
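
As the name suggests, BPEB can be read as the number of bits the model needs to encode a sentence in any language, divided by the UTF-8 byte length of the parallel English sentence. The sketch below follows that reading; the function name and input format are assumptions, and the paper remains the reference for the exact definition.

```python
import math

def bpeb(token_log_probs, english_sentence):
    """Bits needed to encode a sentence, per UTF-8 byte of its English parallel.

    token_log_probs:  natural-log probabilities the model assigns to each
                      target token of the sentence (in any language)
    english_sentence: the parallel English sentence used for normalization
    """
    total_bits = -sum(token_log_probs) / math.log(2)  # convert nats to bits
    n_english_bytes = len(english_sentence.encode("utf-8"))
    return total_bits / n_english_bytes

# Toy check: 16 tokens, each predicted with probability 1/2, cost 16 bits.
toy_log_probs = [math.log(0.5)] * 16
toy_score = bpeb(toy_log_probs, "We now have mice")  # 16 ASCII bytes
```

Because the denominator is always the English byte count, scores for different languages and different segmentations are expressed in the same unit.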

Results

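| Model size | Languages | ByT5 BPEB | ByT5 T (ms) | MyT5 BPEB | MyT5 T (ms) |
|---|---|---|---|---|---|
| small | All | 10.1 | 7.0 | 4.6 | 6.7 |
| small | Latin | 4.6 | 5.9 | 4.2 | 6.6 |
| small | Non Latin | 18.1 | 8.5 | 5.1 | 6.8 |
| base | All | 8.2 | 11.5 | 5.8 | 8.9 |
| base | Latin | 4.9 | 9.4 | 5.0 | 8.7 |
| base | Non Latin | 13.0 | 14.6 | 6.9 | 9.1 |
| large | All | 13.4 | 31.8 | 4.6 | 26.7 |
| large | Latin | 10.1 | 28.1 | 4.0 | 26.6 |
| large | Non Latin | 18.2 | 37.3 | 5.4 | 27.0 |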

Bit-per-English-Byte (BPEB) and inference times (average per FLORES 200 sentence), averaged over three language groupings. Inference was run on an A40 GPU core.
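
Two quantities are easy to derive from these results: the gap between Non Latin and Latin BPEB for each model, and the inference speedup of MyT5 over ByT5. The sketch below computes both from the reported numbers; the helper names are illustrative and the ratios are simple arithmetic on the table values, not additional measurements.

```python
# (BPEB, inference time in ms), copied from the results above.
results = {
    ("small", "Latin"): {"ByT5": (4.6, 5.9), "MyT5": (4.2, 6.6)},
    ("small", "Non Latin"): {"ByT5": (18.1, 8.5), "MyT5": (5.1, 6.8)},
    ("base", "Latin"): {"ByT5": (4.9, 9.4), "MyT5": (5.0, 8.7)},
    ("base", "Non Latin"): {"ByT5": (13.0, 14.6), "MyT5": (6.9, 9.1)},
    ("large", "Latin"): {"ByT5": (10.1, 28.1), "MyT5": (4.0, 26.6)},
    ("large", "Non Latin"): {"ByT5": (18.2, 37.3), "MyT5": (5.4, 27.0)},
}

def script_gap(size, model):
    """Non Latin BPEB divided by Latin BPEB (1.0 means no disparity)."""
    return results[(size, "Non Latin")][model][0] / results[(size, "Latin")][model][0]

def speedup(size, group):
    """ByT5 inference time divided by MyT5 inference time (>1 means MyT5 is faster)."""
    row = results[(size, group)]
    return row["ByT5"][1] / row["MyT5"][1]

for size in ("small", "base", "large"):
    print(
        f"{size}: gap ByT5 {script_gap(size, 'ByT5'):.2f}, "
        f"gap MyT5 {script_gap(size, 'MyT5'):.2f}, "
        f"Non Latin speedup {speedup(size, 'Non Latin'):.2f}x"
    )
```

For the small models, for instance, the Non Latin to Latin BPEB ratio is roughly 3.9 for ByT5 and roughly 1.2 for MyT5.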

Downstream Tasks

We tested the large model on four end tasks: question answering, NER, semantic parsing, and machine translation. The test data come from the XTREME-UP benchmark (Ruder, Clark et al., 2023), which covers mainly low-resource languages.

Fine-tuning

In each task, we fine-tuned on all languages jointly. We used a learning rate of 1e-3 with square root decay and a dropout of 0.1. The batch size and number of training steps varied across tasks:

  • NER: 128 examples per batch, 6000 steps
  • QA: 64 examples per batch, 6500 steps
  • Semantic Parsing: 64 examples per batch, 1000 steps
  • MT: 64 examples per batch, 10000 steps
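
The settings above can be collected into a small configuration, together with one common form of square root decay. This is a sketch only: the card does not state the exact schedule or any warmup length, so the inverse-square-root form and the `warmup_steps` value below are assumptions, as are the helper names.

```python
import math

# (batch size, fine-tuning steps) per task, from the list above.
FINETUNE = {
    "NER": (128, 6000),
    "QA": (64, 6500),
    "Semantic Parsing": (64, 1000),
    "MT": (64, 10000),
}

def examples_seen(task):
    """Total number of training examples processed during fine-tuning."""
    batch_size, steps = FINETUNE[task]
    return batch_size * steps

def sqrt_decay_lr(step, peak_lr=1e-3, warmup_steps=1000):
    """Inverse-square-root decay.

    The rate stays at peak_lr for the first warmup_steps steps and then
    decays proportionally to 1/sqrt(step). warmup_steps is hypothetical.
    """
    return peak_lr * math.sqrt(warmup_steps / max(step, warmup_steps))

for task in FINETUNE:
    _, steps = FINETUNE[task]
    print(f"{task}: {examples_seen(task)} examples, final lr {sqrt_decay_lr(steps):.2e}")
```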

Results

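| Model | QA (F1) | NER (F1) | Semantic Parsing (EM) | MT (chrF) |
|---|---|---|---|---|
| Flan-PaLM* | 22.9 | 12.0 | 0.1 | --- |
| mT5* | 59.7 | 74.0 | 21.8 | --- |
| ByT5 | 73.2 | 81.5 | 25.1 | 20.1 |
| MyT5 | 75.3 | 80.8 | 19.6 | 20.4 |

Inference times per example (ms)

| Model | QA | NER | Semantic Parsing | MT |
|---|---|---|---|---|
| ByT5 | 36.2 | 13.8 | 13.2 | 15.9 |
| MyT5 | 35.6 | 12.6 | 12.4 | 12.6 |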

Average results on XTREME-UP tasks across low-resource languages. The baseline results for mT5 and Flan-PaLM (in-context-learning evaluation), marked with *, are reported in Ruder, Clark et al., 2023. The reported inference time is an average across evaluation examples; inference was run on an A40 GPU core.

Citation

@misc{limisiewicz2024myte,
      title={MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling}, 
      author={Tomasz Limisiewicz and Terra Blevins and Hila Gonen and Orevaoghene Ahia and Luke Zettlemoyer},
      year={2024},
      eprint={2403.10691},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Model Card Author

Tomasz Limisiewicz