---
language:
- af
- am
- ar
- az
- be
- bg
- bn
- ca
- ceb
- co
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fil
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- haw
- he
- hi
- hmn
- ht
- hu
- hy
- id
- ig
- is
- it
- iw
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lb
- lo
- lt
- lv
- mg
- mi
- mk
- ml
- mn
- mr
- ms
- mt
- my
- ne
- nl
- 'no'
- ny
- pa
- pl
- ps
- pt
- ro
- ru
- sd
- si
- sk
- sl
- sm
- sn
- so
- sq
- sr
- st
- su
- sv
- sw
- ta
- te
- tg
- th
- tr
- uk
- und
- ur
- uz
- vi
- xh
- yi
- yo
- zh
- zu
license: mit
datasets:
- mc4
---
# MyT5
## Model Details
MyT5 (**My**te **T5**) is a multilingual language model based on the T5 architecture.
The model uses a **m**orphologically-driven **byte** (**MYTE**) representation described in our paper [Limisiewicz et al., 2024](https://arxiv.org/pdf/2403.10691.pdf).
### Model Description
- **Developed by:** Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer
- **Funded by:** University of Washington Fellowship, Charles University Grant Agency
- **Model type:** T5
- **Language(s) (NLP):** Multilingual
- **License:** MIT
### Model Sizes
- **[Small](https://huggingface.co/Tomlim/myt5-small)**: 300M parameters
- **[Base](https://huggingface.co/Tomlim/myt5-base)**: 582M parameters
- **[Large](https://huggingface.co/Tomlim/myt5-large)**: 1.2B parameters
### Model Sources
- **[Repository](https://github.com/tomlimi/MYTE)**
- **[Paper](https://arxiv.org/pdf/2403.10691.pdf)**
## How to Get Started with the Model
The snippet below shows the basic usage of the model for multilingual language modeling.
A custom tokenizer is available in the [GitHub repository](https://github.com/tomlimi/MYTE), in `src/myt5/myt5_tokenizer.py`.
We also plan to release it on Hugging Face in the future.
```python
from transformers import T5ForConditionalGeneration
import torch

# The custom tokenizer ships with the MYTE repository (src/myt5/myt5_tokenizer.py);
# this import assumes the repository root is on your PYTHONPATH.
from src.myt5.myt5_tokenizer import MyT5Tokenizer

MODEL_SIZE = "large"  # small, base, or large

# Repository names follow the links in the "Model Sizes" section above.
model = T5ForConditionalGeneration.from_pretrained(f"Tomlim/myt5-{MODEL_SIZE}", use_safetensors=True)
tokenizer = MyT5Tokenizer()

pre_texts = ['"We now have',
             '„Mamy teraz myszy w wieku',
             '"""எங்களிடம் இப்போது']
post_texts = ['4-month-old mice that are non-diabetic that used to be diabetic," he added.',
              '4 miesięcy, które miały cukrzycę, ale zostały z niej wyleczone” – dodał.',
              '4-மாத-வயதுடைய எலி ஒன்று உள்ளது, முன்னர் அதற்கு நீரிழிவு இருந்தது தற்போது இல்லை"" என்று அவர் மேலும் கூறினார்."']

# Encode prefixes and continuations, then score the continuations with the model.
inputs = tokenizer(pre_texts, padding="longest", return_tensors="pt")
targets = tokenizer(post_texts, padding="longest", return_tensors="pt")

outputs = model(**inputs, labels=targets.input_ids)
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
```
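The same model can also be used for generation. A minimal sketch, assuming the custom `MyT5Tokenizer` implements the standard Hugging Face decoding interface (`batch_decode`):

```python
# Sketch only: continue the prompts with greedy decoding and turn the generated
# MYTE ids back into text. Assumes tokenizer.batch_decode is available.
generated = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```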
## Training Details
### Training Data
The model was trained on the standard T5 task of restoring corrupted spans in the multilingual MC4 dataset.
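Schematically, span corruption replaces random spans of the input with sentinel markers and trains the model to reconstruct them. The sketch below is purely illustrative (the example follows the original T5 paper, and the sentinels are shown symbolically, not as the model's actual ids):

```python
# Illustrative only: T5-style span corruption at the text level.
original        = "Thank you for inviting me to your party last week."
corrupted_input = "Thank you <X> me to your party <Y> week."
target          = "<X> for inviting <Y> last <Z>"
```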
### Preprocessing
Instead of UTF-8 bytes, we used a morphologically-driven byte (MYTE) representation.
See the description in our [paper](https://arxiv.org/pdf/2403.10691.pdf) for more details.
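As a rough illustration, and assuming the tokenizer loaded in the snippet above follows the standard Hugging Face call interface, you can compare the length of the raw UTF-8 byte sequence with the MYTE id sequence the tokenizer produces:

```python
# Sketch only: MYTE aims to encode text into fewer, more equitably distributed
# byte codes than plain UTF-8, especially for non-Latin scripts.
for text in ["We now have", "எங்களிடம் இப்போது"]:
    utf8_len = len(text.encode("utf-8"))
    myte_len = len(tokenizer(text).input_ids)
    print(f"{text!r}: {utf8_len} UTF-8 bytes -> {myte_len} MYTE ids")
```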
### Training Hyperparameters
We used the same hyperparameters as in the original ByT5 paper.
The only difference is that we decreased the number of training steps to 250,000 to avoid overfitting.
### Computational Infrastructure
Models were trained on TPUs available through the TPU Research Cloud (TRC).
We used a v3-8 TPU for training the small and base models and a v3-32 TPU for the large model.
Training each model took:
- **Small**: 90h
- **Base**: 230h
- **Large**: 190h
## Evaluation
MyT5 models are compared with a reimplementation of [ByT5](https://huggingface.co/docs/transformers/model_doc/byt5) models trained for 250,000 steps.
### Language Modeling
We evaluated language modeling performance on the multi-parallel [FLORES 200](https://arxiv.org/pdf/2207.04672v3.pdf) corpus.
To compare scores across languages and models, we used a normalized metric: Bits-per-English-Byte (BPEB).
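As a rough sketch of how this metric can be computed with the snippet above (an approximation, not the paper's exact evaluation code): the target sentence's total cross-entropy in bits is divided by the UTF-8 byte length of the parallel English sentence.

```python
import math
import torch

# Sketch only: approximate BPEB for a single FLORES sentence pair, reusing the
# `model` and `tokenizer` loaded above. Normalizing by the parallel English
# sentence's UTF-8 byte length follows the metric's description; details such as
# padding and EOS handling may differ from the paper's evaluation code.
def bits_per_english_byte(prefix, target_text, english_text):
    inputs = tokenizer([prefix], padding="longest", return_tensors="pt")
    targets = tokenizer([target_text], padding="longest", return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=targets.input_ids).loss  # mean nats per target position
    total_bits = loss.item() * targets.input_ids.numel() / math.log(2)
    return total_bits / len(english_text.encode("utf-8"))
```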
#### Results
| Size  | Languages | ByT5 BPEB | ByT5 T (ms) | MyT5 BPEB | MyT5 T (ms) |
|-------|-----------|-----------|-------------|-----------|-------------|
| small | All       | 10.1      | 7.0         | 4.6       | 6.7         |
| small | Latin     | 4.6       | 5.9         | 4.2       | 6.6         |
| small | Non-Latin | 18.1      | 8.5         | 5.1       | 6.8         |
| base  | All       | 8.2       | 11.5        | 5.8       | 8.9         |
| base  | Latin     | 4.9       | 9.4         | 5.0       | 8.7         |
| base  | Non-Latin | 13.0      | 14.6        | 6.9       | 9.1         |
| large | All       | 13.4      | 31.8        | 4.6       | 26.7        |
| large | Latin     | 10.1      | 28.1        | 4.0       | 26.6        |
| large | Non-Latin | 18.2      | 37.3        | 5.4       | 27.0        |

Bits-per-English-Byte (BPEB) and inference times T (average per FLORES 200 sentence), averaged within three language groupings.
The inference was run on an A40 GPU core.
### Downstream Tasks
We tested the large model on four end tasks: question answering, NER, semantic parsing, and machine translation.
The test data come from the XTREME-UP benchmark ([Ruder, Clark et al., 2023](https://arxiv.org/pdf/2305.11938.pdf)), which mainly covers low-resource languages.
#### Fine-tuning
In each task, we fine-tuned on all languages jointly.
We used a learning rate of 1e-3 with square root decay and a dropout of 0.1 (a schematic sketch of the schedule follows the list below).
The batch size and number of training steps varied across tasks:
- **NER**: 128 examples per batch, 6000 steps
- **QA**: 64 examples per batch, 6500 steps
- **Semantic Parsing**: 64 examples per batch, 1000 steps
- **MT**: 64 examples per batch, 10000 steps
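A minimal sketch of the decay schedule described above, assuming a T5-style inverse square-root decay to a peak learning rate of 1e-3; the warmup length is an illustrative assumption and is not stated in this card:

```python
# Sketch only: inverse square-root decay from a peak of 1e-3. The warmup length
# (1,000 steps) is an assumption for illustration, not a documented value.
def learning_rate(step, peak_lr=1e-3, warmup_steps=1_000):
    step = max(step, 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps       # linear warmup (assumed)
    return peak_lr * (warmup_steps / step) ** 0.5  # square-root decay
```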
#### Results
| Model      | QA (F1) | NER (F1) | Semantic Parsing (EM) | MT (chrF) |
|------------|---------|----------|-----------------------|-----------|
| Flan-PaLM* | 22.9    | 12.0     | 0.1                   | ---       |
| mT5*       | 59.7    | 74.0     | 21.8                  | ---       |
| ByT5       | 73.2    | 81.5     | 25.1                  | 20.1      |
| MyT5       | 75.3    | 80.8     | 19.6                  | 20.4      |

Inference times per example (ms):

| Model | QA   | NER  | Semantic Parsing | MT   |
|-------|------|------|------------------|------|
| ByT5  | 36.2 | 13.8 | 13.2             | 15.9 |
| MyT5  | 35.6 | 12.6 | 12.4             | 12.6 |
Average results on XTREME-UP tasks across low-resource languages.
The baseline results of mT5 and Flan-PaLM (in-context-learning evaluation) are reported in [Ruder, Clark et al., 2023](https://arxiv.org/pdf/2305.11938.pdf).
The reported inference time is an average across evaluation examples; the inference was run on an A40 GPU core.
## Citation
```bibtex
@misc{limisiewicz2024myte,
title={MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling},
author={Tomasz Limisiewicz and Terra Blevins and Hila Gonen and Orevaoghene Ahia and Luke Zettlemoyer},
year={2024},
eprint={2403.10691},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## Model Card Author
[Tomasz Limisiewicz](mailto:limisewicz@ufal.mff.cuni.cz)