---
language:
- af
- am
- ar
- az
- be
- bg
- bn
- ca
- ceb
- co
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fil
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- haw
- he
- hi
- hmn
- ht
- hu
- hy
- id
- ig
- is
- it
- iw
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lb
- lo
- lt
- lv
- mg
- mi
- mk
- ml
- mn
- mr
- ms
- mt
- my
- ne
- nl
- 'no'
- ny
- pa
- pl
- ps
- pt
- ro
- ru
- sd
- si
- sk
- sl
- sm
- sn
- so
- sq
- sr
- st
- su
- sv
- sw
- ta
- te
- tg
- th
- tr
- uk
- und
- ur
- uz
- vi
- xh
- yi
- yo
- zh
- zu
license: mit
datasets:
- mc4
---
# MyT5
## Model Details
MyT5 (**My**te **T5**) is a multilingual language model based on the T5 architecture.
The model uses a **m**orphologically-driven **byte** (**MYTE**) representation described in our paper [Limisiewicz et al., 2024](https://arxiv.org/pdf/2403.10691.pdf).
### Model Description
- **Developed by:** Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer
- **Funded by:** University of Washington Fellowship, Charles University Grant Agency
- **Model type:** T5
- **Language(s) (NLP):** Multilingual
- **License:** MIT
### Model Sizes
- **[Small](https://huggingface.co/Tomlim/myt5-small)**: 300M parameters
- **[Base](https://huggingface.co/Tomlim/myt5-base)**: 582M parameters
- **[Large](https://huggingface.co/Tomlim/myt5-large)**: 1.2B parameters
### Model Sources
- **[Repository](https://github.com/tomlimi/MYTE)**
- **[Paper](https://arxiv.org/pdf/2403.10691.pdf)**
## How to Get Started with the Model
The snippet below shows the basic usage of the model for multilingual language modeling.
A custom tokenizer is available in the [GitHub repository](https://github.com/tomlimi/MYTE), in `src/myt5/myt5_tokenizer.py`.
We also plan to release it on Hugging Face in the future.
```python
from transformers import T5ForConditionalGeneration
import torch

# The custom tokenizer ships with the MYTE repository (src/myt5/myt5_tokenizer.py);
# this import assumes the repository root is on your PYTHONPATH.
from src.myt5.myt5_tokenizer import MyT5Tokenizer

MODEL_SIZE = "large"  # small, base, or large

# Repository names follow the links in the "Model Sizes" section above.
model = T5ForConditionalGeneration.from_pretrained(f"Tomlim/myt5-{MODEL_SIZE}", use_safetensors=True)
tokenizer = MyT5Tokenizer()

pre_texts = ['"We now have',
             '„Mamy teraz myszy w wieku',
             '"""எங்களிடம் இப்போது']
post_texts = ['4-month-old mice that are non-diabetic that used to be diabetic," he added.',
              '4 miesięcy, które miały cukrzycę, ale zostały z niej wyleczone” – dodał.',
              '4-மாத-வயதுடைய எலி ஒன்று உள்ளது, முன்னர் அதற்கு நீரிழிவு இருந்தது தற்போது இல்லை"" என்று அவர் மேலும் கூறினார்."']

# Encode prefixes and continuations, then score the continuations with the model.
inputs = tokenizer(pre_texts, padding="longest", return_tensors="pt")
targets = tokenizer(post_texts, padding="longest", return_tensors="pt")

outputs = model(**inputs, labels=targets.input_ids)
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
```
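The same model can also be used for generation. A minimal sketch, assuming the custom `MyT5Tokenizer` implements the standard Hugging Face decoding interface (`batch_decode`):

```python
# Sketch only: continue the prompts with greedy decoding and turn the generated
# MYTE ids back into text. Assumes tokenizer.batch_decode is available.
generated = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```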
## Training Details
### Training Data
The model was trained on the standard T5 task of restoring corrupted spans in the multilingual MC4 dataset.
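Schematically, span corruption replaces random spans of the input with sentinel markers and trains the model to reconstruct them. The sketch below is purely illustrative (the example follows the original T5 paper, and the sentinels are shown symbolically, not as the model's actual ids):

```python
# Illustrative only: T5-style span corruption at the text level.
original        = "Thank you for inviting me to your party last week."
corrupted_input = "Thank you <X> me to your party <Y> week."
target          = "<X> for inviting <Y> last <Z>"
```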
### Preprocessing
Instead of UTF-8 bytes, we used a morphologically-driven byte (MYTE) representation.
See the description in our [paper](https://arxiv.org/pdf/2403.10691.pdf) for more details.
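As a rough illustration, and assuming the tokenizer loaded in the snippet above follows the standard Hugging Face call interface, you can compare the length of the raw UTF-8 byte sequence with the MYTE id sequence the tokenizer produces:

```python
# Sketch only: MYTE aims to encode text into fewer, more equitably distributed
# byte codes than plain UTF-8, especially for non-Latin scripts.
for text in ["We now have", "எங்களிடம் இப்போது"]:
    utf8_len = len(text.encode("utf-8"))
    myte_len = len(tokenizer(text).input_ids)
    print(f"{text!r}: {utf8_len} UTF-8 bytes -> {myte_len} MYTE ids")
```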
### Training Hyperparameters
We used the same hyperparameters as in the original ByT5 paper.
The only difference is that we decreased the number of training steps to 250,000 to avoid overfitting.
### Computational Infrastructure
Models were trained on TPUs available through the TPU Research Cloud (TRC).
We used a v3-8 TPU for training the small and base models and a v3-32 TPU for the large model.
Training each model took:
- **Small**: 90h
- **Base**: 230h
- **Large**: 190h
## Evaluation
MyT5 models are compared with a reimplementation of [ByT5](https://huggingface.co/docs/transformers/model_doc/byt5) models trained for 250,000 steps.
### Language Modeling
We evaluated language modeling performance on the multi-parallel [FLORES 200](https://arxiv.org/pdf/2207.04672v3.pdf) corpus.
To compare scores across languages and models, we used a normalized metric: Bits-per-English-Byte (BPEB).
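As a rough sketch of how this metric can be computed with the snippet above (an approximation, not the paper's exact evaluation code): the target sentence's total cross-entropy in bits is divided by the UTF-8 byte length of the parallel English sentence.

```python
import math
import torch

# Sketch only: approximate BPEB for a single FLORES sentence pair, reusing the
# `model` and `tokenizer` loaded above. Normalizing by the parallel English
# sentence's UTF-8 byte length follows the metric's description; details such as
# padding and EOS handling may differ from the paper's evaluation code.
def bits_per_english_byte(prefix, target_text, english_text):
    inputs = tokenizer([prefix], padding="longest", return_tensors="pt")
    targets = tokenizer([target_text], padding="longest", return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=targets.input_ids).loss  # mean nats per target position
    total_bits = loss.item() * targets.input_ids.numel() / math.log(2)
    return total_bits / len(english_text.encode("utf-8"))
```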
#### Results
| Size  | Languages | ByT5 BPEB | ByT5 T (ms) | MyT5 BPEB | MyT5 T (ms) |
|-------|-----------|-----------|-------------|-----------|-------------|
| small | All       | 10.1      | 7.0         | 4.6       | 6.7         |
| small | Latin     | 4.6       | 5.9         | 4.2       | 6.6         |
| small | Non-Latin | 18.1      | 8.5         | 5.1       | 6.8         |
| base  | All       | 8.2       | 11.5        | 5.8       | 8.9         |
| base  | Latin     | 4.9       | 9.4         | 5.0       | 8.7         |
| base  | Non-Latin | 13.0      | 14.6        | 6.9       | 9.1         |
| large | All       | 13.4      | 31.8        | 4.6       | 26.7        |
| large | Latin     | 10.1      | 28.1        | 4.0       | 26.6        |
| large | Non-Latin | 18.2      | 37.3        | 5.4       | 27.0        |

Bits-per-English-Byte (BPEB) and inference times T (average per FLORES 200 sentence), averaged within three language groupings.
The inference was run on an A40 GPU core.
### Downstream Tasks
We tested the large model on four end tasks: question answering, NER, semantic parsing, and machine translation.
The test data come from the XTREME-UP benchmark ([Ruder, Clark et al., 2023](https://arxiv.org/pdf/2305.11938.pdf)), which mainly covers low-resource languages.
#### Fine-tuning
In each task, we fine-tuned on all languages jointly.
We used a learning rate of 1e-3 with square root decay and a dropout of 0.1 (a schematic sketch of the schedule follows the list below).
The batch size and number of training steps varied across tasks:
- **NER**: 128 examples per batch, 6000 steps
- **QA**: 64 examples per batch, 6500 steps
- **Semantic Parsing**: 64 examples per batch, 1000 steps
- **MT**: 64 examples per batch, 10000 steps
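A minimal sketch of the decay schedule described above, assuming a T5-style inverse square-root decay to a peak learning rate of 1e-3; the warmup length is an illustrative assumption and is not stated in this card:

```python
# Sketch only: inverse square-root decay from a peak of 1e-3. The warmup length
# (1,000 steps) is an assumption for illustration, not a documented value.
def learning_rate(step, peak_lr=1e-3, warmup_steps=1_000):
    step = max(step, 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps       # linear warmup (assumed)
    return peak_lr * (warmup_steps / step) ** 0.5  # square-root decay
```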
#### Results
| Model      | QA (F1) | NER (F1) | Semantic Parsing (EM) | MT (chrF) |
|------------|---------|----------|-----------------------|-----------|
| Flan-PaLM* | 22.9    | 12.0     | 0.1                   | ---       |
| mT5*       | 59.7    | 74.0     | 21.8                  | ---       |
| ByT5       | 73.2    | 81.5     | 25.1                  | 20.1      |
| MyT5       | 75.3    | 80.8     | 19.6                  | 20.4      |

Inference times per example (ms):

| Model | QA   | NER  | Semantic Parsing | MT   |
|-------|------|------|------------------|------|
| ByT5  | 36.2 | 13.8 | 13.2             | 15.9 |
| MyT5  | 35.6 | 12.6 | 12.4             | 12.6 |
Average results on XTREME-UP tasks across low-resource languages.
The baseline results of mT5 and Flan-PaLM (in-context-learning evaluation) are reported in [Ruder, Clark et al., 2023](https://arxiv.org/pdf/2305.11938.pdf).
The reported inference time is an average across evaluation examples; the inference was run on an A40 GPU core.
## Citation
```bibtex
@misc{limisiewicz2024myte,
title={MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling},
author={Tomasz Limisiewicz and Terra Blevins and Hila Gonen and Orevaoghene Ahia and Luke Zettlemoyer},
year={2024},
eprint={2403.10691},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## Model Card Author
[Tomasz Limisiewicz](mailto:limisewicz@ufal.mff.cuni.cz)