hmByT5 - Preliminary Language Models

Preliminary Historic Multilingual and Monolingual ByT5 Models. The following languages are currently covered:

English (British Library Corpus - Books)

More details can be found in our GitHub repository.

Pretraining

We use the official JAX/FLAX example in Hugging Face Transformers to pretrain a ByT5 model on a single v3-8 TPU. Details about the training can be found here.

This model was trained with mean_noise_span_length=20.
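
To give a rough sense of what this setting changes, the sketch below reproduces the span-count arithmetic used by T5-style span corruption: the number of masked tokens is the input length times the noise density, and the number of masked spans is that count divided by the mean span length. The input length of 1024 bytes and the noise density of 0.15 are illustrative assumptions, not values reported here; only mean_noise_span_length=20 comes from this model card. The helper name `span_corruption_counts` is hypothetical.

```python
def span_corruption_counts(length, noise_density, mean_noise_span_length):
    """Return (num_noise_tokens, num_noise_spans) for one input sequence."""
    num_noise_tokens = int(round(length * noise_density))
    # Keep at least one masked token and at least one unmasked token.
    num_noise_tokens = min(max(num_noise_tokens, 1), length - 1)
    # The number of spans follows from the requested mean span length.
    num_noise_spans = max(int(round(num_noise_tokens / mean_noise_span_length)), 1)
    return num_noise_tokens, num_noise_spans


LENGTH = 1024          # assumed input length in bytes (not from the source)
NOISE_DENSITY = 0.15   # assumed masking rate (not from the source)

for mean_span in (3, 20):
    tokens, spans = span_corruption_counts(LENGTH, NOISE_DENSITY, mean_span)
    print(f"mean_noise_span_length={mean_span}: {tokens} masked tokens in {spans} spans")
```

Under these assumed values, a mean span length of 20 produces far fewer but much longer masked spans than a mean span length of 3, which is a plausible fit for byte-level input where a single word covers several tokens.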

Evaluation on Downstream Tasks (NER)

We evaluated the hmByT5 Base model on the English AjMC dataset:

Configuration	Run 1	Run 2	Run 3	Run 4	Run 5	Avg.
`wsFalse-bs8-e10-lr0.00015-poolingfirst`	86.51	87.20	86.22	85.78	86.46	86.43 ± 0.46
`wsFalse-bs4-e10-lr0.00016-poolingfirst`	86.12	87.04	87.01	85.25	86.74	86.43 ± 0.68
`wsFalse-bs8-e10-lr0.00016-poolingfirst`	86.49	85.27	86.12	86.29	85.61	85.96 ± 0.45
`wsFalse-bs4-e10-lr0.00015-poolingfirst`	86.33	86.05	84.48	85.68	86.16	85.74 ± 0.67
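
The Avg. column can be checked from the per-run scores. The sketch below assumes that the ± value is the population standard deviation over the five runs; this is inferred from the numbers in the table, not stated in the source. The helper name `summarize` is hypothetical.

```python
from statistics import mean, pstdev

# Per-run scores copied from the table above.
runs = {
    "wsFalse-bs8-e10-lr0.00015-poolingfirst": [86.51, 87.20, 86.22, 85.78, 86.46],
    "wsFalse-bs4-e10-lr0.00016-poolingfirst": [86.12, 87.04, 87.01, 85.25, 86.74],
    "wsFalse-bs8-e10-lr0.00016-poolingfirst": [86.49, 85.27, 86.12, 86.29, 85.61],
    "wsFalse-bs4-e10-lr0.00015-poolingfirst": [86.33, 86.05, 84.48, 85.68, 86.16],
}


def summarize(scores):
    """Format scores as 'mean ± population standard deviation'."""
    return f"{mean(scores):.2f} ± {pstdev(scores):.2f}"


for config, scores in runs.items():
    print(f"{config}: {summarize(scores)}")
```

Using the sample standard deviation (`statistics.stdev`) instead would give slightly larger spreads than those listed in the table.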

For comparison, the ByT5 Small model achieves 85.65 ± 1.21 on this dataset.

Acknowledgements

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). Many thanks for providing access to the TPUs ❤️