Papers
arxiv:2012.15613

How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

Published on Dec 31, 2020

Abstract

In this work, we provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance. We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks. We first aim to establish, via fair and controlled comparisons, if a gap between the multilingual and the corresponding monolingual representation of that language exists, and subsequently investigate the reason for any performance difference. To disentangle conflating factors, we train new monolingual models on the same data, with monolingually and multilingually trained tokenizers. We find that while the pretraining data size is an important factor, a designated monolingual tokenizer plays an equally important role in the downstream performance. Our results show that languages that are adequately represented in the multilingual model's vocabulary exhibit negligible performance decreases over their monolingual counterparts. We further find that replacing the original multilingual tokenizer with the specialized monolingual tokenizer improves the downstream performance of the multilingual model for almost every task and language.

Community

Dear paper authors ( @plip , @ivulic and @ruder ),

After thinking for a while about the subword fertility rate (incl. its calculation), I came up with one problem:

E.g., imagine a "malicious" tokenizer that simply returns the UNK token for every input token, or whenever a token would be split into too many subwords. Such a tokenizer would achieve a near-perfect fertility close to 1.0 while being useless for downstream tasks.

So I guess a better metric than the subword fertility rate alone would be one that also includes the number of UNKs in the calculation when comparing tokenizers?
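For concreteness, the combined metric could be sketched like this. The tokenizer and vocabulary below are a hypothetical toy (greedy WordPiece-style longest match), not the paper's actual setup; the point is only that fertility and UNK rate are computed side by side, so a tokenizer that dumps everything into UNK gets exposed by the second number even though its fertility looks good:

```python
# Toy vocabulary for illustration only (hypothetical, not from the paper).
VOCAB = {"how", "good", "is", "your", "token", "##izer", "[UNK]"}

def tokenize(word):
    """Greedy longest-match-first WordPiece-style split (toy sketch)."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in VOCAB:
                piece = candidate
                break
            end -= 1
        if piece is None:            # word cannot be segmented -> emit UNK
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

def fertility_and_unk_rate(words):
    """Fertility = subwords per word; UNK rate = share of UNK subwords."""
    subwords = [p for w in words for p in tokenize(w)]
    fertility = len(subwords) / len(words)
    unk_rate = subwords.count("[UNK]") / len(subwords)
    return fertility, unk_rate

# "tokenizer" splits into "token" + "##izer": 6 subwords / 5 words = 1.2
print(fertility_and_unk_rate(["how", "good", "is", "your", "tokenizer"]))
```

A degenerate tokenizer that maps every word to `[UNK]` would report fertility 1.0 but UNK rate 1.0, which is exactly the case the fertility number alone hides.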

Please let me know your thoughts on that :)


Models citing this paper 3

Datasets citing this paper 1

Spaces citing this paper 0


Collections including this paper 1