arxiv:2404.08335

Toward a Theory of Tokenization in LLMs

Published on Apr 12

Abstract

While there has been a large body of research attempting to circumvent tokenization for language modeling (Clark et al., 2022; Xue et al., 2022), the current consensus is that it is a necessary initial step for designing state-of-the-art performant language models. In this paper, we investigate tokenization from a theoretical point of view by studying the behavior of transformers on simple data-generating processes. When trained on data drawn from certain simple k^{th}-order Markov processes for k > 1, transformers exhibit a surprising phenomenon: in the absence of tokenization, they empirically fail to learn the right distribution and instead predict characters according to a unigram model (Makkuva et al., 2024). With the addition of tokenization, however, we empirically observe that transformers break through this barrier and are able to model the probabilities of sequences drawn from the source near-optimally, achieving small cross-entropy loss. With this observation as a starting point, we study the end-to-end cross-entropy loss achieved by transformers with and without tokenization. With the appropriate tokenization, we show that even the simplest unigram models (over tokens) learnt by transformers are able to model the probability of sequences drawn from k^{th}-order Markov sources near-optimally. Our analysis provides a justification for the use of tokenization in practice by studying the behavior of transformers on Markovian data.
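To make the abstract's claim concrete, here is a minimal sketch (not the paper's code or construction): it samples a toy order-k binary Markov source and compares the per-character cross-entropy of a character-level unigram model against a unigram model over tokens. The source parameters, the fixed-length block "tokenizer", and all function names below are illustrative assumptions; the paper's analysis concerns learned token dictionaries, for which the block tokenizer here is only a crude stand-in.

```python
# Minimal sketch, under the assumptions stated above, of the gap between a
# character-level unigram model and a unigram model over tokens on an order-k
# Markov source. Not the paper's construction; fixed-length blocks are used as
# a deliberately simple stand-in for a learned token dictionary.

import math
import random
from collections import Counter

random.seed(0)

def sample_markov(n, k=2, p_repeat=0.9):
    """Order-k binary source: each new bit repeats the bit k steps back w.p. p_repeat."""
    bits = [random.randint(0, 1) for _ in range(k)]
    for _ in range(n - k):
        prev = bits[-k]
        bits.append(prev if random.random() < p_repeat else 1 - prev)
    return "".join(map(str, bits))

def unigram_cross_entropy(units):
    """Empirical cross-entropy (bits per character) of a unigram model over `units`."""
    counts = Counter(units)
    total = sum(counts.values())
    nll_bits = -sum(c * math.log2(c / total) for c in counts.values())
    chars = sum(len(u) for u in units)
    return nll_bits / chars

text = sample_markov(200_000)

# (a) No tokenization: unigram model over single characters.
char_loss = unigram_cross_entropy(list(text))

# (b) Toy tokenization: chop the sequence into fixed blocks of length L and fit
#     a unigram model over those blocks.
L = 8
tokens = [text[i:i + L] for i in range(0, len(text) - L + 1, L)]
token_loss = unigram_cross_entropy(tokens)

# Entropy rate of the source: each new bit is Bernoulli(p_repeat) given its k-step history.
p = 0.9
h_rate = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(f"entropy rate       : {h_rate:.3f} bits/char")
print(f"char-unigram loss  : {char_loss:.3f} bits/char")   # ~1.0: ignores the Markov structure
print(f"token-unigram loss : {token_loss:.3f} bits/char")  # noticeably closer to the entropy rate
```

Even this crude tokenizer moves the unigram model's loss well below the character-level baseline, which is the qualitative effect the paper quantifies for properly constructed token dictionaries.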

Community

"In practice, byte-level/character level models perform worse"

-> This is plainly wrong, and not only reviewer 2 will complain about it.

Maybe the authors can just have a look at, e.g., the ByT5 paper (Table 4) or CharacterBERT from Boukkouri et al.

Additionally, Flair Embeddings (a character-based language model) outperform BERT on the CoNLL dataset for Named Entity Recognition; see Table 7 of the BERT paper back in 2018.

Nevertheless, there are nice ongoing discussions and work on character-level NMT: https://jlibovicky.github.io/2023/01/19/Why-Dont-People-Use-Character-level-MT.html and recent work is very promising: https://arxiv.org/abs/2302.14220.

