In Chapter 3, we looked at how to fine-tune a model on a given task. When we do that, we use the same tokenizer that the model was pretrained with — but what do we do when we want to train a model from scratch? In these cases, using a tokenizer that was pretrained on a corpus from another domain or language is typically suboptimal. For example, a tokenizer that’s trained on an English corpus will perform poorly on a corpus of Japanese texts because the use of spaces and punctuation is very different in the two languages.
In this chapter, you will learn how to train a brand new tokenizer on a corpus of texts, so it can then be used to pretrain a language model. This will all be done with the help of the 🤗 Tokenizers library, which provides the “fast” tokenizers in the 🤗 Transformers library. We’ll take a close look at the features that this library provides, and explore how the fast tokenizers differ from the “slow” versions.
Topics we will cover include:
- How to train a new tokenizer similar to the one used by a given checkpoint on a new corpus of texts
- The special features of fast tokenizers
- The differences between the three main subword tokenization algorithms used in NLP today
- How to build a tokenizer from scratch with the 🤗 Tokenizers library and train it on some data
The techniques introduced in this chapter will prepare you for the section in Chapter 7 where we look at creating a language model for Python source code. Let’s start by looking at what it means to “train” a tokenizer in the first place.