Introduction

In Chapter 3, we looked at how to fine-tune a model on a given task. When we do that, we use the same tokenizer that the model was pretrained with — but what do we do when we want to train a model from scratch? In these cases, using a tokenizer that was pretrained on a corpus from another domain or language is typically suboptimal. For example, a tokenizer that’s trained on an English corpus will perform poorly on a corpus of Japanese texts because the use of spaces and punctuation is very different in the two languages.
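To make the mismatch concrete, here is a minimal sketch. It assumes the bert-base-cased checkpoint and an arbitrary Japanese sentence; any English-only tokenizer shows a similar effect.

```python
from transformers import AutoTokenizer

# A tokenizer trained on English text (bert-base-cased is just one example
# of an English-only checkpoint).
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# English text maps to a short sequence of mostly whole-word tokens.
print(tokenizer.tokenize("Training a tokenizer is easy."))

# Japanese text tends to be split into single characters or unknown tokens,
# because the vocabulary was learned on English words and spacing conventions.
print(tokenizer.tokenize("これは日本語の文章です。"))
```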

In this chapter, you will learn how to train a brand new tokenizer on a corpus of texts, so it can then be used to pretrain a language model. This will all be done with the help of the 🤗 Tokenizers library, which provides the “fast” tokenizers in the 🤗 Transformers library. We’ll take a close look at the features that this library provides, and explore how the fast tokenizers differ from the “slow” versions.
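As a rough preview of what this looks like with a fast tokenizer, the sketch below reuses an existing tokenizer's algorithm and learns a new vocabulary from a toy corpus; the gpt2 checkpoint, the corpus, the vocabulary size of 52,000, and the output directory are illustrative choices, not requirements.

```python
from transformers import AutoTokenizer

# Start from an existing fast tokenizer so its algorithm and preprocessing
# pipeline are reused; only the vocabulary is relearned from our data.
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Toy corpus: any iterator yielding batches of raw texts works here.
corpus = [["def add(a, b):", "    return a + b"], ["print(add(1, 2))"]]

# Learn a new vocabulary from the corpus (52,000 is an example size).
new_tokenizer = old_tokenizer.train_new_from_iterator(corpus, vocab_size=52000)

new_tokenizer.save_pretrained("code-tokenizer")  # hypothetical output directory
```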

Topics we will cover include:

  • How to train a new tokenizer similar to the one used by a given checkpoint on a new corpus of texts
  • The special features of fast tokenizers
  • The differences between the three main subword tokenization algorithms used in NLP today
  • How to build a tokenizer from scratch with the 🤗 Tokenizers library and train it on some data (see the sketch after this list)
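For the last point, a minimal sketch with the 🤗 Tokenizers library looks roughly like this; the BPE model, Whitespace pre-tokenizer, vocabulary size, special tokens, and tiny in-memory corpus are all placeholder choices:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# A BPE model with an explicit unknown token.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# Split on whitespace and punctuation before the BPE model sees the text.
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# The trainer holds the hyperparameters of the training run.
trainer = trainers.BpeTrainer(vocab_size=25000, special_tokens=["[UNK]", "[PAD]"])

# Train directly from an in-memory iterator of raw texts.
tokenizer.train_from_iterator(["first example text", "second example text"], trainer=trainer)

tokenizer.save("tokenizer.json")
```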

The techniques introduced in this chapter will prepare you for the section in Chapter 7 where we look at creating a language model for Python source code. Let’s start by looking at what it means to “train” a tokenizer in the first place.