Tokenizer

The base class PreTrainedTokenizer implements the common methods for loading/saving a tokenizer either from a local file or directory, or from a pretrained tokenizer provided by the library (downloaded from HuggingFace’s AWS S3 repository).

PreTrainedTokenizer is also the main entry point to the tokenizers, since it implements the main methods for using all of them:

  • tokenizing, converting tokens to ids and back, and encoding/decoding,

  • adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece…),

  • managing special tokens (adding them, assigning them to roles, and making sure they are not split during tokenization).
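
The three responsibilities listed above can be illustrated with a small, self-contained sketch. This is not the library's implementation: ToyTokenizer is a hypothetical word-level class written only for illustration. Its method names (tokenize, convert_tokens_to_ids, convert_ids_to_tokens, encode, decode, add_tokens) are chosen to mirror the operations described, while the splitting rule, the "[UNK]" fallback, and the id assignment are simplifying assumptions.

```python
import re


class ToyTokenizer:
    """Toy word-level tokenizer (illustration only, not the library's code)."""

    def __init__(self, vocab, unk_token="[UNK]"):
        self.unk_token = unk_token
        self.token_to_id = {}
        self.id_to_token = {}
        self.protected = []  # added tokens that must never be split
        for token in list(vocab) + [unk_token]:
            self._register(token)

    def _register(self, token):
        """Give `token` the next free id; return True if it was new."""
        if token in self.token_to_id:
            return False
        index = len(self.token_to_id)
        self.token_to_id[token] = index
        self.id_to_token[index] = token
        return True

    def add_tokens(self, new_tokens):
        """Extend the vocabulary and protect the tokens from splitting.

        Returns the number of tokens that were actually new.
        """
        added = 0
        for token in new_tokens:
            added += self._register(token)
            if token not in self.protected:
                self.protected.append(token)
        return added

    def tokenize(self, text):
        """Cut out protected tokens first, then split the rest into words."""
        pieces = [text]
        if self.protected:
            # Longest first, so a shorter token cannot pre-empt a longer one.
            ordered = sorted(self.protected, key=len, reverse=True)
            pattern = "(" + "|".join(re.escape(t) for t in ordered) + ")"
            pieces = re.split(pattern, text)
        tokens = []
        for piece in pieces:
            if piece in self.protected:
                tokens.append(piece)  # kept whole, case preserved
            else:
                # Assumed rule: lowercase, then words and single punctuation marks.
                tokens.extend(re.findall(r"\w+|[^\w\s]", piece.lower()))
        return tokens

    def convert_tokens_to_ids(self, tokens):
        unk_id = self.token_to_id[self.unk_token]
        return [self.token_to_id.get(token, unk_id) for token in tokens]

    def convert_ids_to_tokens(self, ids):
        return [self.id_to_token[index] for index in ids]

    def encode(self, text):
        return self.convert_tokens_to_ids(self.tokenize(text))

    def decode(self, ids):
        return " ".join(self.convert_ids_to_tokens(ids))


tok = ToyTokenizer(["hello", "world", "!"])
ids = tok.encode("Hello world!")
print(ids, "->", tok.decode(ids))

print(tok.tokenize("[CLS] hello"))  # "[CLS]" is split into pieces here
tok.add_tokens(["[CLS]"])
print(tok.tokenize("[CLS] hello"))  # "[CLS]" now survives as one token
```

In this sketch add_tokens only appends entries to a lookup table and a list of protected strings, so it never touches the splitting rule itself. That is one simple way to keep adding tokens independent of the underlying tokenization scheme.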

PreTrainedTokenizer