When building a Tokenizer, you can attach various types of components to it in order to customize its behavior. This page lists most of the provided components.
A Normalizer is in charge of pre-processing the input string in order
to normalize it as relevant for a given use case. Some common examples
of normalization are the Unicode normalization algorithms (NFD, NFKD,
NFC and NFKC), lowercasing, etc. What is specific to tokenizers is that
the alignment is tracked while normalizing. This is essential to
allow mapping from the generated tokens back to the input text.
The Normalizer is optional.
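To make the alignment idea concrete, here is a minimal sketch in plain Python using only the standard library. It is not the library's implementation: the names `normalize_with_alignment` and `original_span` are hypothetical, and normalizing one character at a time is a simplification chosen to keep the bookkeeping obvious.

```python
import unicodedata


def normalize_with_alignment(text):
    """Apply NFKD, drop accents and lowercase, remembering where each char came from."""
    out_chars = []
    alignment = []  # alignment[i] is the index in `text` that produced out_chars[i]
    for idx, ch in enumerate(text):
        # NFKD may expand one character into several (ligatures, accented letters).
        for piece in unicodedata.normalize("NFKD", ch):
            if unicodedata.combining(piece):
                continue  # skip the combining accent itself
            for low in piece.lower():
                out_chars.append(low)
                alignment.append(idx)
    return "".join(out_chars), alignment


def original_span(alignment, start, end):
    """Map a [start, end) span of the normalized string back to the input string."""
    return alignment[start], alignment[end - 1] + 1


text = "\ufb01anc\u00e9"  # "fi" ligature + "anc" + "e" with an acute accent
normalized, alignment = normalize_with_alignment(text)
print(normalized)  # fiance
print(alignment)   # [0, 0, 1, 2, 3, 4]
print(original_span(alignment, 0, 2))  # (0, 1): "fi" comes from the single ligature
```

The normalized string is longer than the input, yet any span of it can still be mapped back to the characters it came from.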
|NFD||NFD Unicode normalization|
|NFKD||NFKD Unicode normalization|
|NFC||NFC Unicode normalization|
|NFKC||NFKC Unicode normalization|
|Lowercase||Replaces all uppercase characters with lowercase||Input: |
|Strip||Removes all whitespace characters on the specified sides (left, right or both) of the input||Input: |
|StripAccents||Removes all accent symbols in Unicode (to be used with NFD for consistency)||Input: |
|Replace||Replaces occurrences of a custom string or regexp with the given content|
|BertNormalizer||Provides an implementation of the Normalizer used in the original BERT. Options that can be set are:
|Sequence||Composes multiple normalizers that will run in the provided order|
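The sketch below illustrates how running several normalizers in a fixed order might behave, using NFD, Lowercase and StripAccents from the table above. It is written in plain Python with the standard library rather than with the library's own classes; the helper names (`nfd`, `lowercase`, `strip_accents`, `sequence`) are hypothetical stand-ins.

```python
import unicodedata


def nfd(text):
    # Decompose accented characters into a base character plus combining marks.
    return unicodedata.normalize("NFD", text)


def lowercase(text):
    return text.lower()


def strip_accents(text):
    # Removes combining marks only, so it relies on a prior decomposition such as NFD.
    return "".join(ch for ch in text if not unicodedata.combining(ch))


def sequence(*steps):
    """Compose normalizers so that they run in the order given."""
    def run(text):
        for step in steps:
            text = step(text)
        return text
    return run


normalizer = sequence(nfd, lowercase, strip_accents)
result = normalizer("Héllò hôw are ü?")
print(result)  # hello how are u?
```

This also shows why StripAccents is paired with NFD: applied alone to a precomposed character such as `"\u00e9"`, `strip_accents` finds no combining mark to remove and returns the character unchanged.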
The PreTokenizer takes care of splitting the input according to a set
of rules. This pre-processing lets you ensure that the underlying
Model does not build tokens across multiple “splits”. For example, if
you don’t want to have whitespace inside a token, then you can have a
PreTokenizer that splits on whitespace.
You can easily combine multiple
PreTokenizers using a
Sequence (see below). A
PreTokenizer is also allowed to modify the
string, just like a
Normalizer does. This is necessary for some
complicated algorithms that need to split before normalizing (e.g.
|ByteLevel||Splits on whitespaces while remapping all the bytes to a set of visible characters. This technique has been introduced by OpenAI with GPT-2 and has some more or less nice properties:
|Whitespace||Splits on word boundaries (using the following regular expression:
|WhitespaceSplit||Splits on any whitespace character||Input: |
|Punctuation||Will isolate all punctuation characters||Input: |
|Metaspace||Splits on whitespaces and replaces them with a special char “▁” (U+2581)||Input: |
|CharDelimiterSplit||Splits on a given character||Example with |
|Digits||Splits the numbers from any other characters.||Input: |
|Split||Versatile pre-tokenizer that splits on provided pattern and according to provided behavior. The pattern can be inverted if necessary.
||Example with pattern = |
|Sequence||Lets you compose multiple
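As a rough illustration of pre-tokenization, the following plain-Python sketch mimics the behavior described for WhitespaceSplit and Digits and chains them the way a Sequence would. The function names are hypothetical and the regular expressions are assumptions made for this sketch, not the patterns the library uses. Each piece carries its `(start, end)` offsets so that it can still be located in the input.

```python
import re


def whitespace_split(text):
    """Split on any whitespace, returning (piece, (start, end)) pairs."""
    return [(m.group(), m.span()) for m in re.finditer(r"\S+", text)]


def split_digits(pieces):
    """Separate runs of digits from other characters inside each piece."""
    result = []
    for piece, (start, _) in pieces:
        for m in re.finditer(r"\d+|\D+", piece):
            # Offsets are shifted so they stay relative to the original input.
            result.append((m.group(), (start + m.start(), start + m.end())))
    return result


def pre_tokenize(text):
    # Equivalent in spirit to a Sequence of two pre-tokenizers.
    return split_digits(whitespace_split(text))


text = "Call 911 or room42 now"
pieces = pre_tokenize(text)
print([piece for piece, _ in pieces])
# ['Call', '911', 'or', 'room', '42', 'now']
print(pieces[3])  # ('room', (12, 16))
```

Because the second step only refines the pieces produced by the first, no piece ever spans a whitespace boundary, which is the guarantee pre-tokenization is meant to give the Model.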
Models are the core algorithms used to actually tokenize, and therefore, they are the only mandatory component of a Tokenizer.
|WordLevel||This is the “classic” tokenization algorithm. It lets you simply map words to IDs without anything fancy. This has the advantage of being really simple to use and understand, but it requires extremely large vocabularies for a good coverage. Using this |
|BPE||One of the most popular subword tokenization algorithms. Byte-Pair-Encoding works by starting with characters and merging those that are most frequently seen together, thus creating new tokens. It then works iteratively to build new tokens out of the most frequent pairs it sees in a corpus. BPE is able to build words it has never seen by using multiple subword tokens, and thus requires smaller vocabularies, with fewer chances of having “unk” (unknown) tokens.|
|WordPiece||This is a subword tokenization algorithm quite similar to BPE, used mainly by Google in models like BERT. It uses a greedy algorithm that tries to build long words first, splitting into multiple tokens when entire words don’t exist in the vocabulary. This is different from BPE, which starts from characters and builds bigger tokens where possible. It uses the famous |
|Unigram||Unigram is also a subword tokenization algorithm, and works by trying to identify the best set of subword tokens to maximize the probability for a given sentence. Unlike BPE, it is not deterministic based on a set of rules applied sequentially. Instead, Unigram can compute multiple ways of tokenizing and chooses the most probable one.|
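The contrast between BPE (merging upward from characters) and WordPiece (greedily matching the longest known piece first) can be sketched in a few lines of plain Python. This is a toy illustration, not the library's implementation: the function names, the tiny corpus, the `##` continuation prefix and the `[UNK]` token are all assumptions made for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_counts, num_merges):
    """Learn merge rules by repeatedly merging the most frequent adjacent pair."""
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Ties are broken by the pair itself so the result is reproducible.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = {merge_pair(s, best): c for s, c in corpus.items()}
    return merges


def bpe_encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def wordpiece_encode(word, vocab, prefix="##", unk="[UNK]"):
    """Greedy longest-match-first tokenization against a fixed vocabulary."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # mark pieces that continue a word
            if piece in vocab:
                break
            end -= 1
        if end == start:
            return [unk]  # no known piece matches at this position
        tokens.append(piece)
        start = end
    return tokens


merges = learn_bpe({"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}, 3)
print(merges)                      # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(bpe_encode("mug", merges))   # ['m', 'ug']
print(wordpiece_encode("hugs", {"hug", "##s"}))  # ['hug', '##s']
```

Note that `"mug"` never appears in the toy corpus, yet BPE still tokenizes it from known subwords, whereas the WordPiece sketch falls back to the unknown token as soon as no vocabulary entry matches.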
After the whole pipeline, we sometimes want to insert some special
tokens before feeding a tokenized string into a model, like “[CLS] My
horse is amazing [SEP]”. The
PostProcessor is the component that does this.
|TemplateProcessing||Lets you easily template the post processing, adding special tokens, and specifying the
||Example, when specifying a template with these values:|
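One way to picture template-based post-processing is the plain-Python sketch below. The `$A` and `$B` placeholder convention and the `apply_template` helper are assumptions made for this illustration and are not taken from the text above; only the `[CLS] ... [SEP]` shape comes from the example sentence.

```python
def apply_template(template, sequences):
    """Expand a whitespace-separated template into a flat list of tokens.

    Items starting with "$" are placeholders for a tokenized sequence
    (for instance "$A" looks up sequences["A"]); anything else is treated
    as a literal special token.
    """
    output = []
    for item in template.split():
        if item.startswith("$"):
            output.extend(sequences[item[1:]])
        else:
            output.append(item)
    return output


single = apply_template("[CLS] $A [SEP]", {"A": ["My", "horse", "is", "amazing"]})
print(single)
# ['[CLS]', 'My', 'horse', 'is', 'amazing', '[SEP]']

pair = apply_template(
    "[CLS] $A [SEP] $B [SEP]",
    {"A": ["My", "horse"], "B": ["is", "amazing"]},
)
print(pair)
# ['[CLS]', 'My', 'horse', '[SEP]', 'is', 'amazing', '[SEP]']
```

Keeping the special tokens in a template, rather than hard-coding them, means the same function can serve both single sequences and pairs.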
The Decoder knows how to go from the IDs used by the Tokenizer back to
a readable piece of text. Some PreTokenizers and Models use
special characters or identifiers that need to be reverted, for example.
|ByteLevel||Reverts the ByteLevel PreTokenizer. This PreTokenizer encodes at the byte-level, using a set of visible Unicode characters to represent each byte, so we need a Decoder to revert this process and get something readable again.|
|Metaspace||Reverts the Metaspace PreTokenizer. This PreTokenizer uses a special identifier |
|WordPiece||Reverts the WordPiece Model. This model uses a special identifier |
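The following plain-Python sketch shows what reverting such markers might look like for two of the cases above. It is an illustration under assumptions, not the library's decoders: the function names are hypothetical, the Metaspace marker is the “▁” (U+2581) character mentioned for the Metaspace PreTokenizer, and `##` is assumed here as the WordPiece continuation prefix.

```python
METASPACE = "\u2581"  # the "▁" character


def metaspace_pre_tokenize(text):
    # Prefix every word with the marker so that spaces survive tokenization.
    return [METASPACE + word for word in text.split(" ") if word]


def metaspace_decode(tokens):
    # Turn the markers back into spaces and drop the leading one.
    return "".join(tokens).replace(METASPACE, " ").lstrip(" ")


def wordpiece_decode(tokens, prefix="##"):
    # Glue continuation pieces onto the previous word, then join words with spaces.
    words = []
    for token in tokens:
        if token.startswith(prefix) and words:
            words[-1] += token[len(prefix):]
        else:
            words.append(token)
    return " ".join(words)


pieces = metaspace_pre_tokenize("Hello there friend")
print(metaspace_decode(pieces))  # Hello there friend
print(wordpiece_decode(["token", "##izer", "##s", "are", "fun"]))  # tokenizers are fun
```

In this simplified version, runs of several spaces are collapsed during pre-tokenization, so the round trip is exact only for single-spaced input.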