Quicktour ==================================================================================================== Let's have a quick look at the 🤗 Tokenizers library features. The library provides an implementation of today's most used tokenizers that is both easy to use and blazing fast. .. only:: python It can be used to instantiate a :ref:`pretrained tokenizer ` but we will start our quicktour by building one from scratch and see how we can train it. Build a tokenizer from scratch ---------------------------------------------------------------------------------------------------- To illustrate how fast the 🤗 Tokenizers library is, let's train a new tokenizer on `wikitext-103 `__ (516M of text) in just a few seconds. First things first, you will need to download this dataset and unzip it with: .. code-block:: bash wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip unzip wikitext-103-raw-v1.zip Training the tokenizer ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. entities:: python BpeTrainer :class:`~tokenizers.trainers.BpeTrainer` vocab_size :obj:`vocab_size` min_frequency :obj:`min_frequency` special_tokens :obj:`special_tokens` unk_token :obj:`unk_token` pad_token :obj:`pad_token` .. entities:: rust BpeTrainer :rust_struct:`~tokenizers::models::bpe::BpeTrainer` vocab_size :obj:`vocab_size` min_frequency :obj:`min_frequency` special_tokens :obj:`special_tokens` unk_token :obj:`unk_token` pad_token :obj:`pad_token` .. entities:: node BpeTrainer BpeTrainer vocab_size :obj:`vocabSize` min_frequency :obj:`minFrequency` special_tokens :obj:`specialTokens` unk_token :obj:`unkToken` pad_token :obj:`padToken` In this tour, we will build and train a Byte-Pair Encoding (BPE) tokenizer. For more information about the different type of tokenizers, check out this `guide `__ in the 🤗 Transformers documentation. Here, training the tokenizer means it will learn merge rules by: - Start with all the characters present in the training corpus as tokens. - Identify the most common pair of tokens and merge it into one token. - Repeat until the vocabulary (e.g., the number of tokens) has reached the size we want. The main API of the library is the :entity:`class` :entity:`Tokenizer`, here is how we instantiate one with a BPE model: .. only:: python .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py :language: python :start-after: START init_tokenizer :end-before: END init_tokenizer :dedent: 8 .. only:: rust .. literalinclude:: ../../tokenizers/tests/documentation.rs :language: rust :start-after: START quicktour_init_tokenizer :end-before: END quicktour_init_tokenizer :dedent: 4 .. only:: node .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts :language: javascript :start-after: START init_tokenizer :end-before: END init_tokenizer :dedent: 8 To train our tokenizer on the wikitext files, we will need to instantiate a `trainer`, in this case a :entity:`BpeTrainer` .. only:: python .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py :language: python :start-after: START init_trainer :end-before: END init_trainer :dedent: 8 .. only:: rust .. literalinclude:: ../../tokenizers/tests/documentation.rs :language: rust :start-after: START quicktour_init_trainer :end-before: END quicktour_init_trainer :dedent: 4 .. only:: node .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts :language: javascript :start-after: START init_trainer :end-before: END init_trainer :dedent: 8 We can set the training arguments like :entity:`vocab_size` or :entity:`min_frequency` (here left at their default values of 30,000 and 0) but the most important part is to give the :entity:`special_tokens` we plan to use later on (they are not used at all during training) so that they get inserted in the vocabulary. .. note:: The order in which you write the special tokens list matters: here :obj:`"[UNK]"` will get the ID 0, :obj:`"[CLS]"` will get the ID 1 and so forth. We could train our tokenizer right now, but it wouldn't be optimal. Without a pre-tokenizer that will split our inputs into words, we might get tokens that overlap several words: for instance we could get an :obj:`"it is"` token since those two words often appear next to each other. Using a pre-tokenizer will ensure no token is bigger than a word returned by the pre-tokenizer. Here we want to train a subword BPE tokenizer, and we will use the easiest pre-tokenizer possible by splitting on whitespace. .. only:: python .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py :language: python :start-after: START init_pretok :end-before: END init_pretok :dedent: 8 .. only:: rust .. literalinclude:: ../../tokenizers/tests/documentation.rs :language: rust :start-after: START quicktour_init_pretok :end-before: END quicktour_init_pretok :dedent: 4 .. only:: node .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts :language: javascript :start-after: START init_pretok :end-before: END init_pretok :dedent: 8 Now, we can just call the :entity:`Tokenizer.train` method with any list of files we want to use: .. only:: python .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py :language: python :start-after: START train :end-before: END train :dedent: 8 .. only:: rust .. literalinclude:: ../../tokenizers/tests/documentation.rs :language: rust :start-after: START quicktour_train :end-before: END quicktour_train :dedent: 4 .. only:: node .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts :language: javascript :start-after: START train :end-before: END train :dedent: 8 This should only take a few seconds to train our tokenizer on the full wikitext dataset! To save the tokenizer in one file that contains all its configuration and vocabulary, just use the :entity:`Tokenizer.save` method: .. only:: python .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py :language: python :start-after: START save :end-before: END save :dedent: 8 .. only:: rust .. literalinclude:: ../../tokenizers/tests/documentation.rs :language: rust :start-after: START quicktour_save :end-before: END quicktour_save :dedent: 4 .. only:: node .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts :language: javascript :start-after: START save :end-before: END save :dedent: 8 and you can reload your tokenizer from that file with the :entity:`Tokenizer.from_file` :entity:`classmethod`: .. only:: python .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py :language: python :start-after: START reload_tokenizer :end-before: END reload_tokenizer :dedent: 12 .. only:: rust .. literalinclude:: ../../tokenizers/tests/documentation.rs :language: rust :start-after: START quicktour_reload_tokenizer :end-before: END quicktour_reload_tokenizer :dedent: 4 .. only:: node .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts :language: javascript :start-after: START reload_tokenizer :end-before: END reload_tokenizer :dedent: 8 Using the tokenizer ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Now that we have trained a tokenizer, we can use it on any text we want with the :entity:`Tokenizer.encode` method: .. only:: python .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py :language: python :start-after: START encode :end-before: END encode :dedent: 8 .. only:: rust .. literalinclude:: ../../tokenizers/tests/documentation.rs :language: rust :start-after: START quicktour_encode :end-before: END quicktour_encode :dedent: 4 .. only:: node .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts :language: javascript :start-after: START encode :end-before: END encode :dedent: 8 This applied the full pipeline of the tokenizer on the text, returning an :entity:`Encoding` object. To learn more about this pipeline, and how to apply (or customize) parts of it, check out :doc:`this page `. This :entity:`Encoding` object then has all the attributes you need for your deep learning model (or other). The :obj:`tokens` attribute contains the segmentation of your text in tokens: .. only:: python .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py :language: python :start-after: START print_tokens :end-before: END print_tokens :dedent: 8 .. only:: rust .. literalinclude:: ../../tokenizers/tests/documentation.rs :language: rust :start-after: START quicktour_print_tokens :end-before: END quicktour_print_tokens :dedent: 4 .. only:: node .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts :language: javascript :start-after: START print_tokens :end-before: END print_tokens :dedent: 8 Similarly, the :obj:`ids` attribute will contain the index of each of those tokens in the tokenizer's vocabulary: .. only:: python .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py :language: python :start-after: START print_ids :end-before: END print_ids :dedent: 8 .. only:: rust .. literalinclude:: ../../tokenizers/tests/documentation.rs :language: rust :start-after: START quicktour_print_ids :end-before: END quicktour_print_ids :dedent: 4 .. only:: node .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts :language: javascript :start-after: START print_ids :end-before: END print_ids :dedent: 8 An important feature of the 🤗 Tokenizers library is that it comes with full alignment tracking, meaning you can always get the part of your original sentence that corresponds to a given token. Those are stored in the :obj:`offsets` attribute of our :entity:`Encoding` object. For instance, let's assume we would want to find back what caused the :obj:`"[UNK]"` token to appear, which is the token at index 9 in the list, we can just ask for the offset at the index: .. only:: python .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py :language: python :start-after: START print_offsets :end-before: END print_offsets :dedent: 8 .. only:: rust .. literalinclude:: ../../tokenizers/tests/documentation.rs :language: rust :start-after: START quicktour_print_offsets :end-before: END quicktour_print_offsets :dedent: 4 .. only:: node .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts :language: javascript :start-after: START print_offsets :end-before: END print_offsets :dedent: 8 and those are the indices that correspond to the emoji in the original sentence: .. only:: python .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py :language: python :start-after: START use_offsets :end-before: END use_offsets :dedent: 8 .. only:: rust .. literalinclude:: ../../tokenizers/tests/documentation.rs :language: rust :start-after: START quicktour_use_offsets :end-before: END quicktour_use_offsets :dedent: 4 .. only:: node .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts :language: javascript :start-after: START use_offsets :end-before: END use_offsets :dedent: 8 Post-processing ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We might want our tokenizer to automatically add special tokens, like :obj:`"[CLS]"` or :obj:`"[SEP]"`. To do this, we use a post-processor. :entity:`TemplateProcessing` is the most commonly used, you just have to specify a template for the processing of single sentences and pairs of sentences, along with the special tokens and their IDs. When we built our tokenizer, we set :obj:`"[CLS]"` and :obj:`"[SEP]"` in positions 1 and 2 of our list of special tokens, so this should be their IDs. To double-check, we can use the :entity:`Tokenizer.token_to_id` method: .. only:: python .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py :language: python :start-after: START check_sep :end-before: END check_sep :dedent: 8 .. only:: rust .. literalinclude:: ../../tokenizers/tests/documentation.rs :language: rust :start-after: START quicktour_check_sep :end-before: END quicktour_check_sep :dedent: 4 .. only:: node .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts :language: javascript :start-after: START check_sep :end-before: END check_sep :dedent: 8 Here is how we can set the post-processing to give us the traditional BERT inputs: .. only:: python .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py :language: python :start-after: START init_template_processing :end-before: END init_template_processing :dedent: 8 .. only:: rust .. literalinclude:: ../../tokenizers/tests/documentation.rs :language: rust :start-after: START quicktour_init_template_processing :end-before: END quicktour_init_template_processing :dedent: 4 .. only:: node .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts :language: javascript :start-after: START init_template_processing :end-before: END init_template_processing :dedent: 8 Let's go over this snippet of code in more details. First we specify the template for single sentences: those should have the form :obj:`"[CLS] $A [SEP]"` where :obj:`$A` represents our sentence. Then, we specify the template for sentence pairs, which should have the form :obj:`"[CLS] $A [SEP] $B [SEP]"` where :obj:`$A` represents the first sentence and :obj:`$B` the second one. The :obj:`:1` added in the template represent the `type IDs` we want for each part of our input: it defaults to 0 for everything (which is why we don't have :obj:`$A:0`) and here we set it to 1 for the tokens of the second sentence and the last :obj:`"[SEP]"` token. Lastly, we specify the special tokens we used and their IDs in our tokenizer's vocabulary. To check out this worked properly, let's try to encode the same sentence as before: .. only:: python .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py :language: python :start-after: START print_special_tokens :end-before: END print_special_tokens :dedent: 8 .. only:: rust .. literalinclude:: ../../tokenizers/tests/documentation.rs :language: rust :start-after: START quicktour_print_special_tokens :end-before: END quicktour_print_special_tokens :dedent: 4 .. only:: node .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts :language: javascript :start-after: START print_special_tokens :end-before: END print_special_tokens :dedent: 8 To check the results on a pair of sentences, we just pass the two sentences to :entity:`Tokenizer.encode`: .. only:: python .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py :language: python :start-after: START print_special_tokens_pair :end-before: END print_special_tokens_pair :dedent: 8 .. only:: rust .. literalinclude:: ../../tokenizers/tests/documentation.rs :language: rust :start-after: START quicktour_print_special_tokens_pair :end-before: END quicktour_print_special_tokens_pair :dedent: 4 .. only:: node .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts :language: javascript :start-after: START print_special_tokens_pair :end-before: END print_special_tokens_pair :dedent: 8 You can then check the type IDs attributed to each token is correct with .. only:: python .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py :language: python :start-after: START print_type_ids :end-before: END print_type_ids :dedent: 8 .. only:: rust .. literalinclude:: ../../tokenizers/tests/documentation.rs :language: rust :start-after: START quicktour_print_type_ids :end-before: END quicktour_print_type_ids :dedent: 4 .. only:: node .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts :language: javascript :start-after: START print_type_ids :end-before: END print_type_ids :dedent: 8 If you save your tokenizer with :entity:`Tokenizer.save`, the post-processor will be saved along. Encoding multiple sentences in a batch ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To get the full speed of the 🤗 Tokenizers library, it's best to process your texts by batches by using the :entity:`Tokenizer.encode_batch` method: .. only:: python .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py :language: python :start-after: START encode_batch :end-before: END encode_batch :dedent: 8 .. only:: rust .. literalinclude:: ../../tokenizers/tests/documentation.rs :language: rust :start-after: START quicktour_encode_batch :end-before: END quicktour_encode_batch :dedent: 4 .. only:: node .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts :language: javascript :start-after: START encode_batch :end-before: END encode_batch :dedent: 8 The output is then a list of :entity:`Encoding` objects like the ones we saw before. You can process together as many texts as you like, as long as it fits in memory. To process a batch of sentences pairs, pass two lists to the :entity:`Tokenizer.encode_batch` method: the list of sentences A and the list of sentences B: .. only:: python .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py :language: python :start-after: START encode_batch_pair :end-before: END encode_batch_pair :dedent: 8 .. only:: rust .. literalinclude:: ../../tokenizers/tests/documentation.rs :language: rust :start-after: START quicktour_encode_batch_pair :end-before: END quicktour_encode_batch_pair :dedent: 4 .. only:: node .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts :language: javascript :start-after: START encode_batch_pair :end-before: END encode_batch_pair :dedent: 8 When encoding multiple sentences, you can automatically pad the outputs to the longest sentence present by using :entity:`Tokenizer.enable_padding`, with the :entity:`pad_token` and its ID (which we can double-check the id for the padding token with :entity:`Tokenizer.token_to_id` like before): .. only:: python .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py :language: python :start-after: START enable_padding :end-before: END enable_padding :dedent: 8 .. only:: rust .. literalinclude:: ../../tokenizers/tests/documentation.rs :language: rust :start-after: START quicktour_enable_padding :end-before: END quicktour_enable_padding :dedent: 4 .. only:: node .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts :language: javascript :start-after: START enable_padding :end-before: END enable_padding :dedent: 8 We can set the :obj:`direction` of the padding (defaults to the right) or a given :obj:`length` if we want to pad every sample to that specific number (here we leave it unset to pad to the size of the longest text). .. only:: python .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py :language: python :start-after: START print_batch_tokens :end-before: END print_batch_tokens :dedent: 8 .. only:: rust .. literalinclude:: ../../tokenizers/tests/documentation.rs :language: rust :start-after: START quicktour_print_batch_tokens :end-before: END quicktour_print_batch_tokens :dedent: 4 .. only:: node .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts :language: javascript :start-after: START print_batch_tokens :end-before: END print_batch_tokens :dedent: 8 In this case, the `attention mask` generated by the tokenizer takes the padding into account: .. only:: python .. literalinclude:: ../../bindings/python/tests/documentation/test_quicktour.py :language: python :start-after: START print_attention_mask :end-before: END print_attention_mask :dedent: 8 .. only:: rust .. literalinclude:: ../../tokenizers/tests/documentation.rs :language: rust :start-after: START quicktour_print_attention_mask :end-before: END quicktour_print_attention_mask :dedent: 4 .. only:: node .. literalinclude:: ../../bindings/node/examples/documentation/quicktour.test.ts :language: javascript :start-after: START print_attention_mask :end-before: END print_attention_mask :dedent: 8 .. _pretrained: .. only:: python Using a pretrained tokenizer ------------------------------------------------------------------------------------------------ You can load any tokenizer from the Hugging Face Hub as long as a `tokenizer.json` file is available in the repository. .. code-block:: python from tokenizers import Tokenizer tokenizer = Tokenizer.from_pretrained("bert-base-uncased") Importing a pretrained tokenizer from legacy vocabulary files ------------------------------------------------------------------------------------------------ You can also import a pretrained tokenizer directly in, as long as you have its vocabulary file. For instance, here is how to import the classic pretrained BERT tokenizer: .. code-block:: python from tokenizers import BertWordPieceTokenizer tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True) as long as you have downloaded the file `bert-base-uncased-vocab.txt` with .. code-block:: bash wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt