Input sequences

These types represent all the different kinds of sequences that can be used as input to a Tokenizer. Broadly speaking, any sequence is either a string or a list of strings, depending on the operating mode of the tokenizer: raw text vs. pre-tokenized.

tokenizers.TextInputSequence = <class 'str'>

A str that represents an input sequence

tokenizers.PreTokenizedInputSequence = typing.Union[typing.List[str], typing.Tuple[str]]

A pre-tokenized input sequence. Can be one of:

  • A List of str

  • A Tuple of str

tokenizers.InputSequence = typing.Union[str, typing.List[str], typing.Tuple[str]]

Represents all the possible types of input sequences for encoding. Can be:

  • A TextInputSequence when is_pretokenized=False

  • A PreTokenizedInputSequence when is_pretokenized=True

Encode inputs

These types represent all the different kinds of input that a Tokenizer accepts when using encode_batch().

tokenizers.TextEncodeInput = typing.Union[str, typing.Tuple[str, str], typing.List[str]]

Represents a textual input for encoding. Can be either:

  • A single sequence: a TextInputSequence

  • A pair of sequences: a Tuple of two TextInputSequence, or a List of TextInputSequence of size 2

tokenizers.PreTokenizedEncodeInput = typing.Union[typing.List[str], typing.Tuple[str], typing.Tuple[typing.Union[typing.List[str], typing.Tuple[str]], typing.Union[typing.List[str], typing.Tuple[str]]], typing.List[typing.Union[typing.List[str], typing.Tuple[str]]]]

Represents a pre-tokenized input for encoding. Can be either:

  • A single sequence: a PreTokenizedInputSequence

  • A pair of sequences: a Tuple of two PreTokenizedInputSequence, or a List of PreTokenizedInputSequence of size 2

tokenizers.EncodeInput = typing.Union[str, typing.Tuple[str, str], typing.List[str], typing.Tuple[str], typing.Tuple[typing.Union[typing.List[str], typing.Tuple[str]], typing.Union[typing.List[str], typing.Tuple[str]]], typing.List[typing.Union[typing.List[str], typing.Tuple[str]]]]

Represents all the possible types of input for encoding. Can be:

  • A TextEncodeInput when is_pretokenized=False

  • A PreTokenizedEncodeInput when is_pretokenized=True
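
For illustration, each of the following values is a valid EncodeInput; the pre-tokenized forms must be passed with is_pretokenized=True:

# Raw text inputs (TextEncodeInput)
single = "A single sequence"
pair = ("A sequence", "And its pair")

# Pre-tokenized inputs (PreTokenizedEncodeInput)
pretokenized = ["A", "pre", "tokenized", "sequence"]
pretokenized_pair = (["A", "pre", "tokenized", "sequence"], ["And", "its", "pair"])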

Tokenizer

class tokenizers.Tokenizer(self, model)

A Tokenizer works as a pipeline. It processes some raw text as input and outputs an Encoding.

Parameters

model (Model) – The core algorithm that this Tokenizer should be using.

add_special_tokens(tokens)

Add the given special tokens to the Tokenizer.

If these tokens are already part of the vocabulary, this just lets the Tokenizer know about them. If they don’t exist, the Tokenizer creates them, giving them a new id.

These special tokens will never be processed by the model (i.e. they won’t be split into multiple tokens), and they can be removed from the output when decoding.

Parameters

tokens (A List of AddedToken or str) – The list of special tokens we want to add to the vocabulary. Each token can either be a string or an instance of AddedToken for more customization.

Returns

The number of tokens that were created in the vocabulary

Return type

int
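
A minimal sketch of adding special tokens; the WordLevel vocabulary below is purely illustrative:

from tokenizers import AddedToken, Tokenizer
from tokenizers.models import WordLevel

tokenizer = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1, "world": 2}, unk_token="[UNK]"))

# Plain strings and AddedToken instances can be mixed
num_added = tokenizer.add_special_tokens(["[CLS]", "[SEP]", AddedToken("[MASK]", lstrip=True)])
print(num_added)  # 3: three new ids were created in the vocabulary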

add_tokens(tokens)

Add the given tokens to the vocabulary

The given tokens are added only if they don’t already exist in the vocabulary. Each token then gets a newly attributed id.

Parameters

tokens (A List of AddedToken or str) – The list of tokens we want to add to the vocabulary. Each token can be either a string or an instance of AddedToken for more customization.

Returns

The number of tokens that were created in the vocabulary

Return type

int
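
A similar sketch for regular tokens, again with an illustrative vocabulary:

from tokenizers import AddedToken, Tokenizer
from tokenizers.models import WordLevel

tokenizer = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1}, unk_token="[UNK]"))

num_added = tokenizer.add_tokens(["tokenization", AddedToken("rust", single_word=True)])
print(num_added)  # 2: both tokens were new, so two ids were created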

decode(ids, skip_special_tokens=True)

Decode the given list of ids back to a string

This is used to decode anything coming back from a Language Model

Parameters
  • ids (A List/Tuple of int) – The list of ids that we want to decode

  • skip_special_tokens (bool, defaults to True) – Whether the special tokens should be removed from the decoded string

Returns

The decoded string

Return type

str
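
A minimal encode/decode round trip, assuming an illustrative WordLevel vocabulary and a Whitespace pre-tokenizer:

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "hello": 3, "world": 4}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

encoding = tokenizer.encode("hello world")
print(encoding.ids)                    # [3, 4]
print(tokenizer.decode(encoding.ids))  # "hello world"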

decode_batch(sequences, skip_special_tokens=True)

Decode a batch of ids back to their corresponding strings

Parameters
  • sequences (List of List[int]) – The batch of sequences we want to decode

  • skip_special_tokens (bool, defaults to True) – Whether the special tokens should be removed from the decoded strings

Returns

A list of decoded strings

Return type

List[str]

decoder

The optional Decoder in use by the Tokenizer

enable_padding(direction='right', pad_id=0, pad_type_id=0, pad_token='[PAD]', length=None, pad_to_multiple_of=None)

Enable the padding

Parameters
  • direction (str, optional, defaults to right) – The direction in which to pad. Can be either right or left

  • pad_to_multiple_of (int, optional) – If specified, the padding length should always snap to the next multiple of the given value. For example, if we were going to pad with a length of 250 but pad_to_multiple_of=8, then we will pad to 256.

  • pad_id (int, defaults to 0) – The id to be used when padding

  • pad_type_id (int, defaults to 0) – The type id to be used when padding

  • pad_token (str, defaults to [PAD]) – The pad token to be used when padding

  • length (int, optional) – If specified, the length at which to pad. If not specified we pad using the size of the longest sequence in a batch.
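
A sketch of padding a batch to the length of its longest sequence; the vocabulary and pad token are illustrative:

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

vocab = {"[UNK]": 0, "[PAD]": 1, "short": 2, "a": 3, "longer": 4, "sequence": 5}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# No length given: pad to the longest sequence in each batch
tokenizer.enable_padding(pad_id=1, pad_token="[PAD]")
output = tokenizer.encode_batch(["short", "a longer sequence"])
print(output[0].ids)             # [2, 1, 1]
print(output[0].attention_mask)  # [1, 0, 0]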

enable_truncation(max_length, stride=0, strategy='longest_first')

Enable truncation

Parameters
  • max_length (int) – The max length at which to truncate

  • stride (int, optional) – The length of previous content to be included in each overflowing piece

  • strategy (str, optional, defaults to longest_first) – The strategy used for truncation. Can be one of longest_first, only_first or only_second.
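
A sketch of truncation with overflow, using an illustrative vocabulary:

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

vocab = {"[UNK]": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

tokenizer.enable_truncation(max_length=2, stride=1)
encoding = tokenizer.encode("one two three four five")
print(encoding.tokens)  # ['one', 'two']
# The remaining pieces, each overlapping the previous one by one token,
# are available through encoding.overflowing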

encode(sequence, pair=None, is_pretokenized=False, add_special_tokens=True)

Encode the given sequence and pair. This method can process raw text sequences as well as already pre-tokenized sequences.

Example

Here are some examples of the inputs that are accepted:

encode("A single sequence")
encode("A sequence", "And its pair")
encode([ "A", "pre", "tokenized", "sequence" ], is_pretokenized=True)
encode(
    [ "A", "pre", "tokenized", "sequence" ], [ "And", "its", "pair" ],
    is_pretokenized=True
)
Parameters
  • sequence (InputSequence) –

    The main input sequence we want to encode. This sequence can be either raw text or pre-tokenized, according to the is_pretokenized argument:

      • If is_pretokenized=False: a TextInputSequence

      • If is_pretokenized=True: a PreTokenizedInputSequence

  • pair (InputSequence, optional) – An optional input sequence. The expected format is the same as for sequence.

  • is_pretokenized (bool, defaults to False) – Whether the input is already pre-tokenized

  • add_special_tokens (bool, defaults to True) – Whether to add the special tokens

Returns

The encoded result

Return type

Encoding
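
The returned Encoding can then be inspected; a minimal sketch with an illustrative vocabulary:

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1, "world": 2}, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

encoding = tokenizer.encode("hello world")
print(encoding.tokens)   # ['hello', 'world']
print(encoding.ids)      # [1, 2]
print(encoding.offsets)  # [(0, 5), (6, 11)]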

encode_batch(input, is_pretokenized=False, add_special_tokens=True)

Encode the given batch of inputs. This method accepts both raw text sequences and already pre-tokenized sequences.

Example

Here are some examples of the inputs that are accepted:

encode_batch([
    "A single sequence",
    ("A tuple with a sequence", "And its pair"),
    [ "A", "pre", "tokenized", "sequence" ],
    ([ "A", "pre", "tokenized", "sequence" ], "And its pair")
])
Parameters
  • input (A List/Tuple of EncodeInput) –

    A list of single sequences or pair sequences to encode. Each sequence can be either raw text or pre-tokenized, according to the is_pretokenized argument:

      • If is_pretokenized=False: a TextEncodeInput

      • If is_pretokenized=True: a PreTokenizedEncodeInput

  • is_pretokenized (bool, defaults to False) – Whether the input is already pre-tokenized

  • add_special_tokens (bool, defaults to True) – Whether to add the special tokens

Returns

The encoded batch

Return type

A List of Encoding

static from_buffer(buffer)

Instantiate a new Tokenizer from the given buffer.

Parameters

buffer (bytes) – A buffer containing a previously serialized Tokenizer

Returns

The new tokenizer

Return type

Tokenizer

static from_file(path)

Instantiate a new Tokenizer from the file at the given path.

Parameters

path (str) – A path to a local JSON file representing a previously serialized Tokenizer

Returns

The new tokenizer

Return type

Tokenizer

static from_str(json)

Instantiate a new Tokenizer from the given JSON string.

Parameters

json (str) – A valid JSON string representing a previously serialized Tokenizer

Returns

The new tokenizer

Return type

Tokenizer
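
A sketch of a serialization round trip; the file name is just an example:

from tokenizers import Tokenizer
from tokenizers.models import WordLevel

tokenizer = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1}, unk_token="[UNK]"))

# Round trip through a file
tokenizer.save("tokenizer.json")
reloaded = Tokenizer.from_file("tokenizer.json")

# Round trip through an in-memory JSON string
same = Tokenizer.from_str(tokenizer.to_str())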

get_vocab(with_added_tokens=True)

Get the underlying vocabulary

Parameters

with_added_tokens (bool, defaults to True) – Whether to include the added tokens

Returns

The vocabulary

Return type

Dict[str, int]

get_vocab_size(with_added_tokens=True)

Get the size of the underlying vocabulary

Parameters

with_added_tokens (bool, defaults to True) – Whether to include the added tokens

Returns

The size of the vocabulary

Return type

int
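
A short sketch contrasting get_vocab() and get_vocab_size(), with an illustrative vocabulary:

from tokenizers import Tokenizer
from tokenizers.models import WordLevel

tokenizer = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1}, unk_token="[UNK]"))
tokenizer.add_tokens(["world"])

print(tokenizer.get_vocab())                              # {'[UNK]': 0, 'hello': 1, 'world': 2} (order may vary)
print(tokenizer.get_vocab_size())                         # 3
print(tokenizer.get_vocab_size(with_added_tokens=False))  # 2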

id_to_token(id)

Convert the given id to its corresponding token if it exists

Parameters

id (int) – The id to convert

Returns

An optional token, None if out of vocabulary

Return type

Optional[str]
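
A sketch of id/token lookup in both directions; out-of-vocabulary inputs yield None:

from tokenizers import Tokenizer
from tokenizers.models import WordLevel

tokenizer = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1}, unk_token="[UNK]"))

print(tokenizer.id_to_token(1))        # 'hello'
print(tokenizer.id_to_token(42))       # None
print(tokenizer.token_to_id("hello"))  # 1
print(tokenizer.token_to_id("xyz"))    # None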

model

The Model in use by the Tokenizer

no_padding()

Disable padding

no_truncation()

Disable truncation

normalizer

The optional Normalizer in use by the Tokenizer

num_special_tokens_to_add(is_pair)

Return the number of special tokens that would be added for single/pair sentences.

Parameters

is_pair (bool) – Whether the input is a pair of sequences rather than a single sequence

Returns

The number of special tokens that would be added

Return type

int

padding

Get the current padding parameters

Cannot be set, use enable_padding() instead

Returns

A dict with the current padding parameters if padding is enabled

Return type

(dict, optional)

post_process(encoding, pair=None, add_special_tokens=True)

Apply all the post-processing steps to the given encodings.

The various steps are:

  1. Truncate according to the set truncation params (provided with enable_truncation())

  2. Apply the PostProcessor

  3. Pad according to the set padding params (provided with enable_padding())

Parameters
  • encoding (Encoding) – The Encoding corresponding to the main sequence.

  • pair (Encoding, optional) – An optional Encoding corresponding to the pair sequence.

  • add_special_tokens (bool) – Whether to add the special tokens

Returns

The final post-processed encoding

Return type

Encoding

post_processor

The optional PostProcessor in use by the Tokenizer

pre_tokenizer

The optional PreTokenizer in use by the Tokenizer

save(path, pretty=False)

Save the Tokenizer to the file at the given path.

Parameters
  • path (str) – A path to a file in which to save the serialized tokenizer.

  • pretty (bool, defaults to False) – Whether the JSON file should be pretty formatted.

to_str(pretty=False)

Gets a serialized string representing this Tokenizer.

Parameters

pretty (bool, defaults to False) – Whether the JSON string should be pretty formatted.

Returns

A string representing the serialized Tokenizer

Return type

str

token_to_id(token)

Convert the given token to its corresponding id if it exists

Parameters

token (str) – The token to convert

Returns

An optional id, None if out of vocabulary

Return type

Optional[int]

truncation

Get the currently set truncation parameters

Cannot be set, use enable_truncation() instead

Returns

A dict with the current truncation parameters if truncation is enabled

Return type

(dict, optional)

Encoding

class tokenizers.Encoding

The Encoding represents the output of a Tokenizer.

attention_mask

The attention mask

This indicates to the LM which tokens should be attended to, and which should not. This is especially important when batching sequences, where we need to apply padding.

Returns

The attention mask

Return type

List[int]

char_to_token(char_pos, sequence_index=0)

Get the token that contains the char at the given position in the input sequence.

Parameters
  • char_pos (int) – The position of a char in the input string

  • sequence_index (int, defaults to 0) – The index of the sequence that contains the target char

Returns

The index of the token that contains this char in the encoded sequence

Return type

int

char_to_word(char_pos, sequence_index=0)

Get the word that contains the char at the given position in the input sequence.

Parameters
  • char_pos (int) – The position of a char in the input string

  • sequence_index (int, defaults to 0) – The index of the sequence that contains the target char

Returns

The index of the word that contains this char in the input sequence

Return type

int
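
A sketch mapping a character position back to its token and word; the vocabulary is illustrative:

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1, "world": 2}, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

encoding = tokenizer.encode("hello world")
print(encoding.char_to_token(7))  # 1: the char at position 7 (the "o" of "world") is in token 1
print(encoding.char_to_word(7))   # 1: and in the second word of the input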

ids

The generated IDs

The IDs are the main input to a Language Model. They are the token indices, the numerical representations that a LM understands.

Returns

The list of IDs

Return type

List[int]

static merge(encodings, growing_offsets=True)

Merge the list of encodings into one final Encoding

Parameters
  • encodings (A List of Encoding) – The list of encodings that should be merged in one

  • growing_offsets (bool, defaults to True) – Whether the offsets should accumulate while merging

Returns

The resulting Encoding

Return type

Encoding
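
A sketch of merging two encodings; the vocabulary is illustrative:

from tokenizers import Encoding, Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1, "world": 2}, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

parts = tokenizer.encode_batch(["hello", "world"])
merged = Encoding.merge(parts)
print(merged.tokens)  # ['hello', 'world']
# With growing_offsets=True (the default), the offsets of later pieces keep
# increasing instead of restarting at 0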

n_sequences

The number of sequences represented

Returns

The number of sequences in this Encoding

Return type

int

offsets

The offsets associated to each token

These offsets let you slice the input string, and thus retrieve the part of the original text that produced the corresponding token.

Returns

The list of offsets

Return type

A List of Tuple[int, int]

overflowing

A List of overflowing Encoding

When using truncation, the Tokenizer takes care of splitting the output into as many pieces as required to match the specified maximum length. This field lets you retrieve all the subsequent pieces.

When you use pairs of sequences, the overflowing pieces will contain enough variations to cover all the possible combinations, while respecting the provided maximum length.

pad(length, direction='right', pad_id=0, pad_type_id=0, pad_token='[PAD]')

Pad the Encoding at the given length

Parameters
  • length (int) – The desired length

  • direction (str, defaults to right) – The expected padding direction. Can be either right or left

  • pad_id (int, defaults to 0) – The ID corresponding to the padding token

  • pad_type_id (int, defaults to 0) – The type ID corresponding to the padding token

  • pad_token (str, defaults to [PAD]) – The pad token to use

sequence_ids

The generated sequence indices.

They represent the index of the input sequence associated to each token. The sequence id can be None if the token is not related to any input sequence, like for example with special tokens.

Returns

A list of optional sequence index.

Return type

A List of Optional[int]

set_sequence_id(sequence_id)

Set the given sequence index

Set the given sequence index for the whole range of tokens contained in this Encoding.

special_tokens_mask

The special token mask

This indicates which tokens are special tokens, and which are not.

Returns

The special tokens mask

Return type

List[int]

token_to_chars(token_index)

Get the offsets of the token at the given index.

The returned offsets are related to the input sequence that contains the token. In order to determine which input sequence it belongs to, you must call token_to_sequence().

Parameters

token_index (int) – The index of a token in the encoded sequence.

Returns

The token offsets (first, last + 1)

Return type

Tuple[int, int]

token_to_sequence(token_index)

Get the index of the sequence represented by the given token.

In the general use case, this method returns 0 for a single sequence or the first sequence of a pair, and 1 for the second sequence of a pair

Parameters

token_index (int) – The index of a token in the encoded sequence.

Returns

The sequence id of the given token

Return type

int

token_to_word(token_index)

Get the index of the word that contains the token in one of the input sequences.

The returned word index is related to the input sequence that contains the token. In order to determine which input sequence it belongs to, you must call token_to_sequence().

Parameters

token_index (int) – The index of a token in the encoded sequence.

Returns

The index of the word in the relevant input sequence.

Return type

int
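
A sketch with a pair of sequences, showing how a token maps back to its sequence, character span and word; the vocabulary is illustrative:

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

vocab = {"[UNK]": 0, "hello": 1, "world": 2, "again": 3}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

encoding = tokenizer.encode("hello world", "hello again")
print(encoding.tokens)                # ['hello', 'world', 'hello', 'again']
print(encoding.token_to_sequence(2))  # 1: the third token comes from the pair sequence
print(encoding.token_to_chars(2))     # (0, 5): its offsets within "hello again"
print(encoding.token_to_word(2))      # 0: the first word of that sequence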

tokens

The generated tokens

They are the string representation of the IDs.

Returns

The list of tokens

Return type

List[str]

truncate(max_length, stride=0)

Truncate the Encoding at the given length

If this Encoding represents multiple sequences, this information is lost when truncating: the result will be considered as representing a single sequence.

Parameters
  • max_length (int) – The desired length

  • stride (int, defaults to 0) – The length of previous content to be included in each overflowing piece

type_ids

The generated type IDs

Generally used for tasks like sequence classification or question answering, these type IDs let the LM know which input sequence each token corresponds to.

Returns

The list of type ids

Return type

List[int]

word_ids

The generated word indices.

They represent the index of the word associated with each token. When the input is pre-tokenized, they correspond to the ID of the given input label; otherwise they correspond to the word indices as defined by the PreTokenizer that was used.

For special tokens and such (any token that was generated from something that was not part of the input), the output is None

Returns

A list of optional word index.

Return type

A List of Optional[int]

word_to_chars(word_index, sequence_index=0)

Get the offsets of the word at the given index in one of the input sequences.

Parameters
  • word_index (int) – The index of a word in one of the input sequences.

  • sequence_index (int, defaults to 0) – The index of the sequence that contains the target word

Returns

The range of characters (span) (first, last + 1)

Return type

Tuple[int, int]

word_to_tokens(word_index, sequence_index=0)

Get the encoded tokens corresponding to the word at the given index in one of the input sequences.

Parameters
  • word_index (int) – The index of a word in one of the input sequences.

  • sequence_index (int, defaults to 0) – The index of the sequence that contains the target word

Returns

The range of tokens: (first, last + 1)

Return type

Tuple[int, int]
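
A sketch going the other way, from a word to its tokens and character span; the vocabulary is illustrative:

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1, "world": 2}, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

encoding = tokenizer.encode("hello world")
print(encoding.word_ids)           # [0, 1]
print(encoding.word_to_tokens(1))  # (1, 2): the word "world" spans tokens [1, 2)
print(encoding.word_to_chars(1))   # (6, 11): its character span in the input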

words

The generated word indices.

Warning

This is deprecated and will be removed in a future version. Please use word_ids instead.

They represent the index of the word associated with each token. When the input is pre-tokenized, they correspond to the ID of the given input label; otherwise they correspond to the word indices as defined by the PreTokenizer that was used.

For special tokens and such (any token that was generated from something that was not part of the input), the output is None

Returns

A list of optional word index.

Return type

A List of Optional[int]

Added Tokens

class tokenizers.AddedToken(self, content, single_word=False, lstrip=False, rstrip=False, normalized=True)

Represents a token that can be added to a Tokenizer. It can have special options that define the way it should behave.

Parameters
  • content (str) – The content of the token

  • single_word (bool, defaults to False) – Defines whether this token should only match single words. If True, this token will never match inside of a word. For example the token ing would match on tokenizing if this option is False, but not if it is True. The notion of “inside of a word” is defined by the word boundaries pattern in regular expressions (ie. the token should start and end with word boundaries).

  • lstrip (bool, defaults to False) – Defines whether this token should strip all potential whitespaces on its left side. If True, this token will greedily match any whitespace on its left. For example if we try to match the token [MASK] with lstrip=True, in the text "I saw a [MASK]", we would match on " [MASK]". (Note the space on the left).

  • rstrip (bool, defaults to False) – Defines whether this token should strip all potential whitespaces on its right side. If True, this token will greedily match any whitespace on its right. It works just like lstrip but on the right.

  • normalized (bool, defaults to True with add_tokens() and False with add_special_tokens()) – Defines whether this token should match against the normalized version of the input text. For example, with the added token "yesterday", and a normalizer in charge of lowercasing the text, the token could be extracted from the input "I saw a lion Yesterday".
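
A sketch combining these options; the tokens and vocabulary are illustrative:

from tokenizers import AddedToken, Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordLevel({"[UNK]": 0, "I": 1, "saw": 2, "a": 3}, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# A special token that also swallows the whitespace on its left
mask = AddedToken("[MASK]", lstrip=True)
tokenizer.add_special_tokens([mask])

print(tokenizer.encode("I saw a [MASK]").tokens)  # ['I', 'saw', 'a', '[MASK]']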

content

Get the content of this AddedToken

lstrip

Get the value of the lstrip option

normalized

Get the value of the normalized option

rstrip

Get the value of the rstrip option

single_word

Get the value of the single_word option