Input sequences

These types represent all the different kinds of sequence that can be used as input of a Tokenizer. Globally, any sequence can be either a string or a list of strings, according to the operating mode of the tokenizer: raw text vs pre-tokenized.

tokenizers.TextInputSequence = <class 'str'>

    A str that represents an input sequence.

tokenizers.PreTokenizedInputSequence = typing.Union[typing.List[str], typing.Tuple[str]]

    A pre-tokenized input sequence. Can be one of:

    - A List of str
    - A Tuple of str

tokenizers.InputSequence = typing.Union[str, typing.List[str], typing.Tuple[str]]

    Represents all the possible types of input sequences for encoding. Can be:

    - When is_pretokenized=False: TextInputSequence
    - When is_pretokenized=True: PreTokenizedInputSequence
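
As a quick illustration, here is a minimal sketch of both operating modes. It assumes a trained tokenizer has already been serialized to a local tokenizer.json file (the file name is hypothetical):

    from tokenizers import Tokenizer

    tokenizer = Tokenizer.from_file("tokenizer.json")

    # Raw text mode: the input is a TextInputSequence
    tokenizer.encode("Hello world")

    # Pre-tokenized mode: the input is a PreTokenizedInputSequence
    tokenizer.encode(["Hello", "world"], is_pretokenized=True)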

Encode inputs

These types represent all the different kinds of input that a Tokenizer accepts when using encode_batch().

tokenizers.TextEncodeInput = typing.Union[str, typing.Tuple[str, str], typing.List[str]]

    Represents a textual input for encoding. Can be either:

    - A single sequence: TextInputSequence
    - A pair of sequences:
        - A Tuple of TextInputSequence
        - Or a List of TextInputSequence of size 2

tokenizers.PreTokenizedEncodeInput = typing.Union[typing.List[str], typing.Tuple[str], typing.Tuple[typing.Union[typing.List[str], typing.Tuple[str]], typing.Union[typing.List[str], typing.Tuple[str]]], typing.List[typing.Union[typing.List[str], typing.Tuple[str]]]]

    Represents a pre-tokenized input for encoding. Can be either:

    - A single sequence: PreTokenizedInputSequence
    - A pair of sequences:
        - A Tuple of PreTokenizedInputSequence
        - Or a List of PreTokenizedInputSequence of size 2

tokenizers.EncodeInput = typing.Union[str, typing.Tuple[str, str], typing.List[str], typing.Tuple[str], typing.Tuple[typing.Union[typing.List[str], typing.Tuple[str]], typing.Union[typing.List[str], typing.Tuple[str]]], typing.List[typing.Union[typing.List[str], typing.Tuple[str]]]]

    Represents all the possible types of input for encoding. Can be:

    - When is_pretokenized=False: TextEncodeInput
    - When is_pretokenized=True: PreTokenizedEncodeInput

Tokenizer

class tokenizers.Tokenizer(self, model)

    A Tokenizer works as a pipeline. It processes some raw text as input and outputs an Encoding.

    Parameters:
        model (Model) – The core algorithm that this Tokenizer should be using.
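
For example, a minimal sketch building a Tokenizer around a BPE model (the model is untrained here, so encoding real text would mostly yield unknown tokens until it is trained):

    from tokenizers import Tokenizer
    from tokenizers.models import BPE

    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))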

add_special_tokens(tokens)

    Add the given special tokens to the Tokenizer.

    If these tokens are already part of the vocabulary, it just lets the Tokenizer know about them. If they don't exist, the Tokenizer creates them, giving them a new id.

    These special tokens will never be processed by the model (i.e. they won't be split into multiple tokens), and they can be removed from the output when decoding.

    Parameters:
        tokens (A List of AddedToken or str) – The list of special tokens we want to add to the vocabulary. Each token can either be a string or an instance of AddedToken for more customization.

    Returns:
        The number of tokens that were created in the vocabulary

    Return type:
        int

add_tokens(tokens)

    Add the given tokens to the vocabulary.

    The given tokens are added only if they don't already exist in the vocabulary. Each such token is then assigned a new id.

    Parameters:
        tokens (A List of AddedToken or str) – The list of tokens we want to add to the vocabulary. Each token can be either a string or an instance of AddedToken for more customization.

    Returns:
        The number of tokens that were created in the vocabulary

    Return type:
        int
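
A short sketch of both methods, reusing the tokenizer from the earlier sketch (the token strings are illustrative):

    from tokenizers import AddedToken

    # Regular tokens: plain strings and AddedToken instances can be mixed.
    tokenizer.add_tokens(["my_word", AddedToken("my_term", single_word=True)])

    # Special tokens: never split by the model, removable when decoding.
    tokenizer.add_special_tokens(["[CLS]", "[SEP]"])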

decode(ids, skip_special_tokens=True)

    Decode the given list of ids back to a string.

    This is used to decode anything coming back from a Language Model.

    Parameters:
        ids (A List/Tuple of int) – The list of ids that we want to decode
        skip_special_tokens (bool, defaults to True) – Whether the special tokens should be removed from the decoded string

    Returns:
        The decoded string

    Return type:
        str

decode_batch(sequences, skip_special_tokens=True)

    Decode a batch of ids back to their corresponding string.

    Parameters:
        sequences (List of List[int]) – The batch of sequences we want to decode
        skip_special_tokens (bool, defaults to True) – Whether the special tokens should be removed from the decoded strings

    Returns:
        A list of decoded strings

    Return type:
        List[str]
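
For example, a round-trip sketch with the tokenizer from the earlier sketch:

    encoding = tokenizer.encode("Hello world")
    text = tokenizer.decode(encoding.ids, skip_special_tokens=True)

    encodings = tokenizer.encode_batch(["Hello world", "How are you?"])
    texts = tokenizer.decode_batch([e.ids for e in encodings])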

decoder

    The optional Decoder in use by the Tokenizer

enable_padding(direction='right', pad_id=0, pad_type_id=0, pad_token='[PAD]', length=None, pad_to_multiple_of=None)

    Enable padding.

    Parameters:
        direction (str, optional, defaults to right) – The direction in which to pad. Can be either right or left
        pad_to_multiple_of (int, optional) – If specified, the padding length should always snap to the next multiple of the given value. For example, if we were going to pad with a length of 250 but pad_to_multiple_of=8, then we will pad to 256.
        pad_id (int, defaults to 0) – The id to be used when padding
        pad_type_id (int, defaults to 0) – The type id to be used when padding
        pad_token (str, defaults to [PAD]) – The pad token to be used when padding
        length (int, optional) – If specified, the length at which to pad. If not specified, we pad using the size of the longest sequence in a batch.

enable_truncation(max_length, stride=0, strategy='longest_first')

    Enable truncation.

    Parameters:
        max_length (int) – The max length at which to truncate
        stride (int, optional) – The length of the previous first sequence to be included in the overflowing sequence
        strategy (str, optional, defaults to longest_first) – The strategy used for truncation. Can be one of longest_first, only_first or only_second.
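
A sketch combining the two settings (the values are illustrative):

    # Cap sequences at 512 tokens, carrying 32 tokens of context into each overflowing piece.
    tokenizer.enable_truncation(max_length=512, stride=32)

    # Pad batches on the right, snapping padded lengths to a multiple of 8.
    tokenizer.enable_padding(pad_id=0, pad_token="[PAD]", pad_to_multiple_of=8)

    # Both settings can be reverted later with no_truncation() and no_padding().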

encode(sequence, pair=None, is_pretokenized=False, add_special_tokens=True)

    Encode the given sequence and pair. This method can process raw text sequences as well as already pre-tokenized sequences.

    Example:
        Here are some examples of the inputs that are accepted:

        encode("A single sequence")
        encode("A sequence", "And its pair")
        encode(["A", "pre", "tokenized", "sequence"], is_pretokenized=True)
        encode(
            ["A", "pre", "tokenized", "sequence"],
            ["And", "its", "pair"],
            is_pretokenized=True,
        )

    Parameters:
        sequence (InputSequence) – The main input sequence we want to encode. This sequence can be either raw text or pre-tokenized, according to the is_pretokenized argument:
            - If is_pretokenized=False: TextInputSequence
            - If is_pretokenized=True: PreTokenizedInputSequence
        pair (InputSequence, optional) – An optional input sequence. The expected format is the same as for sequence.
        is_pretokenized (bool, defaults to False) – Whether the input is already pre-tokenized
        add_special_tokens (bool, defaults to True) – Whether to add the special tokens

    Returns:
        The encoded result

    Return type:
        Encoding

encode_batch(input, is_pretokenized=False, add_special_tokens=True)

    Encode the given batch of inputs. This method accepts both raw text sequences and already pre-tokenized sequences.

    Example:
        Here are some examples of the inputs that are accepted:

        encode_batch([
            "A single sequence",
            ("A tuple with a sequence", "And its pair"),
            ["A", "pre", "tokenized", "sequence"],
            (["A", "pre", "tokenized", "sequence"], "And its pair"),
        ])

    Parameters:
        input (A List/Tuple of EncodeInput) – A list of single sequences or pair sequences to encode. Each sequence can be either raw text or pre-tokenized, according to the is_pretokenized argument:
            - If is_pretokenized=False: TextEncodeInput
            - If is_pretokenized=True: PreTokenizedEncodeInput
        is_pretokenized (bool, defaults to False) – Whether the input is already pre-tokenized
        add_special_tokens (bool, defaults to True) – Whether to add the special tokens

    Returns:
        The encoded batch

    Return type:
        A List of Encoding

get_vocab(with_added_tokens=True)

    Get the underlying vocabulary.

    Parameters:
        with_added_tokens (bool, defaults to True) – Whether to include the added tokens

    Returns:
        The vocabulary

    Return type:
        Dict[str, int]

get_vocab_size(with_added_tokens=True)

    Get the size of the underlying vocabulary.

    Parameters:
        with_added_tokens (bool, defaults to True) – Whether to include the added tokens

    Returns:
        The size of the vocabulary

    Return type:
        int

id_to_token(id)

    Convert the given id to its corresponding token if it exists.

    Parameters:
        id (int) – The id to convert

    Returns:
        An optional token, None if out of vocabulary

    Return type:
        Optional[str]

model

    The Model in use by the Tokenizer

no_padding()

    Disable padding

no_truncation()

    Disable truncation

normalizer

    The optional Normalizer in use by the Tokenizer

num_special_tokens_to_add(is_pair)

    Return the number of special tokens that would be added for single/pair sentences.

    Parameters:
        is_pair (bool) – Whether the input would be a pair of sequences

    Returns:
        The number of special tokens that would be added

padding

    Get the current padding parameters.

    Cannot be set; use enable_padding() instead.

    Returns:
        A dict with the current padding parameters if padding is enabled

    Return type:
        (dict, optional)

post_process(encoding, pair=None, add_special_tokens=True)

    Apply all the post-processing steps to the given encodings.

    The various steps are:

    1. Truncate according to the set truncation params (provided with enable_truncation())
    2. Apply the PostProcessor
    3. Pad according to the set padding params (provided with enable_padding())

post_processor

    The optional PostProcessor in use by the Tokenizer

pre_tokenizer

    The optional PreTokenizer in use by the Tokenizer

save(path, pretty=False)

    Save the Tokenizer to the file at the given path.

    Parameters:
        path (str) – A path to a file in which to save the serialized tokenizer.
        pretty (bool, defaults to False) – Whether the JSON file should be pretty formatted.

to_str(pretty=False)

    Gets a serialized string representing this Tokenizer.

    Parameters:
        pretty (bool, defaults to False) – Whether the JSON string should be pretty formatted.

    Returns:
        A string representing the serialized Tokenizer

    Return type:
        str
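
A round-trip sketch (assuming write access to the working directory; Tokenizer.from_file reloads a serialized tokenizer):

    tokenizer.save("tokenizer.json", pretty=True)   # human-readable JSON on disk
    restored = Tokenizer.from_file("tokenizer.json")
    serialized = tokenizer.to_str()                 # the same serialization, in memory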

token_to_id(token)

    Convert the given token to its corresponding id if it exists.

    Parameters:
        token (str) – The token to convert

    Returns:
        An optional id, None if out of vocabulary

    Return type:
        Optional[int]
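
The two conversion methods are inverses over the known vocabulary, and both return None for out-of-vocabulary inputs rather than raising. A sketch:

    uid = tokenizer.token_to_id("[CLS]")  # None if "[CLS]" is not in the vocabulary
    if uid is not None:
        assert tokenizer.id_to_token(uid) == "[CLS]"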

truncation

    Get the currently set truncation parameters.

    Cannot be set; use enable_truncation() instead.

    Returns:
        A dict with the current truncation parameters if truncation is enabled

    Return type:
        (dict, optional)

Encoding

class tokenizers.Encoding

    The Encoding represents the output of a Tokenizer.

attention_mask

    The attention mask.

    This indicates to the LM which tokens should be attended to, and which should not. This is especially important when batching sequences, where we need to apply padding.

    Returns:
        The attention mask

    Return type:
        List[int]

char_to_token(char_pos, sequence_index=0)

    Get the token that contains the char at the given position in the input sequence.

    Parameters:
        char_pos (int) – The position of a char in the input string
        sequence_index (int, defaults to 0) – The index of the sequence that contains the target char

    Returns:
        The index of the token that contains this char in the encoded sequence

    Return type:
        int

char_to_word(char_pos, sequence_index=0)

    Get the word that contains the char at the given position in the input sequence.

    Parameters:
        char_pos (int) – The position of a char in the input string
        sequence_index (int, defaults to 0) – The index of the sequence that contains the target char

    Returns:
        The index of the word that contains this char in the input sequence

    Return type:
        int

ids

    The generated IDs.

    The IDs are the main input to a Language Model. They are the token indices, the numerical representations that a LM understands.

    Returns:
        The list of IDs

    Return type:
        List[int]

n_sequences

    The number of sequences represented.

    Returns:
        The number of sequences in this Encoding

    Return type:
        int

offsets

    The offsets associated to each token.

    These offsets let you slice the input string, and thus retrieve the original part that led to producing the corresponding token.

    Returns:
        The list of offsets

    Return type:
        A List of Tuple[int, int]
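
For example, a sketch slicing the input back out for each token (reusing the tokenizer from the earlier sketch):

    text = "Hello world"
    encoding = tokenizer.encode(text)
    for (start, end), token in zip(encoding.offsets, encoding.tokens):
        # Special tokens carry the empty span (0, 0), so their slice is "".
        print(token, "<-", text[start:end])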

overflowing

    A List of overflowing Encoding.

    When using truncation, the Tokenizer takes care of splitting the output into as many pieces as required to match the specified maximum length. This field lets you retrieve all the subsequent pieces.

    When you use pairs of sequences, the overflowing pieces will contain enough variations to cover all the possible combinations, while respecting the provided maximum length.

pad(length, direction='right', pad_id=0, pad_type_id=0, pad_token='[PAD]')

    Pad the Encoding at the given length.

    Parameters:
        length (int) – The desired length
        direction (str, defaults to right) – The expected padding direction. Can be either right or left
        pad_id (int, defaults to 0) – The ID corresponding to the padding token
        pad_type_id (int, defaults to 0) – The type ID corresponding to the padding token
        pad_token (str, defaults to [PAD]) – The pad token to use

sequence_ids

    The generated sequence indices.

    They represent the index of the input sequence associated to each token. The sequence id can be None if the token is not related to any input sequence, as is the case for special tokens.

    Returns:
        A list of optional sequence indices.

    Return type:
        A List of Optional[int]

set_sequence_id(sequence_id)

    Set the given sequence index for the whole range of tokens contained in this Encoding.

special_tokens_mask

    The special token mask.

    This indicates which tokens are special tokens, and which are not.

    Returns:
        The special tokens mask

    Return type:
        List[int]

token_to_chars(token_index)

    Get the offsets of the token at the given index.

    The returned offsets are related to the input sequence that contains the token. To determine which input sequence it belongs to, you must call token_to_sequence().

    Parameters:
        token_index (int) – The index of a token in the encoded sequence.

    Returns:
        The token offsets (first, last + 1)

    Return type:
        Tuple[int, int]

token_to_sequence(token_index)

    Get the index of the sequence represented by the given token.

    In the general use case, this method returns 0 for a single sequence or the first sequence of a pair, and 1 for the second sequence of a pair.

    Parameters:
        token_index (int) – The index of a token in the encoded sequence.

    Returns:
        The sequence id of the given token

    Return type:
        int

token_to_word(token_index)

    Get the index of the word that contains the token in one of the input sequences.

    The returned word index is related to the input sequence that contains the token. To determine which input sequence it belongs to, you must call token_to_sequence().

    Parameters:
        token_index (int) – The index of a token in the encoded sequence.

    Returns:
        The index of the word in the relevant input sequence.

    Return type:
        int
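
A sketch tying the three token-level lookups together for a pair of sequences (for special tokens, these lookups may return None):

    enc = tokenizer.encode("Hello world", "And its pair")
    i = 1  # an arbitrary token index in the encoded sequence
    enc.token_to_sequence(i)  # 0 for the first sequence, 1 for the second
    enc.token_to_chars(i)     # (first, last + 1) character span in that sequence
    enc.token_to_word(i)      # word index within that sequence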

tokens

    The generated tokens.

    They are the string representation of the IDs.

    Returns:
        The list of tokens

    Return type:
        List[str]

truncate(max_length, stride=0)

    Truncate the Encoding at the given length.

    If this Encoding represents multiple sequences, this information is lost when truncating: the result will be considered as representing a single sequence.

    Parameters:
        max_length (int) – The desired length
        stride (int, defaults to 0) – The length of previous content to be included in each overflowing piece

type_ids

    The generated type IDs.

    Generally used for tasks like sequence classification or question answering, these type IDs let the LM know which input sequence each token corresponds to.

    Returns:
        The list of type ids

    Return type:
        List[int]

word_ids

    The generated word indices.

    They represent the index of the word associated to each token. When the input is pre-tokenized, they correspond to the ID of the given input label, otherwise they correspond to the word indices as defined by the PreTokenizer that was used.

    For special tokens and such (any token that was generated from something that was not part of the input), the output is None.

    Returns:
        A list of optional word indices.

    Return type:
        A List of Optional[int]
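
For example, with a pre-tokenized input the word indices map straight back to the given words (a sketch):

    enc = tokenizer.encode(["Hello", "world"], is_pretokenized=True)
    # One entry per token: None for special tokens, otherwise the source word index.
    print(enc.word_ids)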

word_to_chars(word_index, sequence_index=0)

    Get the offsets of the word at the given index in one of the input sequences.

    Parameters:
        word_index (int) – The index of a word in one of the input sequences.
        sequence_index (int, defaults to 0) – The index of the sequence that contains the target word

    Returns:
        The range of characters (span) (first, last + 1)

    Return type:
        Tuple[int, int]

word_to_tokens(word_index, sequence_index=0)

    Get the encoded tokens corresponding to the word at the given index in one of the input sequences.

    Parameters:
        word_index (int) – The index of a word in one of the input sequences.
        sequence_index (int, defaults to 0) – The index of the sequence that contains the target word

    Returns:
        The range of tokens: (first, last + 1)

    Return type:
        Tuple[int, int]
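
A sketch of the word-level lookups for word 0 of the first input sequence:

    enc = tokenizer.encode("Hello world")
    enc.word_to_tokens(0, sequence_index=0)  # token range (first, last + 1) covering word 0
    enc.word_to_chars(0, sequence_index=0)   # character span (first, last + 1) of word 0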

words

    The generated word indices.

    Warning: This is deprecated and will be removed in a future version. Please use word_ids instead.

    They represent the index of the word associated to each token. When the input is pre-tokenized, they correspond to the ID of the given input label, otherwise they correspond to the word indices as defined by the PreTokenizer that was used.

    For special tokens and such (any token that was generated from something that was not part of the input), the output is None.

    Returns:
        A list of optional word indices.

    Return type:
        A List of Optional[int]

Added Tokens

class tokenizers.AddedToken(self, content, single_word=False, lstrip=False, rstrip=False, normalized=True)

    Represents a token that can be added to a Tokenizer. It can have special options that define the way it should behave.

    Parameters:
        content (str) – The content of the token
        single_word (bool, defaults to False) – Defines whether this token should only match single words. If True, this token will never match inside of a word. For example, the token ing would match on tokenizing if this option is False, but not if it is True. The notion of "inside of a word" is defined by the word boundaries pattern in regular expressions (i.e. the token should start and end with word boundaries).
        lstrip (bool, defaults to False) – Defines whether this token should strip all potential whitespace on its left side. If True, this token will greedily match any whitespace on its left. For example, if we try to match the token [MASK] with lstrip=True, in the text "I saw a [MASK]", we would match on " [MASK]" (note the space on the left).
        rstrip (bool, defaults to False) – Defines whether this token should strip all potential whitespace on its right side. If True, this token will greedily match any whitespace on its right. It works just like lstrip, but on the right.
        normalized (bool, defaults to True with add_tokens() and False with add_special_tokens()) – Defines whether this token should match against the normalized version of the input text. For example, with the added token "yesterday" and a normalizer in charge of lowercasing the text, the token could be extracted from the input "I saw a lion Yesterday".
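
For instance, a sketch of a mask token that absorbs the whitespace to its left (reusing the tokenizer from the earlier sketch):

    from tokenizers import AddedToken

    mask = AddedToken("[MASK]", lstrip=True, normalized=False)
    tokenizer.add_special_tokens([mask])
    # In "I saw a [MASK]", this token now matches " [MASK]", leading space included.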

content

    Get the content of this AddedToken

normalized

    Get the value of the normalized option

single_word

    Get the value of the single_word option