Tokenizers documentation
Input Sequences
These types represent all the different kinds of sequences that can be used as input to a Tokenizer.
In general, a sequence is either a string or a list of strings, depending on the operating
mode of the tokenizer: raw text vs pre-tokenized.
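To make the two modes concrete, here is a minimal sketch of the same sentence in both forms. It uses only plain Python; the variable names are illustrative, and `str.split()` is just a stand-in for whatever upstream step produced the words.

```python
# Raw text mode: the input is a single str.
raw_text = "Hello there, how are you?"

# Pre-tokenized mode: the caller has already split the text into words.
# str.split() is only a placeholder for the real upstream splitting step.
pre_tokenized = raw_text.split()

print(type(raw_text).__name__, repr(raw_text))
print(type(pre_tokenized).__name__, pre_tokenized)
```

Either form describes the same content; what changes is whether the tokenizer is expected to split the text itself or to accept the split as given.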
TextInputSequence
tokenizers.TextInputSequence
A str that represents an input sequence.
PreTokenizedInputSequence
tokenizers.PreTokenizedInputSequence
A pre-tokenized input sequence. Can be one of:
- A List of str
- A Tuple of str
Alias of Union[List[str], Tuple[str]].
InputSequence
tokenizers.InputSequence
Represents all the possible types of input sequences for encoding. Can be:
- When is_pretokenized=False: TextInputSequence
- When is_pretokenized=True: PreTokenizedInputSequence
Alias of Union[str, List[str], Tuple[str]].
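The relationship between the three aliases and the is_pretokenized flag can be sketched with the standard typing module. The aliases below are local stand-ins that mirror the documented definitions rather than imports from the library, and `matches_mode` is a hypothetical helper written for illustration, not part of the tokenizers API.

```python
from typing import List, Tuple, Union

# Local stand-ins mirroring the documented aliases (assumption: defined here
# for illustration; the real ones are exposed by the `tokenizers` package).
TextInputSequence = str
PreTokenizedInputSequence = Union[List[str], Tuple[str]]
InputSequence = Union[TextInputSequence, PreTokenizedInputSequence]


def matches_mode(sequence: InputSequence, is_pretokenized: bool = False) -> bool:
    """Hypothetical helper: check that `sequence` has the shape expected
    for the given operating mode."""
    if is_pretokenized:
        # Pre-tokenized mode expects a list or tuple whose items are all str.
        return isinstance(sequence, (list, tuple)) and all(
            isinstance(word, str) for word in sequence
        )
    # Raw text mode expects a single str.
    return isinstance(sequence, str)


print(matches_mode("Hello world"))                                # raw text
print(matches_mode(["Hello", "world"], is_pretokenized=True))     # list of str
print(matches_mode(("Hello", "world"), is_pretokenized=True))     # tuple of str
print(matches_mode("Hello world", is_pretokenized=True))          # wrong mode
```

A check like this can help catch a common mix-up early: passing a bare string while is_pretokenized=True, or a list of words while it is False.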