Tokenizers documentation
Encode Inputs
These types represent all the different kinds of input that a Tokenizer accepts when using encode_batch().
TextEncodeInput
tokenizers.TextEncodeInput
Represents a textual input for encoding. Can be either:
- A single sequence: TextInputSequence
- A pair of sequences:
  - A Tuple of TextInputSequence
  - Or a List of TextInputSequence of size 2
Alias of Union[str, Tuple[str, str], List[str]].
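The shapes listed above can be illustrated with a minimal sketch that uses only the standard library. The alias below is a local mirror of the documented one, and describe_text_input is a hypothetical helper written for this example; neither is imported from the tokenizers package.

```python
from typing import List, Tuple, Union

# Local mirror of the documented alias (assumption: written here for
# illustration, not imported from the tokenizers package).
TextEncodeInput = Union[str, Tuple[str, str], List[str]]


def describe_text_input(value: TextEncodeInput) -> str:
    """Hypothetical helper: report whether a value is a single sequence or a pair."""
    if isinstance(value, str):
        return "single"
    if (
        isinstance(value, (tuple, list))
        and len(value) == 2
        and all(isinstance(item, str) for item in value)
    ):
        return "pair"
    raise TypeError(f"not shaped like TextEncodeInput: {value!r}")


single = "Hello, world!"
pair_as_tuple = ("What is a tokenizer?", "It splits text into tokens.")
pair_as_list = ["What is a tokenizer?", "It splits text into tokens."]

kinds = [describe_text_input(v) for v in (single, pair_as_tuple, pair_as_list)]
print(kinds)  # ['single', 'pair', 'pair']
```

A list with any length other than 2 is rejected by this helper, matching the "List of TextInputSequence of size 2" wording above.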
PreTokenizedEncodeInput
tokenizers.PreTokenizedEncodeInput
Represents a pre-tokenized input for encoding. Can be either:
- A single sequence: PreTokenizedInputSequence
- A pair of sequences:
  - A Tuple of PreTokenizedInputSequence
  - Or a List of PreTokenizedInputSequence of size 2
Alias of Union[List[str], Tuple[str], Tuple[Union[List[str], Tuple[str]], Union[List[str], Tuple[str]]], List[Union[List[str], Tuple[str]]]].
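As a rough illustration of the pre-tokenized shapes described above, the sketch below checks values using the standard library alone. The helper names are hypothetical, and treating Tuple[str] as "a tuple of strings of any length" is an assumption made for this example.

```python
from typing import Any


def is_word_sequence(value: Any) -> bool:
    """Hypothetical check: a list or tuple whose items are all strings.

    Assumption: tuples of any length are accepted here.
    """
    return isinstance(value, (list, tuple)) and all(
        isinstance(word, str) for word in value
    )


def describe_pretokenized_input(value: Any) -> str:
    """Hypothetical helper: classify a value shaped like PreTokenizedEncodeInput."""
    if is_word_sequence(value):
        return "single"
    if (
        isinstance(value, (list, tuple))
        and len(value) == 2
        and all(is_word_sequence(seq) for seq in value)
    ):
        return "pair"
    raise TypeError(f"not shaped like PreTokenizedEncodeInput: {value!r}")


single = ["Hello", ",", "world", "!"]
pair_as_tuple = (["What", "is", "it", "?"], ["A", "tokenizer", "."])
pair_as_list = [["What", "is", "it", "?"], ("A", "tokenizer", ".")]

pretokenized_kinds = [
    describe_pretokenized_input(v) for v in (single, pair_as_tuple, pair_as_list)
]
print(pretokenized_kinds)  # ['single', 'pair', 'pair']
```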
EncodeInput
tokenizers.EncodeInput
Represents all the possible types of input for encoding. Can be:
- When is_pretokenized=False: TextEncodeInput
- When is_pretokenized=True: PreTokenizedEncodeInput
Alias of Union[str, Tuple[str, str], List[str], Tuple[str], Tuple[Union[List[str], Tuple[str]], Union[List[str], Tuple[str]]], List[Union[List[str], Tuple[str]]]].
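One consequence of the two aliases above is that a value such as ["Hello", "world"] fits both of them, so the is_pretokenized flag decides how it is read. The sketch below illustrates this reading with a hypothetical interpret function written for this example; it follows the alias definitions only and does not call the tokenizers library.

```python
from typing import Any


def _all_strings(value: Any) -> bool:
    """True for a list or tuple that contains only strings."""
    return isinstance(value, (list, tuple)) and all(
        isinstance(item, str) for item in value
    )


def interpret(value: Any, is_pretokenized: bool) -> str:
    """Hypothetical helper: describe how a value reads under each flag setting."""
    if not is_pretokenized:
        # TextEncodeInput: a raw string, or exactly two raw strings.
        if isinstance(value, str):
            return "one raw sequence"
        if _all_strings(value) and len(value) == 2:
            return "pair of raw sequences"
    else:
        # PreTokenizedEncodeInput: a sequence of words, or two such sequences.
        if _all_strings(value):
            return "one pre-tokenized sequence"
        if (
            isinstance(value, (list, tuple))
            and len(value) == 2
            and all(_all_strings(seq) for seq in value)
        ):
            return "pair of pre-tokenized sequences"
    raise TypeError(
        f"{value!r} is not valid with is_pretokenized={is_pretokenized}"
    )


ambiguous = ["Hello", "world"]
as_text = interpret(ambiguous, is_pretokenized=False)
as_words = interpret(ambiguous, is_pretokenized=True)
print(as_text)   # pair of raw sequences
print(as_words)  # one pre-tokenized sequence
```

Under these assumptions, the same list is a pair of raw sequences with is_pretokenized=False and a single sequence of two words with is_pretokenized=True, so the flag should match how the input was prepared.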