Tokenizers documentation
Decoders
DecodeStream
Class used for streaming decoding, where text is produced incrementally as tokens arrive.
BPEDecoder
ByteFallback
ByteFallback Decoder
ByteFallback is a simple trick that converts tokens of the form <0x61>
into raw bytes and then attempts to decode those bytes into a string. If the
bytes cannot be decoded, you get � instead, one for each byte token that could not be converted.
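The behaviour described above can be sketched in plain Python. This is an illustration of the idea only, not the library's implementation; `byte_fallback_decode` and `BYTE_TOKEN` are hypothetical names, and UTF-8 is assumed as the target encoding.

```python
import re

# Matches tokens such as <0x61> and captures the two hex digits.
BYTE_TOKEN = re.compile(r"<0x([0-9A-Fa-f]{2})>")


def byte_fallback_decode(tokens):
    """Hypothetical helper: join tokens, decoding runs of <0xNN> tokens as UTF-8."""
    pieces = []
    pending = bytearray()  # bytes collected from consecutive <0xNN> tokens

    def flush():
        if not pending:
            return
        try:
            pieces.append(pending.decode("utf-8"))
        except UnicodeDecodeError:
            # One replacement character per byte token that could not be converted.
            pieces.append("\ufffd" * len(pending))
        pending.clear()

    for token in tokens:
        match = BYTE_TOKEN.fullmatch(token)
        if match:
            pending.append(int(match.group(1), 16))
        else:
            flush()
            pieces.append(token)
    flush()
    return "".join(pieces)
```

For instance, `byte_fallback_decode(["<0x61>"])` gives `"a"`, while a truncated multi-byte sequence such as `["<0xE5>", "<0x8f>", "a"]` gives two replacement characters followed by `"a"`.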
ByteLevel
ByteLevel Decoder
This decoder is to be used in tandem with the ByteLevel PreTokenizer.
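As a rough sketch of why the two belong together: a byte-level pre-tokenizer represents every byte as a printable character, and the decoder has to invert that mapping. The example below assumes the GPT-2 style byte-to-character table; `bytes_to_unicode` and `byte_level_decode` are illustrative names, not the library's API.

```python
def bytes_to_unicode():
    """Byte value -> printable character, following the GPT-2 convention (assumed)."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in mapping:
            # Non-printable bytes are shifted to code points starting at 256.
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


# Inverse table used when decoding.
CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def byte_level_decode(tokens):
    """Hypothetical helper: map each character back to its byte, then decode as UTF-8."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in "".join(tokens))
    return raw.decode("utf-8", errors="replace")
```

Under this table the space byte is written as `Ġ`, so `byte_level_decode(["Hello", "Ġworld"])` yields `"Hello world"`.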
CTC
class tokenizers.decoders.CTC
( pad_token = '<pad>', word_delimiter_token = '|', cleanup = True )
Parameters
- pad_token (str, optional, defaults to <pad>) — The pad token used by CTC to delimit a new token.
- word_delimiter_token (str, optional, defaults to |) — The word delimiter token. It will be replaced by a space.
- cleanup (bool, optional, defaults to True) — Whether to clean up some tokenization artifacts, mainly spaces before punctuation and some abbreviated English forms.
CTC Decoder
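To make the roles of pad_token and word_delimiter_token concrete, here is a minimal sketch of CTC-style decoding in plain Python. It is not the library's code: `ctc_decode` is a hypothetical helper, and the cleanup step is deliberately left out.

```python
from itertools import groupby


def ctc_decode(tokens, pad_token="<pad>", word_delimiter_token="|"):
    """Hypothetical helper illustrating CTC decoding (cleanup step omitted)."""
    # 1. Collapse runs of identical consecutive tokens into a single token.
    collapsed = [token for token, _ in groupby(tokens)]
    # 2. Drop pad tokens; their job was to separate genuine repeats such as "l", "l".
    kept = [token for token in collapsed if token != pad_token]
    # 3. Turn the word delimiter into a space and join everything.
    return "".join(" " if token == word_delimiter_token else token for token in kept)
```

The pad token between the two runs of `l` in `["h", "e", "e", "l", "l", "<pad>", "l", "o"]` is what keeps the double letter in `"hello"`.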
Fuse
Fuse Decoder
Fuse simply fuses every token into a single string. This is normally the last step of decoding, so this decoder is only needed when other decoders must run after the fusion.
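The sketch below illustrates why fusing before a later step can matter. It assumes a simplified model in which each decoding step maps a list of tokens to a list of tokens; `fuse`, `strip_outer_spaces`, and `run_chain` are hypothetical helpers, not library functions.

```python
def fuse(tokens):
    """Collapse all tokens into a one-element list holding the joined string."""
    return ["".join(tokens)]


def strip_outer_spaces(tokens):
    """Hypothetical follow-up step: trims spaces at both ends of each token."""
    return [token.strip(" ") for token in tokens]


def run_chain(tokens, steps):
    """Apply each step in order, then join whatever tokens remain."""
    for step in steps:
        tokens = step(tokens)
    return "".join(tokens)


tokens = [" Hello", " world "]
without_fuse = run_chain(tokens, [strip_outer_spaces])    # "Helloworld"
with_fuse = run_chain(tokens, [fuse, strip_outer_spaces])  # "Hello world"
```

Without the fusion, the follow-up step sees each token separately and removes the space between the words; after fusing, it sees the whole string and only trims the ends.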
Metaspace
class tokenizers.decoders.Metaspace
( )
Parameters
- replacement (str, optional, defaults to ▁) — The replacement character. Must be exactly one character. By default we use the ▁ (U+2581) meta symbol (same as in SentencePiece).
- prepend_scheme (str, optional, defaults to "always") — Whether to add a space to the first word if there isn't already one. This lets us treat hello exactly like say hello. Choices: "always", "never", "first". "first" means the space is only added on the first token (relevant when special tokens or other pre-tokenizers are used).
Metaspace Decoder
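A minimal sketch of what decoding with these two parameters might look like, written in plain Python. `metaspace_decode` is a hypothetical helper; the assumption made here is that any scheme other than "never" means a leading replacement character on the first token was added artificially and should be dropped.

```python
def metaspace_decode(tokens, replacement="▁", prepend_scheme="always"):
    """Hypothetical helper: turn the meta symbol back into spaces."""
    pieces = []
    for index, token in enumerate(tokens):
        # Assumption: unless the scheme is "never", the leading meta symbol of the
        # first token was prepended artificially, so it is removed rather than
        # turned into a space.
        if index == 0 and prepend_scheme != "never" and token.startswith(replacement):
            token = token[len(replacement):]
        pieces.append(token.replace(replacement, " "))
    return "".join(pieces)
```

With the defaults, `metaspace_decode(["▁My", "▁name", "▁is", "▁John"])` returns `"My name is John"`; with `prepend_scheme="never"` the leading space is kept.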
Replace
Replace Decoder
This decoder is to be used in tandem with the tokenizers.normalizers.Replace normalizer.
Sequence
Strip
Strip Decoder
Strips n characters from the left of each token, or n characters from the right of each token.
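The following plain-Python sketch shows one way to read this description. It assumes that only occurrences of a configurable character are stripped, up to the given counts; the parameter names `content`, `left`, and `right` are assumptions for illustration, and `strip_decode` is a hypothetical helper.

```python
def strip_decode(tokens, content=" ", left=0, right=0):
    """Hypothetical helper: strip up to `left`/`right` copies of `content` per token."""
    pieces = []
    for token in tokens:
        # Advance past at most `left` leading occurrences of `content`.
        start = 0
        while start < left and start < len(token) and token[start] == content:
            start += 1
        # Step back over at most `right` trailing occurrences of `content`.
        end = len(token)
        removed = 0
        while removed < right and end > start and token[end - 1] == content:
            end -= 1
            removed += 1
        pieces.append(token[start:end])
    return "".join(pieces)
```

For example, `strip_decode(["Hey", " friend!", "HHH"], content="H", left=1)` removes one leading `H` from the first and last tokens and leaves the middle token untouched.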