Decoders

DecodeStream

class tokenizers.decoders.DecodeStream

( ids = None skip_special_tokens = False )

Class needed for streaming decoding.

BPEDecoder

class tokenizers.decoders.BPEDecoder

( suffix = '</w>' )

Parameters

suffix (str, optional, defaults to </w>) — The suffix that was used to mark an end-of-word. This suffix is replaced by whitespace during decoding.

BPEDecoder Decoder
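The suffix handling described above can be illustrated with a minimal pure-Python sketch. The helper name `bpe_decode` is hypothetical and this is not the library's implementation; it only mirrors the documented behavior of turning the end-of-word suffix into whitespace.

```python
def bpe_decode(tokens, suffix="</w>"):
    """Illustrative only: join BPE tokens, turning the end-of-word suffix into a space."""
    text = "".join(tokens).replace(suffix, " ")
    # The final word also carries the suffix, so drop the trailing space it leaves.
    return text.rstrip()


print(bpe_decode(["hel", "lo</w>", "wor", "ld</w>"]))
```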

ByteFallback

class tokenizers.decoders.ByteFallback

( )

ByteFallback Decoder. ByteFallback is a simple trick that converts tokens that look like <0x61> into raw bytes and attempts to decode them into a string. If the bytes cannot be decoded, you get � instead for each inconvertible byte token.
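A rough pure-Python sketch of this behavior follows. The function name `byte_fallback` and the choice to decode runs of consecutive byte tokens together as UTF-8 are assumptions made for illustration, not the library's code.

```python
import re

# Tokens such as "<0x61>" carry a single byte value in hexadecimal.
BYTE_TOKEN = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def byte_fallback(tokens):
    """Illustrative only: turn runs of <0xNN> tokens into decoded text."""
    out = []
    pending = []  # byte values collected from consecutive byte tokens

    def flush():
        if not pending:
            return
        try:
            out.append(bytes(pending).decode("utf-8"))
        except UnicodeDecodeError:
            # One replacement character per byte token that could not be decoded.
            out.extend("\ufffd" for _ in pending)
        pending.clear()

    for tok in tokens:
        match = BYTE_TOKEN.match(tok)
        if match:
            pending.append(int(match.group(1), 16))
        else:
            flush()
            out.append(tok)
    flush()
    return out


print(byte_fallback(["<0x61>", "bc"]))
```

Ordinary tokens pass through unchanged, so the result can still be handed to a later decoding step.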

ByteLevel

class tokenizers.decoders.ByteLevel

( )

ByteLevel Decoder

This decoder is to be used in tandem with the ByteLevel PreTokenizer.

CTC

class tokenizers.decoders.CTC

( pad_token = '<pad>' word_delimiter_token = '|' cleanup = True )

Parameters

pad_token (str, optional, defaults to <pad>) — The pad token used by CTC to delimit a new token.
word_delimiter_token (str, optional, defaults to |) — The word delimiter token. It will be replaced by a space.
cleanup (bool, optional, defaults to True) — Whether to clean up some tokenization artifacts, mainly spaces before punctuation and some abbreviated English forms.

CTC Decoder
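As a hedged illustration of how the parameters above interact, the sketch below collapses repeated tokens, drops the pad token, and turns the word delimiter into a space. The helper `ctc_decode` is hypothetical, the delimiter-to-space replacement is an assumption, and the `cleanup` step is left out.

```python
from itertools import groupby


def ctc_decode(tokens, pad_token="<pad>", word_delimiter_token="|"):
    """Illustrative only: a simplified CTC-style decode over string tokens."""
    # Collapse runs of identical tokens; the pad token separates genuine repeats.
    collapsed = [tok for tok, _ in groupby(tokens)]
    # Pad tokens carry no text of their own.
    kept = [tok for tok in collapsed if tok != pad_token]
    # Assumption: the word delimiter becomes a single space.
    return "".join(kept).replace(word_delimiter_token, " ").strip()


frames = ["h", "h", "e", "<pad>", "l", "l", "<pad>", "l", "o", "|", "|", "w", "o"]
print(ctc_decode(frames))
```

Note how the pad token between the two runs of "l" is what keeps the double letter in the output.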

Fuse

class tokenizers.decoders.Fuse

( )

Fuse Decoder. Fuse simply fuses every token into a single string. Fusing is normally the last step of decoding, so this decoder is only needed when other decoders must run after the fusion.

Metaspace

class tokenizers.decoders.Metaspace

( )

Parameters

replacement (str, optional, defaults to ▁) — The replacement character. Must be exactly one character. By default we use the ▁ (U+2581) meta symbol (Same as in SentencePiece).
prepend_scheme (str, optional, defaults to "always") — Whether to add a space to the first word if there isn’t already one. This lets us treat hello exactly like say hello. Choices: “always”, “never”, “first”. “first” means the space is only added on the first token (relevant when special tokens or other pre-tokenizers are used).

Metaspace Decoder
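The effect of the two parameters can be sketched in plain Python. The helper `metaspace_decode` is hypothetical, and treating every scheme other than "never" as "remove one leading space" is a simplifying assumption.

```python
def metaspace_decode(tokens, replacement="\u2581", prepend_scheme="always"):
    """Illustrative only: map the meta symbol back to spaces."""
    text = "".join(tokens).replace(replacement, " ")
    # Assumption: if a space was prepended during pre-tokenization, undo it here.
    if prepend_scheme != "never" and text.startswith(" "):
        text = text[1:]
    return text


print(metaspace_decode(["\u2581Hello", "\u2581world"]))
```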

Replace

class tokenizers.decoders.Replace

( pattern content )

Replace Decoder

This decoder is to be used in tandem with the Replace normalizer (tokenizers.normalizers.Replace).

Sequence

class tokenizers.decoders.Sequence

( decoders )

Parameters

decoders (List[Decoder]) — The decoders that need to be chained

Sequence Decoder
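Chaining can be pictured as function composition over a list of tokens. In the sketch below every name (`run_sequence`, `replace_meta`, `fuse`, `strip_leading_space`) is a hypothetical stand-in written for illustration, not a library class.

```python
from functools import reduce


def run_sequence(decoders, tokens):
    """Illustrative only: feed the token list through each step in order."""
    return reduce(lambda toks, step: step(toks), decoders, tokens)


def replace_meta(tokens):
    # Stand-in for a replace step: meta symbol -> space.
    return [tok.replace("\u2581", " ") for tok in tokens]


def fuse(tokens):
    # Stand-in for a fuse step: merge everything into one token.
    return ["".join(tokens)]


def strip_leading_space(tokens):
    # Stand-in for a strip step: drop one leading space per token.
    return [tok[1:] if tok.startswith(" ") else tok for tok in tokens]


steps = [replace_meta, fuse, strip_leading_space]
print("".join(run_sequence(steps, ["\u2581Hello", "\u2581world"])))
```

The order matters: stripping before fusing would remove the space in front of every token, not only the first.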

Strip

class tokenizers.decoders.Strip

( content left = 0 right = 0 )

Strip Decoder. Strips n left characters of each token, or n right characters of each token.
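The signature suggests a per-token operation. The sketch below assumes that `content` is the single character to remove and that `left` and `right` cap how many occurrences are stripped from each side. Both points are assumptions, and `strip_token` is a hypothetical helper.

```python
def strip_token(token, content, left=0, right=0):
    """Illustrative only: remove up to `left`/`right` copies of `content` from the ends."""
    start = 0
    while start < left and start < len(token) and token[start] == content:
        start += 1
    end = len(token)
    removed = 0
    while removed < right and end > start and token[end - 1] == content:
        end -= 1
        removed += 1
    return token[start:end]


print(repr(strip_token("  hi ", " ", left=1, right=1)))
```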

WordPiece

class tokenizers.decoders.WordPiece

( prefix = '##' cleanup = True )

Parameters

prefix (str, optional, defaults to ##) — The prefix to use for subwords that are not a beginning-of-word
cleanup (bool, optional, defaults to True) — Whether to clean up some tokenization artifacts, mainly spaces before punctuation and some abbreviated English forms.

WordPiece Decoder
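A minimal sketch of the prefix handling, leaving out the `cleanup` step: continuation pieces lose their prefix and attach to the previous piece, while other tokens start a new word. The helper `wordpiece_decode` is hypothetical and not the library's implementation.

```python
def wordpiece_decode(tokens, prefix="##"):
    """Illustrative only: glue continuation pieces, space-separate everything else."""
    text = ""
    for i, tok in enumerate(tokens):
        if i == 0:
            text += tok
        elif tok.startswith(prefix):
            # Continuation of the current word: drop the prefix and append.
            text += tok[len(prefix):]
        else:
            # A new word begins.
            text += " " + tok
    return text


print(wordpiece_decode(["un", "##believ", "##able", "story"]))
```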
