Tokenizers documentation
Decoders
You are viewing version v0.13.3. A newer version, v0.20.3, is available.
BPEDecoder
ByteLevel
ByteLevel Decoder
This decoder is to be used in tandem with the ByteLevel PreTokenizer.
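To illustrate why the decoder has to be paired with the matching pre-tokenizer, here is a minimal pure-Python sketch of a byte-level round trip. It assumes the byte-to-character table popularized by GPT-2, in which the space byte becomes Ġ; the function names are hypothetical and this is not the library's implementation.

```python
def byte_to_unicode_table():
    """Map each of the 256 byte values to a printable character.

    Assumption: this follows the GPT-2 style table. Bytes that are already
    printable keep their own character; the rest are shifted above U+00FF.
    """
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table


BYTE_TO_CHAR = byte_to_unicode_table()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def byte_level_encode_sketch(text):
    """Hypothetical helper: what a byte-level pre-tokenizer does to raw text."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def byte_level_decode_sketch(tokens):
    """Hypothetical helper: join the tokens and map every character back to its byte."""
    raw = bytes(CHAR_TO_BYTE[c] for c in "".join(tokens))
    return raw.decode("utf-8", errors="replace")


print(byte_level_decode_sketch(["Hello", "Ġworld"]))
```

Tokens produced without the byte-level mapping would contain characters that are missing from the reverse table, which is why the two components are used together.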
CTC
class tokenizers.decoders.CTC
( pad_token = '<pad>', word_delimiter_token = '|', cleanup = True )
Parameters
- pad_token (str, optional, defaults to <pad>): The pad token used by CTC to delimit a new token.
- word_delimiter_token (str, optional, defaults to |): The word delimiter token. It will be replaced by a space.
- cleanup (bool, optional, defaults to True): Whether to clean up some tokenization artifacts, mainly spaces before punctuation and some abbreviated English forms.
CTC Decoder
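The roles of pad_token and word_delimiter_token can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's code: the function name is hypothetical, consecutive repeated tokens are assumed to be collapsed first (the usual CTC convention), and the cleanup step is left out.

```python
from itertools import groupby


def ctc_decode_sketch(tokens, pad_token="<pad>", word_delimiter_token="|"):
    """Hypothetical helper showing the usual CTC post-processing steps."""
    # 1. Collapse runs of identical consecutive tokens into a single token.
    collapsed = [token for token, _ in groupby(tokens)]
    # 2. Drop pad tokens; a pad between two equal tokens keeps them distinct.
    kept = [token for token in collapsed if token != pad_token]
    # 3. Join the tokens and turn word delimiters into spaces.
    return "".join(kept).replace(word_delimiter_token, " ")


frames = ["<pad>", "h", "h", "e", "l", "<pad>", "l", "o", "o",
          "|", "|", "w", "o", "<pad>", "r", "l", "d"]
print(ctc_decode_sketch(frames))
```

The pad token between the two "l" frames is what preserves the double letter: without it, the collapse step would merge them into one.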
Metaspace
class tokenizers.decoders.Metaspace
( )
Parameters
- replacement (str, optional, defaults to ▁): The replacement character. Must be exactly one character. By default we use the ▁ (U+2581) meta symbol (same as in SentencePiece).
- add_prefix_space (bool, optional, defaults to True): Whether to add a space to the first word if there isn't already one. This lets us treat "hello" exactly like "say hello".
Metaspace Decoder
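As a rough illustration of the two parameters, the decoding step can be sketched in plain Python. The function name is hypothetical and the logic is an assumption based on the parameter descriptions above: the replacement character becomes a space again, and when add_prefix_space is set, the space that was added in front of the first word is dropped.

```python
def metaspace_decode_sketch(tokens, replacement="▁", add_prefix_space=True):
    """Hypothetical helper: undo the meta symbol substitution."""
    text = "".join(tokens).replace(replacement, " ")
    # Assumption: the leading space came from add_prefix_space at encoding
    # time, so it is not part of the original text and can be removed.
    if add_prefix_space and text.startswith(" "):
        text = text[1:]
    return text


print(metaspace_decode_sketch(["▁say", "▁hello"]))
print(repr(metaspace_decode_sketch(["▁hello"], add_prefix_space=False)))
```

With add_prefix_space enabled, the token for "hello" at the start of a text is the same as the token for "hello" after "say", and the decoder removes the extra leading space again.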