You are viewing main version, which requires installation from source. If you'd like
regular pip install, checkout the latest stable version (v0.20.3).
Decoders
Python
Rust
Node
BPEDecoder
ByteLevel
ByteLevel Decoder
This decoder is to be used in tandem with the ByteLevel PreTokenizer.
CTC
class tokenizers.decoders.CTC
( pad_token = '<pad>' word_delimiter_token = '|' cleanup = True )
Parameters
- pad_token (
str
, optional, defaults to<pad>
) — The pad token used by CTC to delimit a new token. - word_delimiter_token (
str
, optional, defaults to|
) — The word delimiter token. It will be replaced by a - cleanup (
bool
, optional, defaults toTrue
) — Whether to cleanup some tokenization artifacts. Mainly spaces before punctuation, and some abbreviated english forms.
CTC Decoder
Metaspace
class tokenizers.decoders.Metaspace
( )
Parameters
- replacement (
str
, optional, defaults to▁
) — The replacement character. Must be exactly one character. By default we use the ▁ (U+2581) meta symbol (Same as in SentencePiece). - prepend_scheme (
str
, optional, defaults to"always"
) — Whether to add a space to the first word if there isn’t already one. This lets us treat hello exactly like say hello. Choices: “always”, “never”, “first”. First means the space is only added on the first token (relevant when special tokens are used or other pre_tokenizer are used).
Metaspace Decoder