Tokenizers documentation

Decoders

BPEDecoder

class tokenizers.decoders.BPEDecoder

( suffix = '</w>' )

Parameters

  • suffix (str, optional, defaults to </w>) — The suffix that was used to characterize an end-of-word. This suffix will be replaced by whitespace during decoding.

BPEDecoder Decoder
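
To make the role of suffix concrete, here is a minimal pure-Python sketch of the behaviour described above. It is not the library's implementation: the function name bpe_decode_sketch is hypothetical, and dropping the trailing space after the last word is an assumption made for readability.

```python
def bpe_decode_sketch(tokens, suffix="</w>"):
    """Join BPE tokens, turning each end-of-word suffix into a space.

    Hypothetical helper for illustration only; not part of tokenizers.
    """
    # Every end-of-word marker becomes a single space.
    text = "".join(token.replace(suffix, " ") for token in tokens)
    # Assumption: the space produced by the final word is not wanted.
    return text.rstrip(" ")


print(bpe_decode_sketch(["hel", "lo</w>", "wor", "ld</w>"]))  # hello world
```

Tokens without the suffix are glued to the next token, so only the suffix decides where word boundaries fall.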

ByteLevel

class tokenizers.decoders.ByteLevel

( )

ByteLevel Decoder

This decoder is to be used in tandem with the ByteLevel PreTokenizer.
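
As a rough illustration of why this decoder is paired with the ByteLevel PreTokenizer, the sketch below assumes the pre-tokenizer represents each byte with a printable character using the GPT-2 style byte-to-character table. Both function names are hypothetical, and the library's own code may differ in details such as error handling.

```python
def gpt2_byte_to_unicode():
    """Build a GPT-2 style table mapping each byte (0-255) to a printable character."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for byte in range(256):
        if byte not in printable:
            # Remaining bytes (space, control characters, ...) are shifted past U+00FF.
            byte_values.append(byte)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


def byte_level_decode_sketch(tokens):
    """Map every character back to its byte, then decode the bytes as UTF-8."""
    char_to_byte = {ch: b for b, ch in gpt2_byte_to_unicode().items()}
    raw = bytes(char_to_byte[ch] for ch in "".join(tokens))
    return raw.decode("utf-8", errors="replace")


# "\u0120" is the character that stands for the space byte in this table.
print(byte_level_decode_sketch(["Hello", "\u0120world"]))  # Hello world
```

Because the mapping is applied per byte, a multi-byte UTF-8 character can be split across tokens and is only recovered once the tokens are joined and decoded together.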

CTC

class tokenizers.decoders.CTC

( pad_token = '<pad>', word_delimiter_token = '|', cleanup = True )

Parameters

  • pad_token (str, optional, defaults to <pad>) — The pad token used by CTC to delimit a new token.
  • word_delimiter_token (str, optional, defaults to |) — The word delimiter token. It will be replaced by a space.
  • cleanup (bool, optional, defaults to True) — Whether to clean up some tokenization artifacts, mainly spaces before punctuation and some abbreviated English forms.

CTC Decoder
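
The parameters above follow the usual CTC decoding recipe: collapse consecutive repeats, drop the pad token, then turn the word delimiter into a space. A minimal sketch of that recipe, assuming the standard CTC collapse rule and leaving out the cleanup step, could look like this (ctc_decode_sketch is a hypothetical name, not a library function):

```python
from itertools import groupby


def ctc_decode_sketch(tokens, pad_token="<pad>", word_delimiter_token="|"):
    """Collapse a frame-level CTC token sequence into text (cleanup not modelled)."""
    # 1. Merge runs of identical consecutive tokens into a single token.
    collapsed = [token for token, _ in groupby(tokens)]
    # 2. Drop pad tokens; a pad between two equal tokens keeps them distinct in step 1.
    kept = [token for token in collapsed if token != pad_token]
    # 3. Replace the word delimiter with a space.
    return "".join(kept).replace(word_delimiter_token, " ").strip()


frames = ["h", "h", "e", "l", "<pad>", "l", "o", "o", "|", "|", "w", "o", "r", "l", "d"]
print(ctc_decode_sketch(frames))  # hello world
```

Note how the pad token between the two "l" frames is what preserves the double letter in "hello".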

Metaspace

class tokenizers.decoders.Metaspace

( )

Parameters

  • replacement (str, optional, defaults to ▁) — The replacement character. Must be exactly one character. By default we use the ▁ (U+2581) meta symbol (same as in SentencePiece).
  • add_prefix_space (bool, optional, defaults to True) — Whether to add a space to the first word if there isn’t already one. This lets us treat hello exactly like say hello.

Metaspace Decoder
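
The two parameters can be illustrated with a short pure-Python sketch. It assumes decoding simply reverses the pre-tokenization step: the replacement character becomes a space, and the space that add_prefix_space put before the first word is removed. The name metaspace_decode_sketch is hypothetical.

```python
def metaspace_decode_sketch(tokens, replacement="\u2581", add_prefix_space=True):
    """Turn the meta symbol (U+2581 by default) back into ordinary spaces."""
    text = "".join(tokens).replace(replacement, " ")
    # Assumption: the leading space was added artificially, so drop it again.
    if add_prefix_space and text.startswith(" "):
        text = text[1:]
    return text


print(metaspace_decode_sketch(["\u2581Hello", "\u2581wor", "ld"]))  # Hello world
```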

WordPiece

class tokenizers.decoders.WordPiece

( prefix = '##', cleanup = True )

Parameters

  • prefix (str, optional, defaults to ##) — The prefix to use for subwords that are not at the beginning of a word.
  • cleanup (bool, optional, defaults to True) — Whether to clean up some tokenization artifacts, mainly spaces before punctuation and some abbreviated English forms.

WordPiece Decoder
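
A minimal sketch of how the prefix drives decoding: tokens that carry the prefix are attached to the previous token, and every other token starts a new word. This is an illustration under that assumption, with the cleanup step left out; wordpiece_decode_sketch is a hypothetical name.

```python
def wordpiece_decode_sketch(tokens, prefix="##"):
    """Rebuild text from WordPiece tokens (cleanup not modelled)."""
    pieces = []
    for index, token in enumerate(tokens):
        if index == 0:
            # The first token always opens the text as-is.
            pieces.append(token)
        elif token.startswith(prefix):
            # Continuation piece: strip the prefix and attach to the previous token.
            pieces.append(token[len(prefix):])
        else:
            # New word: separate it from the previous one with a space.
            pieces.append(" " + token)
    return "".join(pieces)


print(wordpiece_decode_sketch(["token", "##izer", "##s", "are", "fun"]))  # tokenizers are fun
```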