Normalizers

ByteLevel

class tokenizers.normalizers.ByteLevel

( )

ByteLevel normalizer

Lowercase

class tokenizers.normalizers.Lowercase

( )

Lowercase Normalizer
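
A minimal sketch of the Lowercase normalizer applied to a raw string through normalize_str(). The input text is an arbitrary example.

```python
from tokenizers.normalizers import Lowercase

# Build the normalizer (it takes no arguments) and apply it to a raw string.
normalizer = Lowercase()
lowered = normalizer.normalize_str("Hello WORLD")
print(lowered)
```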

NFC

class tokenizers.normalizers.NFC

( )

NFC Unicode Normalizer

NFD

class tokenizers.normalizers.NFD

( )

NFD Unicode Normalizer

NFKC

class tokenizers.normalizers.NFKC

( )

NFKC Unicode Normalizer

NFKD

class tokenizers.normalizers.NFKD

( )

NFKD Unicode Normalizer
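
The four Unicode normalizers above correspond to the standard Unicode normalization forms. The sketch below illustrates what each form does using Python's standard-library unicodedata module as a stand-in; it is not how the library implements these classes, only a way to see the difference between the forms.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent
ligature = "\ufb01"      # the "fi" ligature, a compatibility character

# Canonical forms: NFC composes, NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# Compatibility forms also replace compatibility characters
# (such as ligatures) with their plain equivalents.
nfkc = unicodedata.normalize("NFKC", ligature)
nfkd = unicodedata.normalize("NFKD", ligature + composed)

# Canonical forms leave the ligature untouched.
nfc_ligature = unicodedata.normalize("NFC", ligature)

print(len(nfc), len(nfd), nfkc, len(nfkd))
```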

Nmt

class tokenizers.normalizers.Nmt

( )

Nmt normalizer

Normalizer

class tokenizers.normalizers.Normalizer

( )

Base class for all normalizers

This class is not supposed to be instantiated directly. Instead, any implementation of a Normalizer will return an instance of this class when instantiated.

normalize

( normalized )

Parameters

normalized (NormalizedString) — The normalized string on which to apply this Normalizer

Normalize a NormalizedString in-place

This method modifies a NormalizedString in place while keeping track of the alignment information. If you just want to see the result of the normalization on a raw string, use normalize_str().
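
A hedged sketch of in-place normalization. It assumes that NormalizedString can be constructed directly from a str and exposes the original and normalized properties, as in recent releases of the Python bindings; check your installed version if this differs.

```python
from tokenizers import NormalizedString
from tokenizers.normalizers import Lowercase

# Wrap the raw text so the original is kept alongside the normalized form.
normalized = NormalizedString("Hello WORLD")

# normalize() mutates its argument and returns nothing.
Lowercase().normalize(normalized)

original_text = normalized.original      # text as it was passed in
normalized_text = normalized.normalized  # text after normalization
print(original_text, "->", normalized_text)
```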

normalize_str

( sequence ) → str

Parameters

sequence (str) — A string to normalize

Returns

str

A string after normalization

Normalize the given string

This method provides a way to visualize the effect of a Normalizer, but it does not keep track of the alignment information. If you need to get or convert offsets, use normalize().
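
As a quick illustration, normalize_str() can be called on any concrete normalizer to preview its effect. The sketch below uses NFKC on a string containing a ligature; the input is an arbitrary example.

```python
from tokenizers.normalizers import NFKC

normalizer = NFKC()

# "\ufb01" is the "fi" ligature; NFKC replaces it with the two letters "f" and "i".
result = normalizer.normalize_str("\ufb01ne")
print(result)
```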

Precompiled

class tokenizers.normalizers.Precompiled

( precompiled_charsmap )

Precompiled normalizer. Do not use it manually; it exists for compatibility with SentencePiece.

Replace

class tokenizers.normalizers.Replace

( pattern, content )

Replace normalizer
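
A short sketch of Replace with both a literal string pattern and a regular expression. The tokenizers.Regex wrapper is used for the regex case; the patterns and inputs are illustrative choices, not taken from this page.

```python
from tokenizers import Regex
from tokenizers.normalizers import Replace

# Literal pattern: every occurrence of the two backticks becomes a double quote.
literal = Replace("``", '"')
quoted = literal.normalize_str("``hi``")

# Regex pattern: collapse any run of whitespace into a single space.
collapse = Replace(Regex(r"\s+"), " ")
collapsed = collapse.normalize_str("a  \t b")

print(quoted, collapsed)
```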

Sequence

class tokenizers.normalizers.Sequence

( normalizers )

Parameters

normalizers (List[Normalizer]) — A list of Normalizer to be run as a sequence

Chains multiple Normalizer instances into a Sequence. The normalizers run one after another, in the given order.
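
Because the normalizers run in the given order, swapping two of them can change the result. The sketch below shows this with an arbitrary pair (Replace and Lowercase) chosen only to make the ordering visible.

```python
from tokenizers.normalizers import Lowercase, Replace, Sequence

# Replace runs first, so it still sees the uppercase "A".
replace_then_lower = Sequence([Replace("A", "b"), Lowercase()])
first = replace_then_lower.normalize_str("AAA")

# Lowercase runs first, so Replace no longer finds any "A".
lower_then_replace = Sequence([Lowercase(), Replace("A", "b")])
second = lower_then_replace.normalize_str("AAA")

print(first, second)
```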

Strip

class tokenizers.normalizers.Strip

( left = True, right = True )

Strip normalizer
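
A minimal sketch of how the left and right flags select which side of the string has its whitespace removed. The input string is an arbitrary example.

```python
from tokenizers.normalizers import Strip

text = "  hello  "

# Defaults strip both sides.
both = Strip().normalize_str(text)

# Restrict stripping to one side with the left/right flags.
left_only = Strip(left=True, right=False).normalize_str(text)
right_only = Strip(left=False, right=True).normalize_str(text)

print(repr(both), repr(left_only), repr(right_only))
```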

StripAccents

class tokenizers.normalizers.StripAccents

( )

StripAccents normalizer
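
StripAccents removes combining marks, so it is typically run after a decomposing normalizer such as NFD. The sketch below pairs the two in a Sequence; on its own, StripAccents leaves a precomposed accented character unchanged, since that character contains no separate combining mark.

```python
from tokenizers.normalizers import NFD, Sequence, StripAccents

# "Héllò hôw are ü?" written with escapes so the accents are precomposed.
text = "H\u00e9ll\u00f2 h\u00f4w are \u00fc?"

# Decompose first, then drop the combining accents.
normalizer = Sequence([NFD(), StripAccents()])
stripped = normalizer.normalize_str(text)

# Without a prior decomposition there is nothing to strip.
alone = StripAccents().normalize_str("\u00e9")

print(stripped)
```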

BertNormalizer

class tokenizers.normalizers.BertNormalizer

( clean_text = True, handle_chinese_chars = True, strip_accents = None, lowercase = True )

Parameters

clean_text (bool, optional, defaults to True) — Whether to clean the text by removing control characters and replacing all whitespace characters with a regular space.
handle_chinese_chars (bool, optional, defaults to True) — Whether to handle Chinese characters by putting spaces around them.
strip_accents (bool, optional) — Whether to strip all accents. If this option is not specified (i.e. left as None), it is determined by the value of lowercase (as in the original BERT).
lowercase (bool, optional, defaults to True) — Whether to lowercase.

BertNormalizer

Takes care of normalizing raw text before giving it to a BERT model. This includes cleaning the text, handling accents and Chinese characters, and lowercasing.
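
The sketch below illustrates how strip_accents follows lowercase when it is left unspecified, and how setting it explicitly overrides that. The input string is an arbitrary example.

```python
from tokenizers.normalizers import BertNormalizer

text = "H\u00e9llo W\u00f6rld"  # "Héllo Wörld"

# Defaults: lowercase=True, so accents are stripped as well.
uncased = BertNormalizer().normalize_str(text)

# lowercase=False and strip_accents unspecified: accents are kept.
cased = BertNormalizer(lowercase=False).normalize_str(text)

# Explicit strip_accents=False: lowercased, accents kept.
lower_keep_accents = BertNormalizer(
    strip_accents=False, lowercase=True
).normalize_str(text)

print(uncased, cased, lower_keep_accents)
```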
