Tokenizers documentation
Normalizers
ByteLevel
Lowercase
NFC
NFD
NFKC
NFKD
Nmt
Normalizer
Base class for all normalizers
This class is not supposed to be instantiated directly. Instead, any implementation of a Normalizer will return an instance of this class when instantiated.
normalize
( normalized )
Parameters
- normalized (NormalizedString) — The normalized string on which to apply this Normalizer
Normalize a NormalizedString in-place
This method modifies a NormalizedString in place so that the alignment information is kept track of. If you just want to see the result of the normalization on a raw string, you can use normalize_str().
normalize_str
Normalize the given string
This method provides a way to visualize the effect of a Normalizer, but it does not keep track of the alignment information. If you need to get or convert offsets, you can use normalize().
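The difference between the two methods can be pictured with a toy sketch that does not use the library at all. It relies only on Python's unicodedata module, and the names toy_normalize and toy_normalize_str are hypothetical, chosen for illustration. The first function returns the normalized text together with one source index per output character, which is the kind of alignment information normalize() preserves; the second returns only the text, as normalize_str() does.

```python
import unicodedata


def toy_normalize(text):
    """Toy NFD + accent stripping that also records alignments.

    Returns (normalized_text, alignment), where alignment[j] is the index in
    `text` of the character that produced normalized_text[j].
    This is an illustration only, not the library's implementation.
    """
    out_chars = []
    alignment = []
    for i, ch in enumerate(text):
        # Decompose one input character at a time so its index stays known.
        for piece in unicodedata.normalize("NFD", ch):
            if unicodedata.combining(piece):
                continue  # drop combining marks, i.e. the accents
            out_chars.append(piece)
            alignment.append(i)
    return "".join(out_chars), alignment


def toy_normalize_str(text):
    """String-only variant: same text, alignment information discarded."""
    return toy_normalize(text)[0]


# "e" followed by a separate combining acute accent (U+0301)
decomposed = "Cafe\u0301 au lait"
normalized, alignment = toy_normalize(decomposed)
print(normalized)  # Cafe au lait
print(alignment)   # [0, 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12]
```

Index 4 is missing from the alignment because the combining accent at that position was removed. With this list, an offset in the normalized text can be mapped back to the original string, which the string-only variant cannot do.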
Precompiled
Precompiled normalizer. Do not use it manually; it exists for compatibility with SentencePiece.
Replace
Sequence
Allows concatenating multiple Normalizers as a Sequence. The normalizers run one after another, in the given order.
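A minimal sketch of the idea, written with plain Python functions rather than the library's classes: make_sequence is a hypothetical helper that chains string-to-string steps in the order given. The step functions are stand-ins built on unicodedata and are assumptions for illustration, not the library's code.

```python
import unicodedata


def nfd(text):
    """Canonical decomposition: split accented letters into base + mark."""
    return unicodedata.normalize("NFD", text)


def strip_accents(text):
    """Remove combining marks; only effective on decomposed text."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


def lowercase(text):
    return text.lower()


def make_sequence(steps):
    """Return a function that applies each step in the given order."""
    def run(text):
        for step in steps:
            text = step(text)
        return text
    return run


pipeline = make_sequence([nfd, strip_accents, lowercase])
print(pipeline("H\u00e9ll\u00f2 h\u00f4w are \u00fc?"))  # hello how are u?

# Order matters: stripping before decomposing leaves the accent in place,
# because the precomposed character contains no separate combining mark yet.
wrong_order = make_sequence([strip_accents, nfd])
print(wrong_order("\u00e9") == "e\u0301")  # True
```

With the library itself, the corresponding pipeline would be built from the classes listed on this page, for example Sequence([NFD(), StripAccents(), Lowercase()]).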
Strip
StripAccents
BertNormalizer
class tokenizers.normalizers.BertNormalizer
( clean_text = True, handle_chinese_chars = True, strip_accents = None, lowercase = True )
Parameters
- clean_text (bool, optional, defaults to True) — Whether to clean the text by removing any control characters and replacing all whitespace characters with a standard space.
- handle_chinese_chars (bool, optional, defaults to True) — Whether to handle Chinese characters by putting spaces around them.
- strip_accents (bool, optional) — Whether to strip all accents. If this option is not specified (i.e. left as None), it is determined by the value of lowercase (as in the original Bert).
- lowercase (bool, optional, defaults to True) — Whether to lowercase.
BertNormalizer
Takes care of normalizing raw text before giving it to a Bert model. This includes cleaning the text, handling accents and Chinese characters, and lowercasing.
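The four options can be illustrated with a rough pure-Python approximation. This is a sketch under stated assumptions, not the library's implementation: the function name bert_like_normalize is hypothetical, the order of the steps is assumed, and the Chinese-character check covers only the main CJK Unified Ideographs block.

```python
import unicodedata


def is_cjk(ch):
    # Simplification: only the main CJK Unified Ideographs block.
    return 0x4E00 <= ord(ch) <= 0x9FFF


def bert_like_normalize(text, clean_text=True, handle_chinese_chars=True,
                        strip_accents=None, lowercase=True):
    """Approximate the behavior described by the parameters above."""
    if strip_accents is None:
        # Unspecified: follow the value of `lowercase`.
        strip_accents = lowercase

    if clean_text:
        cleaned = []
        for ch in text:
            if ch.isspace():
                cleaned.append(" ")  # any whitespace becomes a plain space
            elif unicodedata.category(ch).startswith("C"):
                continue  # drop control and other non-printing characters
            else:
                cleaned.append(ch)
        text = "".join(cleaned)

    if handle_chinese_chars:
        text = "".join(f" {ch} " if is_cjk(ch) else ch for ch in text)

    if strip_accents:
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))

    if lowercase:
        text = text.lower()

    return text


print(bert_like_normalize("H\u00e9llo\tWorld\x00"))  # hello world
print(bert_like_normalize("H\u00e9llo", lowercase=False) == "H\u00e9llo")  # True
print(bert_like_normalize("ab\u4e2dc"))  # ab 中 c
```

The second call shows the strip_accents default in action: with lowercase=False and strip_accents left as None, the accent is kept.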