huggingface-course/albert-tokenizer-without-normalizer

SaulLu committed on Oct 19, 2021

Commit

67b371b

1 Parent(s): eebcad8

add modified albert tokenizer

Files changed (3)

README.md ADDED

+This repo shows why it is useful to save the normalization operation used during tokenizer training: without it, accented and uppercase characters are not mapped back to the vocabulary and end up as `<unk>`.
+```python
+from transformers import AutoTokenizer
+text = "This is a text with àccënts and CAPITAL LETTERS"
+tokenizer = AutoTokenizer.from_pretrained("albert-large-v2")
+print(tokenizer.convert_ids_to_tokens(tokenizer.encode(text)))
+# ['[CLS]', '▁this', '▁is', '▁a', '▁text', '▁with', '▁accent', 's', '▁and', '▁capital', '▁letters', '[SEP]']
+tokenizer = AutoTokenizer.from_pretrained("huggingface-course/albert-tokenizer-without-normalizer")
+print(tokenizer.convert_ids_to_tokens(tokenizer.encode(text)))
+# ['[CLS]', '▁', '<unk>', 'his', '▁is', '▁a', '▁text', '▁with', '▁', '<unk>', 'cc', '<unk>', 'nts', '▁and', '▁', '<unk>', '▁', '<unk>', '[SEP]']
+```
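
The effect shown in the README can be reproduced without downloading any model. The sketch below is not ALBERT's actual normalizer: it only imitates the two behaviors visible in the output above (lowercasing and accent stripping) with the standard library. `normalize`, `toy_tokenize`, and `TOY_VOCAB` are hypothetical names introduced for illustration, and the whitespace-split "tokenizer" stands in for a real subword model.

```python
import unicodedata

# Hypothetical toy vocabulary: lowercase, unaccented words only,
# as if it had been built from normalized training text.
TOY_VOCAB = {"this", "is", "a", "text", "with", "accents", "and", "capital", "letters"}


def normalize(text: str) -> str:
    """Lowercase, decompose with NFKD, and drop combining marks (accents)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_tokenize(text: str, use_normalizer: bool) -> list[str]:
    """Split on whitespace and map out-of-vocabulary words to '<unk>'."""
    if use_normalizer:
        text = normalize(text)
    return [word if word in TOY_VOCAB else "<unk>" for word in text.split()]


text = "This is a text with àccënts and CAPITAL LETTERS"

print(toy_tokenize(text, use_normalizer=True))
# ['this', 'is', 'a', 'text', 'with', 'accents', 'and', 'capital', 'letters']

print(toy_tokenize(text, use_normalizer=False))
# ['<unk>', 'is', 'a', 'text', 'with', '<unk>', 'and', '<unk>', '<unk>']
```

Because the vocabulary only ever saw normalized text, skipping the same normalization at inference time turns every accented or uppercase word into an unknown token, which is the same pattern the two tokenizers above exhibit at the subword level.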

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

+{
+  "tokenizer_class": "AlbertTokenizer"
+}