---
library_name: tokenizers
license: cc-by-sa-3.0
datasets:
- wikitext
language:
- en
tags:
- tokenizer
- wordlevel
- tokenizers
- wikitext
inference: false
---

# WikiText-WordLevel

This is a simple word-level tokenizer created using the [Tokenizers](https://github.com/huggingface/tokenizers) library. It was trained for educational purposes on the combined train, validation, and test splits of the [WikiText-103](https://huggingface.co/datasets/wikitext) corpus.

- Tokenizer Type: Word-Level
- Vocabulary Size: 75K
- Special Tokens: `<s>` (start of sequence), `</s>` (end of sequence), `<unk>` (unknown token)
- Normalization: [NFC](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms) (Normalization Form Canonical Composition), Strip, Lowercase
- Pre-tokenization: Whitespace
- Code: [wikitext-wordlevel.py](wikitext-wordlevel.py) (a rough training sketch is also shown at the end of this card)

The tokenizer can be used as follows:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('dustalov/wikitext-wordlevel')

tokenizer.encode("I'll see you soon").ids     # => [68, 14, 2746, 577, 184, 595]
tokenizer.encode("I'll see you soon").tokens  # => ['i', "'", 'll', 'see', 'you', 'soon']
tokenizer.decode([68, 14, 2746, 577, 184, 595])  # => "i ' ll see you soon"
```
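
For reference, below is a minimal sketch of how a tokenizer with the configuration described above could be trained with the Tokenizers library. It is not the exact training script (that is [wikitext-wordlevel.py](wikitext-wordlevel.py)); details such as the dataset variant (`wikitext-103-raw-v1`) and the output filename are assumptions.

```python
from datasets import load_dataset
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Assumption: the raw (untokenized) variant of WikiText-103 was used
dataset = load_dataset('wikitext', 'wikitext-103-raw-v1')

# Word-level model with <unk> as the fallback for out-of-vocabulary words
tokenizer = Tokenizer(models.WordLevel(unk_token='<unk>'))

# NFC normalization, whitespace stripping, and lowercasing, as listed above
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFC(),
    normalizers.Strip(),
    normalizers.Lowercase(),
])

# Whitespace pre-tokenizer: splits on whitespace and isolates punctuation
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordLevelTrainer(
    vocab_size=75_000,
    special_tokens=['<s>', '</s>', '<unk>'],
)

def text_iterator():
    # Combine the train, validation, and test splits, as described above
    for split in ('train', 'validation', 'test'):
        for row in dataset[split]:
            yield row['text']

tokenizer.train_from_iterator(text_iterator(), trainer=trainer)
tokenizer.save('tokenizer.json')  # hypothetical output path
```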