dustalov committed on
Commit
b0f9f16
1 Parent(s): ef99e22

Update README.md

Files changed (1)
  1. README.md +12 -0
README.md CHANGED
@@ -23,3 +23,15 @@ This is a simple word-level tokenizer created using the [Tokenizers](https://git
  - Normalization: [NFC](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms) (Normalization Form Canonical Composition), Strip, Lowercase
  - Pre-tokenization: Whitespace
  - Code: [wikitext-wordlevel.py](wikitext-wordlevel.py)
+
+ The tokenizer can be used as follows.
+
+ ```python
+ from tokenizers import Tokenizer
+ tokenizer = Tokenizer.from_pretrained('dustalov/wikitext-wordlevel')
+
+ tokenizer.encode("I'll see you soon").ids # => [68, 14, 2746, 577, 184, 595]
+ tokenizer.encode("I'll see you soon").tokens # => ['i', "'", 'll', 'see', 'you', 'soon']
+
+ tokenizer.decode([68, 14, 2746, 577, 184, 595]) # => "i ' ll see you soon"
+ ```
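
The README lists NFC, Strip, and Lowercase normalization followed by Whitespace pre-tokenization. To illustrate why `"I'll see you soon"` ends up as six lowercase tokens, the sketch below approximates those steps with the Python standard library only. It is not the Tokenizers implementation and is not part of this commit: the helper names `normalize` and `pre_tokenize` are hypothetical, and the regular expression is the pattern that the Tokenizers documentation gives for its Whitespace pre-tokenizer, run here through Python's `re` engine, which may behave differently on some Unicode edge cases.

```python
import re
import unicodedata

# Assumption: this mirrors the documented Whitespace pre-tokenizer pattern,
# i.e. runs of word characters or runs of non-word, non-space characters.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")


def normalize(text: str) -> str:
    """Approximate the NFC -> Strip -> Lowercase normalizer sequence."""
    text = unicodedata.normalize("NFC", text)  # canonical composition
    text = text.strip()                        # drop leading/trailing whitespace
    return text.lower()                        # case-fold to lowercase


def pre_tokenize(text: str) -> list[str]:
    """Approximate Whitespace pre-tokenization on already normalized text."""
    return WORD_OR_PUNCT.findall(text)


tokens = pre_tokenize(normalize("  I'll see you soon "))
print(tokens)  # => ['i', "'", 'll', 'see', 'you', 'soon']
```

The apostrophe becomes its own token because it is neither a word character nor whitespace, which matches the `tokens` output shown in the README example. Mapping each token to an id is a separate vocabulary lookup that this sketch does not attempt.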