- Normalization: [NFC](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms) (Normalization Form Canonical Composition), Strip, Lowercase
- Pre-tokenization: Whitespace
- Code: [wikitext-wordlevel.py](wikitext-wordlevel.py)
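The pipeline above can be assembled with the Tokenizers library roughly as in the following sketch; the linked [wikitext-wordlevel.py](wikitext-wordlevel.py) script is the authoritative version, and the `[UNK]` token and the tiny in-line training corpus here are illustrative only.

```python
from tokenizers import Tokenizer, normalizers
from tokenizers.models import WordLevel
from tokenizers.normalizers import NFC, Strip, Lowercase
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

# Word-level model with an (illustrative) unknown-token placeholder
tokenizer = Tokenizer(WordLevel(unk_token='[UNK]'))

# Normalization: NFC, Strip, Lowercase — as listed above
tokenizer.normalizer = normalizers.Sequence([NFC(), Strip(), Lowercase()])

# Pre-tokenization: Whitespace
tokenizer.pre_tokenizer = Whitespace()

# Train on a toy corpus; the real script trains on WikiText instead
trainer = WordLevelTrainer(special_tokens=['[UNK]'])
tokenizer.train_from_iterator(["I'll see you soon"], trainer)
```

After training, the tokenizer lowercases its input and splits on whitespace and punctuation, so `"I'll"` becomes the three tokens `i`, `'`, `ll`.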

The tokenizer can be used as simply as follows.

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('dustalov/wikitext-wordlevel')

tokenizer.encode("I'll see you soon").ids     # => [68, 14, 2746, 577, 184, 595]
tokenizer.encode("I'll see you soon").tokens  # => ['i', "'", 'll', 'see', 'you', 'soon']

tokenizer.decode([68, 14, 2746, 577, 184, 595])  # => "i ' ll see you soon"
```