Younes Belkada committed on
Commit
a99e18c
1 Parent(s): 99b5974

update readme

Files changed (1)
  1. README.md +14 -1
README.md CHANGED
@@ -1,3 +1,16 @@
+---
+language: en, ja
+license: mit
+datasets:
+- snow_simplified_japanese_corpus
+tags:
+- ja
+- japanese
+- tokenizer
+widget:
+- text: "誰が一番に着くか私には分かりません。"
+---
+
 # Japanese Dummy Tokenizer
 
 Repository containing a dummy Japanese Tokenizer trained on ```snow_simplified_japanese_corpus``` dataset. The tokenizer has been trained using Hugging Face datasets in a streaming manner.
@@ -16,4 +29,4 @@ tokenizer = AutoTokenizer.from_pretrained("ybelkada/japanese-dummy-tokenizer")
 
 ## How to train the tokenizer
 
-Check the file ```tokenizer.py```, you can freely adapt it to other datasets
+Check the file ```tokenizer.py```, you can freely adapt it to other datasets. This tokenizer is based on the tokenizer from ```csebuetnlp/mT5_multilingual_XLSum```.
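
The README refers to `tokenizer.py` for training, but that file is not part of this commit. As a rough illustration only, streaming-based training along the lines the README describes might look like the sketch below. It is not the author's script: the column name `original_ja`, the `snow_t15` configuration name, the batch size, and the vocabulary size are all assumptions, and the helper names `batch_iterator` and `train_tokenizer` are hypothetical. Loading this dataset may also behave differently depending on the installed `datasets` version.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List, Mapping


def batch_iterator(
    examples: Iterable[Mapping[str, Any]],
    text_column: str = "original_ja",  # assumed column name, not confirmed by the README
    batch_size: int = 1000,            # assumed batch size
) -> Iterator[List[str]]:
    """Yield lists of raw sentences from a (possibly streamed) dataset."""
    iterator = iter(examples)
    while True:
        # Pull at most `batch_size` rows without materialising the whole dataset.
        batch = [example[text_column] for example in islice(iterator, batch_size)]
        if not batch:
            return
        yield batch


def train_tokenizer(vocab_size: int = 32000):
    """Hypothetical training entry point; vocab_size is an assumed value."""
    # Heavy imports stay local so batch_iterator can be used without them.
    from datasets import load_dataset
    from transformers import AutoTokenizer

    # streaming=True returns an iterable dataset that is read lazily.
    stream = load_dataset(
        "snow_simplified_japanese_corpus",
        "snow_t15",  # assumed configuration name
        split="train",
        streaming=True,
    )
    # Reuse the base tokenizer's pipeline, as the README says this one is based on it.
    base = AutoTokenizer.from_pretrained("csebuetnlp/mT5_multilingual_XLSum")
    return base.train_new_from_iterator(batch_iterator(stream), vocab_size=vocab_size)
```

Calling `train_tokenizer()` needs network access plus the `datasets` and `transformers` packages; the result could then be stored with `save_pretrained`. Adapting the sketch to another dataset would mostly mean changing the dataset name and `text_column`.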